Universidad Autónoma de Madrid

Abstract: Named entity recognition (NER) is the very first step in the linguistic processing of any new domain. It is now a common process in BioNLP for English clinical text, but it is still in its infancy in other major languages, as is the case for Spanish. Presented under the umbrella of the PharmaCoNER shared task, this paper describes a very simple method for the annotation and normalization of pharmacological, chemical and, ultimately, biomedical named entities in clinical cases. The system developed for the shared task is based on limited knowledge, collected, structured and munged in a way that lets it clearly outperform the scores obtained in the past by similar dictionary-based systems for English. Along with this revival of knowledge-based methods for NER in subdomains, the paper also highlights the key contribution of resource-based systems to the validation and consolidation of both the annotation guidelines and human annotation practices. In this sense, some of the authors' findings on the overall quality of human-annotated datasets call into question the above-mentioned 'official' results obtained by this system, which ranked second (0.91 F1-score) and first (0.916 F1-score), respectively, in the two PharmaCoNER subtasks.

Abstract: This paper presents a grammar and style checker demonstrator for native writers of Spanish and Greek, developed within the GramCheck project. Besides a brief typology of grammar errors for Spanish, it presents a linguistically motivated approach to detection and diagnosis, based on the generalized use of PROLOG extensions to highly typed unification-based grammars. The demonstrator, which currently offers full coverage of agreement errors and certain head-argument relation issues, also provides correction by means of an analysis-transfer-synthesis cycle. Finally, future extensions to the current system are discussed.

Abstract: This paper describes work performed within the CRATER (Corpus Resources And Terminology ExtRaction, MLAP-93/20) project, funded by the Commission of the European Communities. In particular, it addresses the issue of adapting the Xerox Tagger to Spanish in order to tag the Spanish version of the ITU (International Telecommunications Union) corpus. The model implemented by this tagger is briefly presented, along with some modifications made to it so that it can use some parameters that are not probabilistically estimated. Initial decisions, such as the tagset, the lexicon and the training corpus, are also discussed. Finally, results are presented and the benefits of the mixed model are justified.

Abstract: This working paper describes the Spanish tagset to be used in the context of CRATER, a CEC-funded project aiming at the creation of a multilingual (English, French, Spanish) aligned corpus using the International Telecommunications Union corpus. To this end, each version of the corpus will be (or is currently being) tagged. The Xerox PARC tagger will be adapted to Spanish in order to tag the Spanish version. This tagset has been devised as the ideal one for Spanish, and has been posted to several lists in order to get feedback on it.