Jan Hajič

Quality and Efficiency of Manual Annotation: Pre-annotation Bias

Jun 15, 2023
Marie Mikulová, Milan Straka, Jan Štěpánek, Barbora Štěpánková, Jan Hajič

This paper analyzes the use of automatic pre-annotation for a task of mid-level annotation complexity: dependency syntax annotation. It compares the effort of annotators working from a pre-annotated version (produced by a high-accuracy parser) with that of fully manual annotation. The aim of the experiment is to judge the final annotation quality when pre-annotation is used. In addition, it evaluates the effect of automatic, linguistically based (rule-formulated) checks, and of a second annotation of the same data being available to the annotators, on annotation quality and efficiency. The experiment confirmed that pre-annotation is an efficient tool for faster manual syntactic annotation: it increases the consistency of the resulting annotation without reducing its quality.

* Published in LREC 2022 

Extending an Event-type Ontology: Adding Verbs and Classes Using Fine-tuned LLMs Suggestions

Jun 03, 2023
Jana Straková, Eva Fučíková, Jan Hajič, Zdeňka Urešová

In this project, we investigated the use of advanced machine learning methods, specifically fine-tuned large language models, for pre-annotating data in a lexical extension task: adding descriptive words (verbs) to an existing, as yet incomplete, ontology of event types. We focused on several research questions: whether heuristics can at least hint to annotators which verbs belong to the current version of the ontology and which fall outside it, and whether the automatic scores can help annotators efficiently find a threshold for identifying verbs that cannot be assigned to any existing class and should therefore serve as seeds for new classes. We also carefully examined the correlation of the automatic scores with the human annotation. While the correlation turned out to be strong, its influence on the annotation proper is modest due to its near linearity, even though the mere presence of such pre-annotation leads to relatively short annotation times.

* Accepted to LAW-XVII @ ACL 2023 
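
The thresholding heuristic described in the abstract can be illustrated with a short sketch. The function, the scores, and the threshold below are hypothetical illustrations, not the paper's actual fine-tuned LLM scoring:

    # Sketch: flag verbs whose best class-membership score falls below a
    # threshold as candidate seeds for new ontology classes. All values
    # here are invented for illustration.
    def split_by_threshold(verb_scores, threshold=0.5):
        """verb_scores maps each verb to its best score against existing classes."""
        assignable = {v: s for v, s in verb_scores.items() if s >= threshold}
        seeds = [v for v, s in verb_scores.items() if s < threshold]
        return assignable, seeds

    # Hypothetical scores, standing in for a fine-tuned LLM's output.
    scores = {"buy": 0.91, "purchase": 0.88, "defenestrate": 0.12}
    assignable, seeds = split_by_threshold(scores)
    print(assignable)  # {'buy': 0.91, 'purchase': 0.88}
    print(seeds)       # ['defenestrate'] -> seed for a possible new class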

Prague Dependency Treebank -- Consolidated 1.0

Jun 05, 2020
Jan Hajič, Eduard Bejček, Jaroslava Hlaváčová, Marie Mikulová, Milan Straka, Jan Štěpánek, Barbora Štěpánková

We present a richly annotated and genre-diversified language resource, the Prague Dependency Treebank-Consolidated 1.0 (PDT-C 1.0), whose purpose is, as has always been the case for the family of Prague Dependency Treebanks, to serve both as training data for various types of NLP tasks and as a basis for linguistically oriented research. PDT-C 1.0 contains four different datasets of Czech, uniformly annotated using the standard PDT scheme (although not everything is annotated manually, as we describe in detail here). The texts come from different sources: daily newspaper articles, a Czech translation of the Wall Street Journal, transcribed dialogs, and a small amount of user-generated, short, often non-standard language segments typed into a web translator. Altogether, the treebank contains around 180,000 sentences with their morphological, surface-syntactic, and deep-syntactic annotation. The diversity of the texts and annotations should serve NLP applications well, and the treebank is also an invaluable resource for linguistic research, including comparative studies of texts of different genres. The corpus is publicly and freely available.

* Accepted at LREC 2020 (Proceedings of Language Resources and Evaluation, Marseille, France) 

Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

Apr 22, 2020
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman

Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists of a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments, and modifiers. In this paper, we describe version 2 of the guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.

* LREC 2020 
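
The layers described in the abstract map directly onto columns of the CoNLL-U format in which UD treebanks are distributed. A minimal reading sketch in Python, with an invented and simplified two-token sentence:

    # Sketch: read CoNLL-U, the distribution format of UD treebanks.
    # Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
    sample = (
        "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
        "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    )

    for line in sample.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        print(f"{cols[0]}: {cols[1]} (lemma={cols[2]}, upos={cols[3]}, "
              f"feats={cols[5]}) --{cols[7]}--> head {cols[6]}")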

The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe

Mar 30, 2020
Georg Rehm, Katrin Marheinecke, Stefanie Hegele, Stelios Piperidis, Kalina Bontcheva, Jan Hajič, Khalid Choukri, Andrejs Vasiļjevs, Gerhard Backfried, Christoph Prinz, José Manuel Gómez Pérez, Luc Meertens, Paul Lukowicz, Josef van Genabith, Andrea Lösch, Philipp Slusallek, Morten Irgens, Patrick Gatellier, Joachim Köhler, Laure Le Bars, Dimitra Anastasiou, Albina Auksoriūtė, Núria Bel, António Branco, Gerhard Budin, Walter Daelemans, Koenraad De Smedt, Radovan Garabík, Maria Gavriilidou, Dagmar Gromann, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Jan Odijk, Maciej Ogrodniczuk, Eiríkur Rögnvaldsson, Mike Rosner, Bolette Sandford Pedersen, Inguna Skadiņa, Marko Tadić, Dan Tufiş, Tamás Váradi, Kadri Vider, Andy Way, François Yvon

Multilingualism is a cultural cornerstone of Europe, firmly anchored in the European treaties, including full language equality. However, language barriers impacting business as well as cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe's specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI, with its many opportunities and synergies but also misconceptions, has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions, and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We give a brief overview of the main LT-related activities at the EU level over the last ten years and develop strategic guidance with regard to four key dimensions.

* Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). To appear 

Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER

Sep 08, 2019
Milan Straka, Jana Straková, Jan Hajič

Contextualized embeddings, which capture appropriate word meaning depending on context, have recently been proposed. We evaluate two methods for precomputing such embeddings, BERT and Flair, on four Czech text processing tasks: part-of-speech (POS) tagging, lemmatization, dependency parsing, and named entity recognition (NER). The first three tasks are evaluated on two corpora: the Prague Dependency Treebank 3.5 and Universal Dependencies 2.3. NER is evaluated on the Czech Named Entity Corpus 1.1 and 2.0. We report state-of-the-art results for all of the above tasks and corpora.
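
A minimal sketch of precomputing such contextual embeddings with the Hugging Face transformers library; the multilingual model name is one plausible choice, and subword-to-word pooling is left out, so this is an illustration under stated assumptions rather than the paper's exact setup:

    # Sketch: precompute contextual subword embeddings with a BERT model.
    # Requires: pip install torch transformers. Model choice is illustrative.
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "bert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    model.eval()

    inputs = tokenizer("Prague is the capital of the Czech Republic.",
                       return_tensors="pt")
    with torch.no_grad():
        states = model(**inputs).last_hidden_state  # one vector per subword
    print(states.shape)  # torch.Size([1, number_of_subwords, 768])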

Evaluating Contextualized Embeddings on 54 Languages in POS Tagging, Lemmatization and Dependency Parsing

Aug 20, 2019
Milan Straka, Jana Straková, Jan Hajič

We present an extensive evaluation of three recently proposed methods for contextualized embeddings on 89 corpora in 54 languages of Universal Dependencies 2.3, in three tasks: POS tagging, lemmatization, and dependency parsing. Employing BERT, Flair, and ELMo as pretrained embedding inputs in a strong baseline of UDPipe 2.0, one of the best-performing systems of the CoNLL 2018 Shared Task and the overall winner of EPE 2018, we present a one-to-one comparison of the three contextualized word embedding methods, as well as a comparison with word2vec-like pretrained embeddings and with end-to-end character-level word embeddings. We report state-of-the-art results in all three tasks compared to the results on UD 2.2 in the CoNLL 2018 Shared Task.
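
One plausible way to feed a frozen contextual embedding together with word2vec-like and character-level representations into a single baseline is simple concatenation. The sketch below assumes that scheme with illustrative dimensions; it does not reproduce the actual UDPipe 2.0 architecture:

    # Sketch: concatenate a precomputed contextual embedding with trainable
    # word embeddings and character-level word representations. All
    # dimensions are illustrative assumptions.
    import torch
    import torch.nn as nn

    class CombinedInput(nn.Module):
        def __init__(self, vocab_size, contextual_dim=768, word_dim=300,
                     char_dim=128):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, word_dim)

        def forward(self, word_ids, contextual, char_repr):
            # contextual: frozen BERT/Flair/ELMo vectors, one per word
            # char_repr: end-to-end character-level word representations
            return torch.cat([contextual, self.word_emb(word_ids), char_repr],
                             dim=-1)

    layer = CombinedInput(vocab_size=10000)
    out = layer(torch.randint(0, 10000, (2, 5)),  # batch of 2 sentences, 5 words
                torch.randn(2, 5, 768), torch.randn(2, 5, 128))
    print(out.shape)  # torch.Size([2, 5, 1196])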

UDPipe at SIGMORPHON 2019: Contextualized Embeddings, Regularization with Morphological Categories, Corpora Merging

Aug 19, 2019
Milan Straka, Jana Straková, Jan Hajič

We present our contribution to the SIGMORPHON 2019 Shared Task: Crosslinguality and Context in Morphology, Task 2: contextual morphological analysis and lemmatization. We submitted a modification of UDPipe 2.0, one of the best-performing systems of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies and the overall winner of The 2018 Shared Task on Extrinsic Parser Evaluation. As our first improvement, we use pretrained contextualized embeddings (BERT) as additional inputs to the network; secondly, we use individual morphological features as regularization; and finally, we merge selected corpora of the same language. In the lemmatization task, our system outperforms all other submitted systems by a wide margin, with a lemmatization accuracy of 95.78 (the second best was 95.00, the third 94.46). In morphological analysis, our system placed a close second: our accuracy was 93.19, the winning system's 93.23.

* Accepted by SIGMORPHON 2019: 16th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology 
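
The second improvement, individual morphological features as regularization, can be sketched as auxiliary classification heads whose losses are added to the main tagging loss. The categories, dimensions, and loss weighting below are illustrative assumptions, not the submitted system's exact configuration:

    # Sketch: regularize a full-tag classifier with auxiliary losses over
    # individual morphological categories (here Case and Number).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureRegularizedTagger(nn.Module):
        def __init__(self, hidden=256, n_tags=1000, n_case=8, n_number=4):
            super().__init__()
            self.tag_head = nn.Linear(hidden, n_tags)       # main objective
            self.case_head = nn.Linear(hidden, n_case)      # auxiliary head
            self.number_head = nn.Linear(hidden, n_number)  # auxiliary head

        def loss(self, h, tag_gold, case_gold, number_gold, aux_weight=0.1):
            main = F.cross_entropy(self.tag_head(h), tag_gold)
            aux = (F.cross_entropy(self.case_head(h), case_gold)
                   + F.cross_entropy(self.number_head(h), number_gold))
            return main + aux_weight * aux  # auxiliary terms regularize the encoder

    model = FeatureRegularizedTagger()
    h = torch.randn(32, 256)  # encoder states for 32 tokens
    print(float(model.loss(h, torch.randint(0, 1000, (32,)),
                           torch.randint(0, 8, (32,)),
                           torch.randint(0, 4, (32,)))))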

Neural Architectures for Nested NER through Linearization

Aug 19, 2019
Jana Straková, Milan Straka, Jan Hajič

We propose two neural network architectures for nested named entity recognition (NER), a setting in which named entities may overlap and may be labeled with more than one label. We encode the nested labels using a linearized scheme. In our first proposed approach, the nested labels are modeled as multilabels corresponding to the Cartesian product of the nested labels, in a standard LSTM-CRF architecture. In the second, nested NER is viewed as a sequence-to-sequence problem in which the input sequence consists of the tokens and the output sequence of their labels, using hard attention on the word whose label is being predicted. The proposed methods outperform the nested NER state of the art on four corpora: ACE-2004, ACE-2005, GENIA, and Czech CNEC. We also enrich our architectures with the recently published contextual embeddings ELMo, BERT, and Flair, reaching further improvements on the four nested entity corpora. In addition, we report state-of-the-art flat NER results for CoNLL-2002 Dutch and Spanish and for CoNLL-2003 English.

* Accepted by ACL 2019 
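
The multilabel encoding of the first approach can be illustrated by joining the BIO tags of all entities covering a token; the joined strings then form the label vocabulary, a subset of the Cartesian product of the flat labels. The example sentence, the separator, and the outer-to-inner ordering are illustrative choices:

    # Sketch: encode nested entity spans as per-token joined BIO multilabels.
    def nested_bio(tokens, spans):
        """spans: list of (start, end_exclusive, type), outermost first."""
        labels = [[] for _ in tokens]
        for start, end, etype in spans:
            for i in range(start, end):
                labels[i].append(("B-" if i == start else "I-") + etype)
        return ["|".join(l) if l else "O" for l in labels]

    tokens = ["Bank", "of", "China", "announced"]
    spans = [(0, 3, "ORG"), (2, 3, "GPE")]  # "China" nested in "Bank of China"
    print(nested_bio(tokens, spans))
    # ['B-ORG', 'I-ORG', 'I-ORG|B-GPE', 'O']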

LemmaTag: Jointly Tagging and Lemmatizing for Morphologically-Rich Languages with BRNNs

Aug 27, 2018
Daniel Kondratyuk, Tomáš Gavenčiak, Milan Straka, Jan Hajič

We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences, using bidirectional RNNs with character-level and word-level embeddings. We demonstrate that both tasks benefit from sharing the encoding part of the network, from predicting tag subcategories, and from using the tagger output as an input to the lemmatizer. We evaluate our model on several languages with complex morphology and surpass state-of-the-art accuracy in both part-of-speech tagging and lemmatization in Czech, German, and Arabic.

* 8 pages, 3 figures. Submitted to EMNLP 2018 
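
A minimal sketch of the shared-encoder idea: one bidirectional RNN feeds both a tagging head and a lemmatization head, with the tagger's output appended to the lemmatizer's input. Classifying lemmas into a fixed rule set here is a simplification of the paper's character-level lemma generation, and all dimensions are illustrative:

    # Sketch: shared BiLSTM encoder with a tag head whose (soft) predictions
    # are fed to the lemma head together with the encoder states.
    import torch
    import torch.nn as nn

    class JointTaggerLemmatizer(nn.Module):
        def __init__(self, vocab=5000, emb=128, hidden=128, n_tags=100,
                     n_rules=500):
            super().__init__()
            self.emb = nn.Embedding(vocab, emb)
            self.encoder = nn.LSTM(emb, hidden, bidirectional=True,
                                   batch_first=True)
            self.tag_head = nn.Linear(2 * hidden, n_tags)
            # Lemma head sees encoder states plus the tag distribution.
            self.lemma_head = nn.Linear(2 * hidden + n_tags, n_rules)

        def forward(self, word_ids):
            states, _ = self.encoder(self.emb(word_ids))
            tag_logits = self.tag_head(states)
            lemma_in = torch.cat([states, tag_logits.softmax(-1)], dim=-1)
            return tag_logits, self.lemma_head(lemma_in)

    model = JointTaggerLemmatizer()
    tags, lemmas = model(torch.randint(0, 5000, (2, 6)))
    print(tags.shape, lemmas.shape)  # torch.Size([2, 6, 100]) torch.Size([2, 6, 500])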