Daniel Zeman

Towards Generating Automatic Anaphora Annotations

Mar 12, 2025

Dima Taji, Daniel Zeman

Abstract: Training models that can perform well on various NLP tasks requires large amounts of data, and this becomes more apparent with nuanced tasks such as anaphora and coreference resolution. To combat the prohibitive costs of creating manual gold annotated data, this paper explores two methods to automatically create datasets with coreferential annotations: direct conversion from existing datasets, and parsing using multilingual models capable of handling new and unseen languages. The paper details the current progress on those two fronts, as well as the challenges the efforts currently face, and our approach to overcoming these challenges.

* 6 pages, 0 figures, 2 tables

Findings of the Third Shared Task on Multilingual Coreference Resolution

Oct 21, 2024

Michal Novák, Barbora Dohnalová, Miloslav Konopík, Anna Nedoluzhko, Martin Popel, Ondřej Pražák, Jakub Sido, Milan Straka, Zdeněk Žabokrtský, Daniel Zeman

Abstract: The paper presents an overview of the third edition of the shared task on multilingual coreference resolution, held as part of the CRAC 2024 workshop. Similarly to the previous two editions, the participants were challenged to develop systems capable of identifying mentions and clustering them based on identity coreference. This year's edition took another step towards real-world application by not providing participants with gold slots for zero anaphora, increasing the task's complexity and realism. In addition, the shared task was expanded to include a more diverse set of languages, with a particular focus on historical languages. The training and evaluation data were drawn from version 1.2 of the multilingual collection of harmonized coreference resources CorefUD, encompassing 21 datasets across 15 languages. Six systems competed in this shared task.

* Accepted to CRAC 2024

A Unified Taxonomy of Deep Syntactic Relations

Mar 21, 2023

Kira Droganova, Daniel Zeman

Abstract: This paper analyzes multiple deep-syntactic frameworks with the goal of creating a proposal for a set of universal semantic role labels. The proposal examines various theoretical linguistic perspectives and focuses on the Meaning-Text Theory and Functional Generative Description frameworks. For the purpose of this research, data from four languages is used: Spanish and Catalan (Taule et al., 2011), Czech (Hajic et al., 2017), and English (Hajic et al., 2012). This proposal is oriented towards Universal Dependencies (de Marneffe et al., 2021) with a further intention of applying the universal semantic role labels to the UD data.

Findings of the Shared Task on Multilingual Coreference Resolution

Sep 16, 2022

Zdeněk Žabokrtský, Miloslav Konopík, Anna Nedoluzhko, Michal Novák, Maciej Ogrodniczuk, Martin Popel, Ondřej Pražák, Jakub Sido, Daniel Zeman, Yilun Zhu

Abstract: This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were supposed to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used as the source of training and evaluation data. The CoNLL score used in previous coreference-oriented shared tasks was used as the main evaluation metric. There were 8 coreference prediction systems submitted by 5 participating teams; in addition, there was a competitive Transformer-based baseline system provided by the organizers at the beginning of the shared task. The winning system outperformed the baseline by 12 percentage points (in terms of the CoNLL scores averaged across all datasets for individual languages).

Predicting Typological Features in WALS using Language Embeddings and Conditional Probabilities: ÚFAL Submission to the SIGTYP 2020 Shared Task

Oct 08, 2020

Martin Vastl, Daniel Zeman, Rudolf Rosa

Abstract: We present our submission to the SIGTYP 2020 Shared Task on the prediction of typological features. We submit a constrained system, predicting typological features based only on the WALS database. We investigate two approaches. The simpler of the two is a system based on estimating the correlation of feature values within languages by computing conditional probabilities and mutual information. The second approach is to train a neural predictor operating on precomputed language embeddings based on WALS features. Our submitted system combines the two approaches based on their self-estimated confidence scores. We reach an accuracy of 70.7% on the test data and rank first in the shared task.
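
The conditional-probability idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the feature names, the toy language data, and the helper functions `conditional_counts` and `predict` are all hypothetical, and the sketch conditions on a single known feature only, without the mutual-information step.

```python
from collections import Counter, defaultdict

# Toy data invented for illustration (not taken from WALS).
LANGS = {
    "lang_a": {"order": "SOV", "adposition": "post"},
    "lang_b": {"order": "SOV", "adposition": "post"},
    "lang_c": {"order": "SVO", "adposition": "pre"},
    "lang_d": {"order": "SVO", "adposition": "pre"},
    "lang_e": {"order": "SOV", "adposition": "pre"},
}


def conditional_counts(langs, known, target):
    """Count target-feature values for each value of the known feature."""
    table = defaultdict(Counter)
    for feats in langs.values():
        # Only languages that have both features contribute evidence.
        if known in feats and target in feats:
            table[feats[known]][feats[target]] += 1
    return table


def predict(langs, known, known_value, target):
    """Return the most frequent target value and its estimated
    conditional probability P(target = value | known = known_value)."""
    counts = conditional_counts(langs, known, target)[known_value]
    total = sum(counts.values())
    if total == 0:
        return None, 0.0  # no evidence for this known value
    value, n = counts.most_common(1)[0]
    return value, n / total


value, prob = predict(LANGS, "order", "SOV", "adposition")
print(value, round(prob, 3))
```

In this toy setting the estimated probability doubles as a crude confidence score, which is the kind of quantity a combined system could use to choose between predictors.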

Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

Apr 22, 2020

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman

Abstract: Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.

* LREC 2020

Automatic Extraction of Subcategorization Frames for Czech

Sep 08, 2000

Anoop Sarkar, Daniel Zeman

Abstract: We present some novel machine learning techniques for the identification of subcategorization information for verbs in Czech. We compare three different statistical techniques applied to this problem. We show how the learning algorithm can be used to discover previously unknown subcategorization frames from the Czech Prague Dependency Treebank. The algorithm can then be used to label dependents of a verb in the Czech treebank as either arguments or adjuncts. Using our techniques, we are able to achieve 88% precision on unseen parsed text.

* Proceedings of the 18th International Conference on Computational Linguistics (Coling 2000), Universit
* 7 pages. Another version under the name "Learning Verb Subcategorization from Corpora: Counting Frame Subsets", authors: Zeman, Sarkar, in proceedings of LREC 2000, Athens, Greece
