Alert button
Picture for Daniel Zeman

Daniel Zeman

Alert button

A Unified Taxonomy of Deep Syntactic Relations

Mar 21, 2023
Kira Droganova, Daniel Zeman

Figure 1 for A Unified Taxonomy of Deep Syntactic Relations
Figure 2 for A Unified Taxonomy of Deep Syntactic Relations
Figure 3 for A Unified Taxonomy of Deep Syntactic Relations
Figure 4 for A Unified Taxonomy of Deep Syntactic Relations

This paper analyzes multiple deep-syntactic frameworks with the goal of creating a proposal for a set of universal semantic role labels. The proposal examines various theoretic linguistic perspectives and focuses on Meaning-Text Theory and Functional Generative Description frameworks. For the purpose of this research, data from four languages is used -- Spanish and Catalan (Taule et al., 2011), Czech (Hajic et al., 2017), and English (Hajic et al., 2012). This proposal is oriented towards Universal Dependencies (de Marneffe et al., 2021) with a further intention of applying the universal semantic role labels to the UD data.

Viaarxiv icon

Findings of the Shared Task on Multilingual Coreference Resolution

Sep 16, 2022
Zdeněk Žabokrtský, Miloslav Konopík, Anna Nedoluzhko, Michal Novák, Maciej Ogrodniczuk, Martin Popel, Ondřej Pražák, Jakub Sido, Daniel Zeman, Yilun Zhu

Figure 1 for Findings of the Shared Task on Multilingual Coreference Resolution
Figure 2 for Findings of the Shared Task on Multilingual Coreference Resolution
Figure 3 for Findings of the Shared Task on Multilingual Coreference Resolution
Figure 4 for Findings of the Shared Task on Multilingual Coreference Resolution

This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were supposed to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used as the source of training and evaluation data. The CoNLL score used in previous coreference-oriented shared tasks was used as the main evaluation metric. There were 8 coreference prediction systems submitted by 5 participating teams; in addition, there was a competitive Transformer-based baseline system provided by the organizers at the beginning of the shared task. The winner system outperformed the baseline by 12 percentage points (in terms of the CoNLL scores averaged across all datasets for individual languages).

Viaarxiv icon

Predicting Typological Features in WALS using Language Embeddings and Conditional Probabilities: ÚFAL Submission to the SIGTYP 2020 Shared Task

Oct 08, 2020
Martin Vastl, Daniel Zeman, Rudolf Rosa

Figure 1 for Predicting Typological Features in WALS using Language Embeddings and Conditional Probabilities: ÚFAL Submission to the SIGTYP 2020 Shared Task
Figure 2 for Predicting Typological Features in WALS using Language Embeddings and Conditional Probabilities: ÚFAL Submission to the SIGTYP 2020 Shared Task
Figure 3 for Predicting Typological Features in WALS using Language Embeddings and Conditional Probabilities: ÚFAL Submission to the SIGTYP 2020 Shared Task

We present our submission to the SIGTYP 2020 Shared Task on the prediction of typological features. We submit a constrained system, predicting typological features only based on the WALS database. We investigate two approaches. The simpler of the two is a system based on estimating correlation of feature values within languages by computing conditional probabilities and mutual information. The second approach is to train a neural predictor operating on precomputed language embeddings based on WALS features. Our submitted system combines the two approaches based on their self-estimated confidence scores. We reach the accuracy of 70.7% on the test data and rank first in the shared task.

Viaarxiv icon

Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

Apr 22, 2020
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajič, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, Daniel Zeman

Figure 1 for Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection
Figure 2 for Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection
Figure 3 for Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection
Figure 4 for Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection

Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.

* LREC 2020 
Viaarxiv icon

Automatic Extraction of Subcategorization Frames for Czech

Sep 08, 2000
Anoop Sarkar, Daniel Zeman

Figure 1 for Automatic Extraction of Subcategorization Frames for Czech
Figure 2 for Automatic Extraction of Subcategorization Frames for Czech
Figure 3 for Automatic Extraction of Subcategorization Frames for Czech
Figure 4 for Automatic Extraction of Subcategorization Frames for Czech

We present some novel machine learning techniques for the identification of subcategorization information for verbs in Czech. We compare three different statistical techniques applied to this problem. We show how the learning algorithm can be used to discover previously unknown subcategorization frames from the Czech Prague Dependency Treebank. The algorithm can then be used to label dependents of a verb in the Czech treebank as either arguments or adjuncts. Using our techniques, we ar able to achieve 88% precision on unseen parsed text.

* Proceedings of the 18th International Conference on Computational Linguistics (Coling 2000), Universit  
* 7 pages. Another version under the name "Learning Verb Subcategorization from Corpora: Counting Frame Subsets", authors: Zeman, Sarkar, in proceedings of LREC 2000, Athens, Greece 
Viaarxiv icon