David Yarowsky

UniMorph 4.0: Universal Morphology

May 10, 2022
Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóǧa, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud'hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, Ekaterina Vylomova

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.
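
The released data realize the schema as plain triples: each line pairs a lemma and an inflected form with a semicolon-delimited feature bundle. As a rough illustration, here is a minimal Python sketch of reading such a file into per-lemma paradigms, assuming the standard three-column TSV layout; the file name "eng.tsv" and the sample lookup are hypothetical.

```python
# Minimal sketch of reading UniMorph-style triples, assuming the standard
# three-column TSV layout (lemma, inflected form, semicolon-joined features).
# The path "eng.tsv" is a hypothetical example file.
import csv
from collections import defaultdict

def load_paradigms(path):
    """Group inflected forms by lemma, keyed by their feature bundles."""
    paradigms = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) != 3:  # skip blank or malformed lines
                continue
            lemma, form, features = row
            # A bundle like "V;PST" becomes a hashable key for the cell.
            paradigms[lemma][tuple(features.split(";"))] = form
    return paradigms

# e.g. load_paradigms("eng.tsv")["run"][("V", "PST")] -> "ran"
```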

* LREC 2022; The first two authors made equal contributions 

Induced Inflection-Set Keyword Search in Speech

Oct 27, 2019
Oliver Adams, Matthew Wiesner, Jan Trmal, Garrett Nicolai, David Yarowsky

We investigate the problem of searching for a lexeme-set in speech by searching for its inflectional variants. Experimental results indicate how lexeme-set search performance changes with the number of hypothesized inflections, while ablation experiments highlight the relative importance of different components in the lexeme-set search pipeline. We provide a recipe and evaluation set for the community to use as an extrinsic measure of the performance of inflection generation approaches.
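
At its core, the approach expands a query lexeme into its hypothesized inflections and searches for any of them. The sketch below illustrates that set-expansion idea over plain text transcripts, assuming a precomputed inflection table; the real pipeline operates on keyword-search indexes built from speech, and the table and utterances here are invented examples.

```python
# Illustrative sketch only: flag utterances that contain any hypothesized
# inflection of the query lexeme. The real pipeline searches indexes built
# from speech; the inflection table and utterances below are invented.
HYPOTHESIZED_INFLECTIONS = {
    "walk": ["walk", "walks", "walked", "walking"],
}

def search_lexeme(lexeme, utterances):
    """Return indices of utterances containing any inflected variant."""
    variants = set(HYPOTHESIZED_INFLECTIONS.get(lexeme, [lexeme]))
    return [i for i, utt in enumerate(utterances)
            if variants & set(utt.lower().split())]

print(search_lexeme("walk", ["she walked home", "no match here"]))  # [0]
```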

Modeling Color Terminology Across Thousands of Languages

Oct 03, 2019
Arya D. McCarthy, Winston Wu, Aaron Mueller, Bill Watson, David Yarowsky

There is an extensive history of scholarship into what constitutes a "basic" color term, as well as a broadly attested acquisition sequence of basic color terms across many languages, as articulated in the seminal work of Berlin and Kay (1969). This paper employs a set of diverse measures on massively cross-linguistic data to operationalize and critique the Berlin and Kay color term hypotheses. Collectively, the 14 empirically grounded computational linguistic metrics we design, as well as their aggregation, correlate strongly with both the Berlin and Kay basic/secondary color term partition (gamma = 0.96) and their hypothesized universal acquisition sequence. The measures and results provide further empirical evidence from computational linguistics in support of their claims, as well as additional nuance: they suggest treating the partition as a spectrum instead of a dichotomy.
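
For readers unfamiliar with the reported statistic, Goodman-Kruskal gamma compares concordant and discordant pairs across two rankings. A sketch of the standard definition follows, assuming that is the gamma reported above; the paper's 14 metrics and their aggregation are not reproduced here.

```python
# Goodman-Kruskal gamma: (C - D) / (C + D) over concordant (C) and
# discordant (D) pairs of two paired rankings. Ties count toward neither.
def gk_gamma(xs, ys):
    concordant = discordant = 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if prod > 0:
                concordant += 1
            elif prod < 0:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Perfectly concordant toy rankings give gamma = 1.0:
print(gk_gamma([1, 2, 3, 4], [10, 20, 30, 40]))
```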

* Accepted for presentation at EMNLP-IJCNLP 2019 

Massively Multilingual Adversarial Speech Recognition

Apr 03, 2019
Oliver Adams, Matthew Wiesner, Shinji Watanabe, David Yarowsky

We report on adaptation of multilingual end-to-end speech recognition models trained on as many as 100 languages. Our findings shed light on the relative importance of similarity between the target and pretraining languages along the dimensions of phonetics, phonology, language family, geographical location, and orthography. In this context, experiments demonstrate the effectiveness of two additional pretraining objectives in encouraging language-independent encoder representations: a context-independent phoneme objective paired with a language-adversarial classification objective.
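
A common way to realize a language-adversarial classification objective is a gradient reversal layer between the encoder and a language classifier, so that minimizing the classifier's loss pushes the encoder toward language-independent representations. The PyTorch sketch below shows that general technique (in the style of Ganin and Lempitsky, 2015); it is an illustration under assumed dimensions, not the paper's exact architecture.

```python
# One standard realization of a language-adversarial objective: a gradient
# reversal layer between the encoder and a language classifier, so the
# encoder is trained to *remove* language identity. A sketch of the general
# technique, not the paper's exact model; all dimensions are invented.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign (and scale) of gradients flowing into the encoder.
        return -ctx.lam * grad_output, None

encoder = torch.nn.Linear(80, 256)    # stand-in for the speech encoder
lang_clf = torch.nn.Linear(256, 100)  # classifier over ~100 languages

feats = torch.randn(8, 80)                 # dummy batch of acoustic features
langs = torch.randint(0, 100, (8,))        # dummy language labels
logits = lang_clf(GradReverse.apply(encoder(feats), 1.0))
adv_loss = torch.nn.functional.cross_entropy(logits, langs)
adv_loss.backward()  # encoder receives reversed gradients
```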

* Accepted at NAACL-HLT 2019 

UniMorph 2.0: Universal Morphology

Oct 25, 2018
Christo Kirov, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sebastian Mielke, Arya D. McCarthy, Sandra Kübler, David Yarowsky, Jason Eisner, Mans Hulden

The Universal Morphology (UniMorph) project is a collaborative effort to improve how NLP handles complex morphology across the world's languages. The project releases annotated morphological data using a universal tagset, the UniMorph schema. Each inflected form is associated with a lemma, which typically carries its underlying lexical meaning, and a bundle of morphological features from our schema. Additional supporting data and tools are also released on a per-language basis when available. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland, and is sponsored by the DARPA LORELEI program. This paper details advances made to the collection, annotation, and dissemination of project resources since the initial UniMorph release described at LREC 2016.

* LREC 2018 

The CoNLL-SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

Oct 18, 2018
Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Arya D. McCarthy, Katharina Kann, Sebastian Mielke, Garrett Nicolai, Miikka Silfverberg, David Yarowsky, Jason Eisner, Mans Hulden

The CoNLL-SIGMORPHON 2018 shared task on supervised learning of morphological generation featured data sets from 103 typologically diverse languages. Apart from extending the number of languages involved in earlier supervised tasks of generating inflected forms, this year the shared task also featured a new second task which asked participants to inflect words in sentential context, similar to a cloze task. This second task featured seven languages. Task 1 received 27 submissions and task 2 received 6 submissions. Both tasks featured a low, medium, and high data condition. Nearly all submissions featured a neural component and built on highly ranked systems from the earlier 2017 shared task. In the inflection task (task 1), 41 of the 52 languages present in last year's inflection task showed improvement by the best systems in the low-resource setting. The cloze task (task 2) proved to be difficult, and few submissions managed to consistently improve upon both a simple neural baseline system and a lemma-repeating baseline.
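
The lemma-repeating baseline mentioned above is worth spelling out, since few systems consistently beat it: it simply emits the lemma unchanged for every cell, which is correct whenever the target form is syncretic with the citation form. A minimal sketch, with invented toy data:

```python
# Sketch of the lemma-repeating baseline: ignore context and features and
# emit the lemma as the predicted inflected form. Toy data, invented here.
def lemma_repeat_baseline(lemma, context, features):
    return lemma

gold = [("run", "yesterday she ___", "V;PST", "ran"),
        ("cat", "one ___", "N;SG", "cat")]
correct = sum(lemma_repeat_baseline(l, c, f) == form for l, c, f, form in gold)
print(f"accuracy: {correct / len(gold):.2f}")  # 0.50 on this toy pair
```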

* CoNLL 2018. arXiv admin note: text overlap with arXiv:1706.09031 

Marrying Universal Dependencies and Universal Morphology

Oct 15, 2018
Arya D. McCarthy, Miikka Silfverberg, Ryan Cotterell, Mans Hulden, David Yarowsky

The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages - UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. With compatibility of tags, each project's annotations could be used to validate the other's. Additionally, the availability of both type- and token-level resources would be a boon to tasks such as parsing and homograph disambiguation. To ease this interoperability, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema. We validate our approach by lookup in the UniMorph corpora and find a macro-average of 64.13% recall. We also note incompatibilities due to paucity of data on either side. Finally, we present a critical evaluation of the foundations, strengths, and weaknesses of the two annotation projects.
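
The mapping itself is a deterministic table from UD attribute-value pairs to UniMorph tags. The sketch below shows the shape of such a conversion with a handful of illustrative correspondences; the paper's actual table covers far more features and notes language-specific complications.

```python
# Tiny illustration of a deterministic UD-v2-to-UniMorph feature mapping.
# Only a few sample correspondences are shown; the full mapping is much
# larger and handles language-specific cases separately.
UD_TO_UNIMORPH = {
    ("Number", "Sing"): "SG",
    ("Number", "Plur"): "PL",
    ("Tense", "Past"): "PST",
    ("Tense", "Pres"): "PRS",
    ("VerbForm", "Fin"): "FIN",
}

def convert(ud_feats):
    """Map a UD string like 'Number=Plur|Tense=Past' to UniMorph tags."""
    tags = []
    for pair in ud_feats.split("|"):
        key, value = pair.split("=")
        tags.append(UD_TO_UNIMORPH.get((key, value), f"?{pair}"))
    return ";".join(sorted(tags))

print(convert("Number=Plur|Tense=Past"))  # PL;PST
```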

* UDW18 

Paradigm Completion for Derivational Morphology

Aug 30, 2017
Ryan Cotterell, Ekaterina Vylomova, Huda Khayrallah, Christo Kirov, David Yarowsky

The generation of complex derived word forms has been an overlooked problem in NLP; we fill this gap by applying neural sequence-to-sequence models to the task. We overview the theoretical motivation for a paradigmatic treatment of derivational morphology, and introduce the task of derivational paradigm completion as a parallel to inflectional paradigm completion. State-of-the-art neural models, adapted from the inflection task, are able to learn a range of derivation patterns, and outperform a non-neural baseline by 16.4%. However, due to semantic, historical, and lexical considerations involved in derivational morphology, future work will be needed to achieve performance parity with inflection-generating systems.
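
Concretely, a derivational paradigm cell can be framed exactly like an inflection example: a character sequence plus a tag naming the target slot. A minimal sketch of that encoding, where the tag name NOMINALIZATION is an invented placeholder rather than the paper's tag inventory:

```python
# Sketch of framing a derivational paradigm cell as character-level
# sequence transduction, mirroring the inflection task's input format.
# The tag "NOMINALIZATION" is illustrative, not the paper's inventory.
def encode_example(source, tag):
    """Turn a base word plus derivation tag into a seq2seq source sequence."""
    return list(source) + [f"<{tag}>"]

src = encode_example("decide", "NOMINALIZATION")
tgt = list("decision")
print(src)  # ['d', 'e', 'c', 'i', 'd', 'e', '<NOMINALIZATION>']
print(tgt)  # ['d', 'e', 'c', 'i', 's', 'i', 'o', 'n']
```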

* EMNLP 2017 

CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection in 52 Languages

Jul 04, 2017
Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, Mans Hulden

The CoNLL-SIGMORPHON 2017 shared task on supervised morphological generation required systems to be trained and tested in each of 52 typologically diverse languages. In sub-task 1, submitted systems were asked to predict a specific inflected form of a given lemma. In sub-task 2, systems were given a lemma and some of its specific inflected forms, and asked to complete the inflectional paradigm by predicting all of the remaining inflected forms. Both sub-tasks included high, medium, and low-resource conditions. Sub-task 1 received 24 system submissions, while sub-task 2 received 3 system submissions. Following the success of neural sequence-to-sequence models in the SIGMORPHON 2016 shared task, all but one of the submissions included a neural component. The results show that high performance can be achieved with small training datasets, so long as models have appropriate inductive bias or make use of additional unlabeled data or synthetic data. However, different biasing and data augmentation resulted in disjoint sets of inflected forms being predicted correctly, suggesting that there is room for future improvement.
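
Sub-task 2 can be pictured as filling in a partially observed table: systems see a lemma and a few attested cells and must predict every remaining slot. The sketch below shows that setup with a toy English paradigm and a deliberately naive stand-in predictor.

```python
# Sketch of the sub-task 2 setup: given a lemma and a partial paradigm,
# predict every remaining cell. The toy paradigm, slot inventory, and
# "model" below are placeholders for a trained reinflection system.
def complete_paradigm(lemma, observed, all_slots, predict):
    full = dict(observed)
    for slot in all_slots:
        if slot not in full:
            full[slot] = predict(lemma, slot, observed)
    return full

observed = {"V;PRS;3;SG": "walks", "V;PST": "walked"}
all_slots = ["V;PRS;3;SG", "V;PST", "V;V.PTCP;PRS", "V;NFIN"]
naive = lambda lemma, slot, obs: lemma + "ing" if "PTCP" in slot else lemma
print(complete_paradigm("walk", observed, all_slots, naive))
```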

* CoNLL 2017 

Rule Writing or Annotation: Cost-efficient Resource Usage for Base Noun Phrase Chunking

May 02, 2001
Grace Ngai, David Yarowsky

This paper presents a comprehensive empirical comparison between two approaches for developing a base noun phrase chunker: human rule writing and active learning using interactive real-time human annotation. Several novel variations on active learning are investigated, and underlying cost models for cross-modal machine learning comparison are presented and explored. Results show that it is more efficient and more successful by several measures to train a system using active learning annotation rather than hand-crafted rule writing at a comparable level of human labor investment.
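
The annotation condition rests on a standard uncertainty-sampling loop: train on what has been labeled so far, ask the human to label the examples the model is least confident about, and repeat. A generic sketch of that loop, with the learner, uncertainty score, and annotator left as placeholder callbacks rather than the paper's exact setup:

```python
# Generic uncertainty-sampling loop of the kind the annotation condition
# builds on: train, have a human label the examples the model is least
# sure about, repeat. The learner, scorer, and annotator are placeholder
# callbacks, not the paper's exact chunking setup.
import random

def active_learning(pool, annotate, train, uncertainty, rounds=10, batch=20):
    labeled = []
    model = None
    for _ in range(rounds):
        if model is None:
            chosen = random.sample(pool, min(batch, len(pool)))  # seed round
        else:
            # Most-uncertain-first selection from the unlabeled pool.
            chosen = sorted(pool, key=lambda s: -uncertainty(model, s))[:batch]
        for sent in chosen:
            pool.remove(sent)
            labeled.append((sent, annotate(sent)))  # human-in-the-loop step
        model = train(labeled)
    return model
```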

* Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pages 117-125, Hong Kong (2000)  
* 9 pages, 4 figures, appeared in ACL2000 