Alert button
Picture for Aryaman Arora

Aryaman Arora

Alert button

IruMozhi: Automatically classifying diglossia in Tamil

Nov 13, 2023
Kabilan Prasanna, Aryaman Arora

Tamil, a Dravidian language of South Asia, is a highly diglossic language with two very different registers in everyday use: Literary Tamil (preferred in writing and formal communication) and Spoken Tamil (confined to speech and informal media). Spoken Tamil is under-supported in modern NLP systems. In this paper, we release IruMozhi, a human-annotated dataset of parallel text in Literary and Spoken Tamil. We train classifiers on the task of identifying which variety a text belongs to. We use these models to gauge the availability of pretraining data in Spoken Tamil, to audit the composition of existing labelled datasets for Tamil, and to encourage future work on the variety.

* 4 pages main text, 7 total 
Viaarxiv icon

Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP

Aug 27, 2023
Vedant Palit, Rohan Pandey, Aryaman Arora, Paul Pu Liang

Figure 1 for Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
Figure 2 for Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
Figure 3 for Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
Figure 4 for Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP

Mechanistic interpretability seeks to understand the neural mechanisms that enable specific behaviors in Large Language Models (LLMs) by leveraging causality-based methods. While these approaches have identified neural circuits that copy spans of text, capture factual knowledge, and more, they remain unusable for multimodal models since adapting these tools to the vision-language domain requires considerable architectural changes. In this work, we adapt a unimodal causal tracing tool to BLIP to enable the study of the neural mechanisms underlying image-conditioned text generation. We demonstrate our approach on a visual question answering dataset, highlighting the causal relevance of later layer representations for all tokens. Furthermore, we release our BLIP causal tracing tool as open source to enable further experimentation in vision-language mechanistic interpretability by the community. Our code is available at https://github.com/vedantpalit/Towards-Vision-Language-Mechanistic-Interpretability.

* Final version for 5th Workshop on Closing the Loop Between Vision and Language (CLVL) @ ICCV 2023. 4 pages, 5 figures 
Viaarxiv icon

Jambu: A historical linguistic database for South Asian languages

Jun 05, 2023
Aryaman Arora, Adam Farris, Samopriya Basu, Suresh Kolichala

Figure 1 for Jambu: A historical linguistic database for South Asian languages
Figure 2 for Jambu: A historical linguistic database for South Asian languages
Figure 3 for Jambu: A historical linguistic database for South Asian languages
Figure 4 for Jambu: A historical linguistic database for South Asian languages

We introduce Jambu, a cognate database of South Asian languages which unifies dozens of previous sources in a structured and accessible format. The database includes 287k lemmata from 602 lects, grouped together in 23k sets of cognates. We outline the data wrangling necessary to compile the dataset and train neural models for reflex prediction on the Indo-Aryan subset of the data. We hope that Jambu is an invaluable resource for all historical linguists and Indologists, and look towards further improvement and expansion of the database.

* 5 pages main text, 10 pages total. To appear at SIGMORPHON 
Viaarxiv icon

CGELBank Annotation Manual v1.0

May 27, 2023
Brett Reynolds, Nathan Schneider, Aryaman Arora

Figure 1 for CGELBank Annotation Manual v1.0
Figure 2 for CGELBank Annotation Manual v1.0

CGELBank is a treebank and associated tools based on a syntactic formalism for English derived from the Cambridge Grammar of the English Language. This document lays out the particularities of the CGELBank annotation scheme.

Viaarxiv icon

Localizing Model Behavior with Path Patching

Apr 12, 2023
Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, Aryaman Arora

Figure 1 for Localizing Model Behavior with Path Patching
Figure 2 for Localizing Model Behavior with Path Patching
Figure 3 for Localizing Model Behavior with Path Patching
Figure 4 for Localizing Model Behavior with Path Patching

Localizing behaviors of neural networks to a subset of the network's components or a subset of interactions between components is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad-hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses expressing that behaviors are localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open source a framework for efficiently running similar experiments.

* 20 pages, 16 figures 
Viaarxiv icon

CGELBank: CGEL as a Framework for English Syntax Annotation

Oct 01, 2022
Brett Reynolds, Aryaman Arora, Nathan Schneider

Figure 1 for CGELBank: CGEL as a Framework for English Syntax Annotation
Figure 2 for CGELBank: CGEL as a Framework for English Syntax Annotation
Figure 3 for CGELBank: CGEL as a Framework for English Syntax Annotation
Figure 4 for CGELBank: CGEL as a Framework for English Syntax Annotation

We introduce the syntactic formalism of the \textit{Cambridge Grammar of the English Language} (CGEL) to the world of treebanking through the CGELBank project. We discuss some issues in linguistic analysis that arose in adapting the formalism to corpus annotation, followed by quantitative and qualitative comparisons with parallel UD and PTB treebanks. We argue that CGEL provides a good tradeoff between comprehensiveness of analysis and usability for annotation, which motivates expanding the treebank with automatic conversion in the future.

* 11 pages (8 main text) 
Viaarxiv icon

The SIGMORPHON 2022 Shared Task on Morpheme Segmentation

Jun 15, 2022
Khuyagbaatar Batsuren, Gábor Bella, Aryaman Arora, Viktor Martinović, Kyle Gorman, Zdeněk Žabokrtský, Amarsanaa Ganbold, Šárka Dohnalová, Magda Ševčíková, Kateřina Pelegrinová, Fausto Giunchiglia, Ryan Cotterell, Ekaterina Vylomova

Figure 1 for The SIGMORPHON 2022 Shared Task on Morpheme Segmentation
Figure 2 for The SIGMORPHON 2022 Shared Task on Morpheme Segmentation
Figure 3 for The SIGMORPHON 2022 Shared Task on Morpheme Segmentation
Figure 4 for The SIGMORPHON 2022 Shared Task on Morpheme Segmentation

The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams and the best system averaged 97.29% F1 score across all languages, ranging English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian), received 10 system submissions from 3 teams, and the best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support any type of future studies, we released all system predictions, the evaluation script, and all gold standard datasets.

* The 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology 
Viaarxiv icon

UniMorph 4.0: Universal Morphology

May 10, 2022
Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina J. Mielke, Elena Budianskaya, Charbel El-Khaissi, Tiago Pimentel, Michael Gasser, William Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóǧa, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M. Tyers, Edoardo M. Ponti, Grant Aiton, Aryaman Arora, Richard J. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud'hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, Ekaterina Vylomova

Figure 1 for UniMorph 4.0: Universal Morphology
Figure 2 for UniMorph 4.0: Universal Morphology
Figure 3 for UniMorph 4.0: Universal Morphology
Figure 4 for UniMorph 4.0: Universal Morphology

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

* LREC 2022; The first two authors made equal contributions 
Viaarxiv icon

MASALA: Modelling and Analysing the Semantics of Adpositions in Linguistic Annotation of Hindi

May 08, 2022
Aryaman Arora, Nitin Venkateswaran, Nathan Schneider

Figure 1 for MASALA: Modelling and Analysing the Semantics of Adpositions in Linguistic Annotation of Hindi
Figure 2 for MASALA: Modelling and Analysing the Semantics of Adpositions in Linguistic Annotation of Hindi
Figure 3 for MASALA: Modelling and Analysing the Semantics of Adpositions in Linguistic Annotation of Hindi
Figure 4 for MASALA: Modelling and Analysing the Semantics of Adpositions in Linguistic Annotation of Hindi

We present a completed, publicly available corpus of annotated semantic relations of adpositions and case markers in Hindi. We used the multilingual SNACS annotation scheme, which has been applied to a variety of typologically diverse languages. Building on past work examining linguistic problems in SNACS annotation, we use language models to attempt automatic labelling of SNACS supersenses in Hindi and achieve results competitive with past work on English. We look towards upstream applications in semantic role labelling and extension to related languages such as Gujarati.

* 9 pages (6 main text). To appear at LREC 2022 
Viaarxiv icon

Estimating the Entropy of Linguistic Distributions

Apr 05, 2022
Aryaman Arora, Clara Meister, Ryan Cotterell

Figure 1 for Estimating the Entropy of Linguistic Distributions
Figure 2 for Estimating the Entropy of Linguistic Distributions
Figure 3 for Estimating the Entropy of Linguistic Distributions
Figure 4 for Estimating the Entropy of Linguistic Distributions

Shannon entropy is often a quantity of interest to linguists studying the communicative capacity of human language. However, entropy must typically be estimated from observed data because researchers do not have access to the underlying probability distribution that gives rise to these data. While entropy estimation is a well-studied problem in other fields, there is not yet a comprehensive exploration of the efficacy of entropy estimators for use with linguistic data. In this work, we fill this void, studying the empirical effectiveness of different entropy estimators for linguistic distributions. In a replication of two recent information-theoretic linguistic studies, we find evidence that the reported effect size is over-estimated due to over-reliance on poor entropy estimators. Finally, we end our paper with concrete recommendations for entropy estimation depending on distribution type and data availability.

* 21 pages (5 pages main text). 4 figures. Accepted to ACL 2022 
Viaarxiv icon