Johannes Bjerva

What is 'Typological Diversity' in NLP?

Feb 14, 2024

Esther Ploeger, Wessel Poelman, Miryam de Lhoneux, Johannes Bjerva

Abstract: The NLP research community has devoted increased attention to languages beyond English, resulting in considerable improvements for multilingual NLP. However, these improvements only apply to a small subset of the world's languages. Aiming to extend this, an increasing number of papers aspire to enhance generalizable multilingual performance across languages. To this end, linguistic typology is commonly used to motivate language selection, on the basis that a broad typological sample ought to imply generalization across a broad range of languages. These selections are often described as being 'typologically diverse'. In this work, we systematically investigate NLP research that includes claims regarding 'typological diversity'. We find there are no set definitions or criteria for such claims. We introduce metrics to approximate the diversity of language selection along several axes and find that the results vary considerably across papers. Crucially, we show that skewed language selection can lead to overestimated multilingual performance. We recommend that future work include an operationalization of 'typological diversity' that empirically justifies the diversity of language samples.

Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification

Feb 05, 2024

Kushal Tatariya, Heather Lent, Johannes Bjerva, Miryam de Lhoneux

Abstract: Emotion classification is a challenging task in NLP due to the inherently idiosyncratic and subjective nature of linguistic expression, especially with code-mixed data. Pre-trained language models (PLMs) have achieved high performance for many tasks and languages, but it remains to be seen whether these models learn and are robust to the differences in emotional expression across languages. Sociolinguistic studies have shown that Hinglish speakers switch to Hindi when expressing negative emotions and to English when expressing positive emotions. To understand whether language models can learn these associations, we study the effect of language on emotion prediction across 3 PLMs on a Hinglish emotion classification dataset. Using LIME and token-level language ID, we find that models do learn these associations between language choice and emotional expression. Moreover, having code-mixed data present in the pre-training can augment that learning when task-specific data is scarce. We also conclude from the misclassifications that the models may overgeneralise this heuristic to other infrequent examples where this sociolinguistic phenomenon does not apply.

* 5 pages, Accepted to SIGTYP 2024 @ EACL

Multilingual Gradient Word-Order Typology from Universal Dependencies

Feb 02, 2024

Emi Baylor, Esther Ploeger, Johannes Bjerva

Abstract: While information from the field of linguistic typology has the potential to improve performance on NLP tasks, reliable typological data is a prerequisite. Existing typological databases, including WALS and Grambank, suffer from inconsistencies primarily caused by their categorical format. Furthermore, typological categorisations by definition differ significantly from the continuous nature of phenomena, as found in natural language corpora. In this paper, we introduce a new seed dataset made up of continuous-valued data, rather than categorical data, that can better reflect the variability of language. While this initial dataset focuses on word-order typology, we also present the methodology used to create the dataset, which can be easily adapted to generate data for a broader set of features and languages.

* EACL 2024

Text Embedding Inversion Attacks on Multilingual Language Models

Jan 22, 2024

Yiyi Chen, Heather Lent, Johannes Bjerva

Abstract: Representing textual information as real-numbered embeddings has become the norm in NLP. Moreover, with the rise of public interest in large language models (LLMs), Embeddings as a Service (EaaS) has rapidly gained traction as a business model. This is not without outstanding security risks, as previous research has demonstrated that sensitive data can be reconstructed from embeddings, even without knowledge of the underlying model that generated them. However, such work is limited by its sole focus on English, leaving all other languages vulnerable to attacks by malicious actors. As many international and multilingual companies leverage EaaS, there is an urgent need for research into multilingual LLM security. To this end, this work investigates LLM security from the perspective of multilingual embedding inversion. Concretely, we define the problem of black-box multilingual and cross-lingual inversion attacks, with special attention to a cross-domain scenario. Our findings reveal that multilingual models are potentially more vulnerable to inversion attacks than their monolingual counterparts. This stems from the reduced data requirements for achieving comparable inversion performance in settings where the underlying language is not known a priori. To our knowledge, this work is the first to delve into multilinguality within the context of inversion attacks, and our findings highlight the need for further investigation and enhanced defenses in the area of NLP Security.

* 13 pages

Patterns of Persistence and Diffusibility across the World's Languages

Jan 05, 2024

Yiyi Chen, Johannes Bjerva

Abstract: Language similarities can be caused by genetic relatedness, areal contact, universality, or chance. Colexification, i.e. a type of similarity where a single lexical form is used to convey multiple meanings, is underexplored. In our work, we shed light on the linguistic causes of cross-lingual similarity in colexification and phonology, by exploring genealogical stability (persistence) and contact-induced change (diffusibility). We construct large-scale graphs incorporating semantic, genealogical, phonological and geographical data for 1,966 languages. We then show the potential of this resource, by investigating several established hypotheses from previous work in linguistics, while proposing new ones. Our results strongly support a previously established hypothesis in the linguistic literature, while offering evidence that contradicts another. Our large-scale resource opens up further research across disciplines, e.g. in multilingual NLP and comparative linguistics.

* 21 pages

Patterns of Closeness and Abstractness in Colexifications: The Case of Indigenous Languages in the Americas

Dec 18, 2023

Yiyi Chen, Johannes Bjerva

Abstract: Colexification refers to linguistic phenomena where multiple concepts (meanings) are expressed by the same lexical form, such as polysemy or homophony. Colexifications have been found to be pervasive across languages and cultures. The problem of concreteness/abstractness of concepts is interdisciplinary, studied from a cognitive standpoint in linguistics, psychology, psycholinguistics, neurophysiology, etc. In this paper, we hypothesize that concepts that are closer in concreteness/abstractness are more likely to colexify, and we test the hypothesis across indigenous languages in the Americas.

* 3 pages, 2 figures, 1 table, AmericasNLP 2023

CreoleVal: Multilingual Multitask Benchmarks for Creoles

Oct 30, 2023

Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Hans Erik Heje, Diptesh Kanojia, Paul Belony (+7 more)

Abstract: Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and other highly-resourced languages imply a significant potential for transfer learning, this potential is hampered by this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of brand-new development datasets for machine comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, the goal of CreoleVal is to empower research on Creoles in NLP and computational linguistics. We hope this resource will contribute to technological inclusion for Creole language users around the globe.

The Past, Present, and Future of Typological Databases in NLP

Oct 20, 2023

Emi Baylor, Esther Ploeger, Johannes Bjerva

Abstract: Typological information has the potential to be beneficial in the development of NLP models, particularly for low-resource languages. Unfortunately, current large-scale typological databases, notably WALS and Grambank, are inconsistent both with each other and with other sources of typological information, such as linguistic grammars. Some of these inconsistencies stem from coding errors or linguistic variation, but many of the disagreements are due to the discrete categorical nature of these databases. We shed light on this issue by systematically exploring disagreements across typological databases and resources, and their uses in NLP, covering the past and present. We next investigate the future of such work, offering an argument that a continuous view of typological features is clearly beneficial, echoing recommendations from linguistics. We propose that such a view of typology has significant potential in the future, including in language modeling in low-resource scenarios.

* Accepted to EMNLP Findings

A Framework for Responsible Development of Automated Student Feedback with Generative AI

Aug 29, 2023

Euan D Lindsay, Aditya Johri, Johannes Bjerva

Abstract: Providing rich feedback to students is essential for supporting student learning. Recent advances in generative AI, particularly within large language modelling (LLM), provide the opportunity to deliver repeatable, scalable and instant automatically generated feedback to students, making abundant a previously scarce and expensive learning resource. Such an approach is feasible from a technical perspective due to these recent advances in Artificial Intelligence (AI) and Natural Language Processing (NLP); while the potential upside is a strong motivator, doing so introduces a range of potential ethical issues that must be considered as we apply these technologies. The attractiveness of AI systems is that they can effectively automate the most mundane tasks; but this risks introducing a "tyranny of the majority", where the needs of minorities in the long tail are overlooked because they are difficult to automate. Developing machine learning models that can generate valuable and authentic feedback requires the input of human domain experts. The choices we make in capturing this expertise -- whose, which, when, and how -- will have significant consequences for the nature of the resulting feedback. How we maintain our models will affect how that feedback remains relevant given temporal changes in context, theory, and prior learning profiles of student cohorts. These questions are important from an ethical perspective; but they are also important from an operational perspective. Unless they can be answered, our AI-generated systems will lack the trust necessary for them to be useful features in the contemporary learning environment. This article will outline the frontiers of automated feedback, identify the ethical issues involved in the provision of automated feedback and present a framework to assist academics to develop such systems responsibly.

* 10 pages, under review at IEEE TLT

Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness

Jun 05, 2023

Yiyi Chen, Johannes Bjerva

Abstract: Colexification refers to the linguistic phenomenon where a single lexical form is used to convey multiple meanings. By studying cross-lingual colexifications, researchers have gained valuable insights into fields such as psycholinguistics and cognitive sciences [Jackson et al., 2019]. While several multilingual colexification datasets exist, there is untapped potential in using this information to bootstrap datasets across such semantic features. In this paper, we aim to demonstrate how colexifications can be leveraged to create such cross-lingual datasets. We showcase curation procedures which result in a dataset covering 142 languages across 21 language families across the world. The dataset includes ratings of concreteness and affectiveness, mapped with phonemes and phonological features. We further analyze the dataset along different dimensions to demonstrate the potential of the proposed procedures in facilitating further interdisciplinary research in psychology, cognitive science, and multilingual natural language processing (NLP). Based on initial investigations, we observe that i) concepts that are closer in concreteness/affectiveness are more likely to colexify; ii) certain initial/last phonemes are significantly correlated with concreteness/affectiveness within language families, such as /k/ as the initial phoneme in both Turkic and Tai-Kadai correlated with concreteness, and /p/ in Dravidian and Sino-Tibetan correlated with valence; iii) the type-to-token ratio (TTR) of phonemes is positively correlated with concreteness across several language families, while the length of phoneme segments is negatively correlated with concreteness; iv) certain phonological features are negatively correlated with concreteness across languages. The dataset is made public online for further research.

* 13 pages, 4 figures, accepted to SIGMORPHON 2023
