Miryam de Lhoneux

Sociolinguistically Informed Interpretability: A Case Study on Hinglish Emotion Classification

Feb 05, 2024

Kushal Tatariya, Heather Lent, Johannes Bjerva, Miryam de Lhoneux

Abstract: Emotion classification is a challenging task in NLP due to the inherent idiosyncratic and subjective nature of linguistic expression, especially with code-mixed data. Pre-trained language models (PLMs) have achieved high performance for many tasks and languages, but it remains to be seen whether these models learn and are robust to the differences in emotional expression across languages. Sociolinguistic studies have shown that Hinglish speakers switch to Hindi when expressing negative emotions and to English when expressing positive emotions. To understand if language models can learn these associations, we study the effect of language on emotion prediction across 3 PLMs on a Hinglish emotion classification dataset. Using LIME and token-level language ID, we find that models do learn these associations between language choice and emotional expression. Moreover, having code-mixed data present in pre-training can augment that learning when task-specific data is scarce. We also conclude from the misclassifications that the models may overgeneralise this heuristic to other infrequent examples where this sociolinguistic phenomenon does not apply.

* 5 pages, Accepted to SIGTYP 2024 @ EACL
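
The abstract above combines per-token attribution scores (from LIME) with token-level language tags. A minimal sketch of that kind of aggregation might look as follows. The function name, the placeholder tokens, the "hi"/"en" tags, and the weights are all invented for illustration; they are not taken from the paper or produced by LIME.

```python
from collections import defaultdict

def attribution_by_language(scored_tokens):
    """Sum per-token attribution weights separately for each language tag.

    scored_tokens: iterable of (token, language_tag, weight) triples, where
    weight is the token's signed contribution to one predicted emotion.
    """
    totals = defaultdict(float)
    for _token, lang, weight in scored_tokens:
        totals[lang] += weight
    return dict(totals)

# Toy input: placeholder tokens with made-up weights (assumption, not real data).
scored = [
    ("w1", "hi", 0.5),
    ("w2", "hi", 0.25),
    ("w3", "en", -0.25),
    ("w4", "en", 0.125),
]
by_language = attribution_by_language(scored)
print(by_language)
```

Comparing the per-language totals across emotion classes is one simple way to check whether a model leans on language choice; the paper may well use a different aggregation.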

CreoleVal: Multilingual Multitask Benchmarks for Creoles

Oct 30, 2023

Heather Lent, Kushal Tatariya, Raj Dabre, Yiyi Chen, Marcell Fekete, Esther Ploeger, Li Zhou, Hans Erik Heje, Diptesh Kanojia, Paul Belony(+7 more)

Abstract: Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and other highly-resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of brand new development datasets for machine comprehension, relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, the goal of CreoleVal is to empower research on Creoles in NLP and computational linguistics. We hope this resource will contribute to technological inclusion for Creole language users around the globe.

A Two-Sided Discussion of Preregistration of NLP Research

Feb 20, 2023

Anders Søgaard, Daniel Hershcovich, Miryam de Lhoneux

Abstract: Van Miltenburg et al. (2021) suggest NLP research should adopt preregistration to prevent fishing expeditions and to promote publication of negative results. At face value, this is a very reasonable suggestion, seemingly solving many methodological problems with NLP research. We discuss pros and cons -- some old, some new: a) preregistration is challenged by the practice of retrieving hypotheses after the results are known; b) preregistration may bias NLP toward confirmatory research; c) preregistration must allow for reclassification of research as exploratory; d) preregistration may increase publication bias; e) preregistration may increase flag-planting; f) preregistration may increase p-hacking; and finally, g) preregistration may make us less risk tolerant. We cast our discussion as a dialogue, presenting both sides of the debate.

* EACL 2023

Language Modelling with Pixels

Jul 14, 2022

Phillip Rust, Jonas F. Lotz, Emanuele Bugliarello, Elizabeth Salesky, Miryam de Lhoneux, Desmond Elliott

Abstract: Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which suffers from neither of these issues. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. PIXEL is trained to reconstruct the pixels of masked patches, instead of predicting a distribution over tokens. We pretrain the 86M parameter PIXEL model on the same English data as BERT and evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. Furthermore, we find that PIXEL is more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.

* work in progress
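
The training objective described above, reconstructing the pixels of masked patches rather than predicting tokens, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy pixel grid stands in for rendered text, and the patch width, mask ratio, function names, and constant "prediction" are hypothetical rather than PIXEL's actual settings or code.

```python
import random

def split_into_patches(image, patch_width):
    """Cut a 2D pixel grid (list of rows) into vertical strips of equal width."""
    width = len(image[0])
    assert width % patch_width == 0, "width must be a multiple of patch_width"
    return [
        [row[start:start + patch_width] for row in image]
        for start in range(0, width, patch_width)
    ]

def choose_masked(num_patches, mask_ratio, rng):
    """Pick the indices of the patches that are hidden from the model."""
    num_masked = max(1, round(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))

def masked_mse(target_patches, predicted_patches, masked_indices):
    """Mean squared pixel error, computed only over the masked patches."""
    total, count = 0.0, 0
    for i in masked_indices:
        for t_row, p_row in zip(target_patches[i], predicted_patches[i]):
            for t, p in zip(t_row, p_row):
                total += (t - p) ** 2
                count += 1
    return total / count

rng = random.Random(0)
# Toy 2x8 checkerboard standing in for a rendered line of text (assumption).
image = [[float((x + y) % 2) for x in range(8)] for y in range(2)]
patches = split_into_patches(image, patch_width=2)
masked = choose_masked(len(patches), mask_ratio=0.25, rng=rng)
# Stand-in for a model output: mid-grey everywhere.
prediction = [[[0.5 for _ in row] for row in patch] for patch in patches]
loss = masked_mse(patches, prediction, masked)
print(len(patches), masked, loss)
```

Because the loss is defined over pixels, no vocabulary or output softmax is involved, which is the property the abstract highlights.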

What a Creole Wants, What a Creole Needs

Jun 01, 2022

Heather Lent, Kelechi Ogueji, Miryam de Lhoneux, Orevaoghene Ahia, Anders Søgaard

Abstract: In recent years, the natural language processing (NLP) community has given increased attention to the disparity of efforts directed towards high-resource languages over low-resource ones. Efforts to remedy this delta often begin with translations of existing English datasets into other languages. However, this approach ignores that different language communities have different needs. We consider a group of low-resource languages, Creole languages. Creoles are both largely absent from the NLP literature, and also often ignored by society at large due to stigma, despite these languages having sizable and vibrant communities. We demonstrate, through conversations with Creole experts and surveys of Creole-speaking communities, how the things needed from language technology can change dramatically from one language to another, even when the languages are considered to be very similar to each other, as with Creoles. We discuss the prominent themes arising from these conversations, and ultimately demonstrate that useful language technology cannot be built without involving the relevant community.

* LREC 2022

Challenges and Strategies in Cross-Cultural NLP

Mar 18, 2022

Daniel Hershcovich, Stella Frank, Heather Lent, Miryam de Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui(+4 more)

Abstract: Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers, and the content they produce and require, vary not just by language, but also by culture. Although language and culture are tightly linked, there are important differences. Analogous to cross-lingual and multilingual NLP, cross-cultural and multicultural NLP considers these differences in order to better serve users of NLP systems. We propose a principled framework to frame these efforts, and survey existing and potential strategies.

* ACL 2022 - Theme track

Finding Structural Knowledge in Multimodal-BERT

Mar 17, 2022

Victor Milewski, Miryam de Lhoneux, Marie-Francine Moens

Abstract: In this work, we investigate the knowledge learned in the embeddings of multimodal-BERT models. More specifically, we probe their capabilities of storing the grammatical structure of linguistic data and the structure learned over objects in visual data. To reach that goal, we first make the inherent structure of language and visuals explicit by a dependency parse of the sentences that describe the image and by the dependencies between the object regions in the image, respectively. We call this explicit visual structure the scene tree, which is based on the dependency tree of the language description. Extensive probing experiments show that the multimodal-BERT models do not encode these scene trees. Code is available at https://github.com/VSJMilewski/multimodal-probes

* Accepted at ACL 2022

Zero-Shot Dependency Parsing with Worst-Case Aware Automated Curriculum Learning

Mar 16, 2022

Miryam de Lhoneux, Sheng Zhang, Anders Søgaard

Abstract: Large multilingual pretrained language models such as mBERT and XLM-RoBERTa have been found to be surprisingly effective for cross-lingual transfer of syntactic parsing models (Wu and Dredze 2019), but only between related languages. However, source and training languages are rarely related when parsing truly low-resource languages. To close this gap, we adopt a method from multi-task learning, which relies on automated curriculum learning, to dynamically optimize for parsing performance on outlier languages. We show that this approach is significantly better than uniform and size-proportional sampling in the zero-shot setting.

* ACL 2022
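
The abstract above contrasts uniform and size-proportional sampling of training languages with a curriculum that adapts to the hardest languages. The sketch below shows the two baselines next to a simple loss-weighted sampler. The loss-weighted softmax is only a stand-in chosen for illustration, not the authors' method, and the language names, treebank sizes, and loss values are hypothetical.

```python
import math

def uniform_probs(sizes):
    """Every training language is sampled equally often."""
    n = len(sizes)
    return {lang: 1.0 / n for lang in sizes}

def size_proportional_probs(sizes):
    """Languages are sampled in proportion to their treebank size."""
    total = sum(sizes.values())
    return {lang: size / total for lang, size in sizes.items()}

def loss_weighted_probs(losses, temperature=1.0):
    """Softmax over current per-language losses: harder languages get more weight."""
    highest = max(losses.values())
    exps = {
        lang: math.exp((loss - highest) / temperature)
        for lang, loss in losses.items()
    }
    normaliser = sum(exps.values())
    return {lang: value / normaliser for lang, value in exps.items()}

# Hypothetical treebank sizes (sentences) and current dev losses.
sizes = {"lang_a": 9000, "lang_b": 900, "lang_c": 100}
losses = {"lang_a": 0.5, "lang_b": 1.0, "lang_c": 2.0}

uniform = uniform_probs(sizes)
proportional = size_proportional_probs(sizes)
adaptive = loss_weighted_probs(losses)
for name, probs in [("uniform", uniform), ("size", proportional), ("loss", adaptive)]:
    print(name, {lang: round(p, 3) for lang, p in probs.items()})
```

In a training loop the loss-weighted distribution would be recomputed as losses change, so the sampler keeps shifting toward whichever language is currently parsed worst.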

Parsing with Pretrained Language Models, Multiple Datasets, and Dataset Embeddings

Dec 07, 2021

Rob van der Goot, Miryam de Lhoneux

Abstract: With an increase of dataset availability, the potential for learning from a variety of data sources has increased. One particular method to improve learning from multiple data sources is to embed the data source during training. This allows the model to learn generalizable features as well as distinguishing features between datasets. However, these dataset embeddings have mostly been used before contextualized transformer-based embeddings were introduced in the field of Natural Language Processing. In this work, we compare two methods to embed datasets in a transformer-based multilingual dependency parser, and perform an extensive evaluation. We show that: 1) embedding the dataset is still beneficial with these models; 2) performance increases are highest when embedding the dataset at the encoder level; 3) unsurprisingly, performance increases are highest for small datasets and datasets with a low baseline score; and 4) training on the combination of all datasets performs similarly to designing smaller clusters based on language-relatedness.

* Accepted to TLT at SyntaxFest 2021
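
The general idea of a dataset embedding, as described above, is to give the model a learned vector identifying the data source alongside each token representation. A minimal sketch of one way to do that, by concatenation, is shown here. It does not reproduce either of the two methods compared in the paper; the function names, dimensions, and dataset identifiers are hypothetical, and random vectors stand in for learned parameters.

```python
import random

def make_embedding_table(keys, dim, rng):
    """Randomly initialised lookup table (a stand-in for learned parameters)."""
    return {key: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for key in keys}

def embed_sentence(tokens, dataset_id, token_table, dataset_table):
    """Concatenate the dataset vector onto every token vector in the sentence."""
    dataset_vector = dataset_table[dataset_id]
    return [token_table[token] + dataset_vector for token in tokens]

rng = random.Random(0)
TOKEN_DIM, DATASET_DIM = 4, 2
token_table = make_embedding_table(["the", "dog", "barks"], TOKEN_DIM, rng)
dataset_table = make_embedding_table(["treebank_a", "treebank_b"], DATASET_DIM, rng)

sentence = ["the", "dog", "barks"]
from_a = embed_sentence(sentence, "treebank_a", token_table, dataset_table)
from_b = embed_sentence(sentence, "treebank_b", token_table, dataset_table)
print(len(from_a), len(from_a[0]))
```

The same token keeps its shared representation across sources, while the appended dimensions let the parser pick up dataset-specific annotation conventions.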

On Language Models for Creoles

Sep 13, 2021

Heather Lent, Emanuele Bugliarello, Miryam de Lhoneux, Chen Qiu, Anders Søgaard

Abstract: Creole languages such as Nigerian Pidgin English and Haitian Creole are under-resourced and largely ignored in the NLP literature. Creoles typically result from the fusion of a foreign language with multiple local languages, and what grammatical and lexical features are transferred to the creole is a complex process. While creoles are generally stable, the prominence of some features may be much stronger with certain demographics or in some linguistic situations. This paper makes several contributions: We collect existing corpora and release models for Haitian Creole, Nigerian Pidgin English, and Singaporean Colloquial English. We evaluate these models on intrinsic and extrinsic tasks. Motivated by the above literature, we compare standard language models with distributionally robust ones and find that, somewhat surprisingly, the standard language models are superior to the distributionally robust ones. We investigate whether this is an effect of over-parameterization or relative distributional stability, and find that the difference persists in the absence of over-parameterization, and that drift is limited, confirming the relative stability of creole languages.

* CoNLL 2021
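
The contrast above between standard and distributionally robust language models comes down to which quantity training minimises. The sketch below shows one common formulation, average loss versus worst-group loss. The abstract does not say which robust variant was used, so this is an assumption, and the group names and loss values are invented for illustration.

```python
def average_loss(losses_by_group):
    """Standard objective: mean loss over all examples, ignoring group membership."""
    all_losses = [loss for group in losses_by_group.values() for loss in group]
    return sum(all_losses) / len(all_losses)

def worst_group_loss(losses_by_group):
    """Robust objective: mean loss of the group on which the model does worst."""
    return max(sum(group) / len(group) for group in losses_by_group.values())

# Hypothetical per-example losses for a large and a small sub-population.
losses_by_group = {
    "group_a": [0.2, 0.4, 0.3, 0.3],
    "group_b": [1.0, 1.4],
}
standard = average_loss(losses_by_group)
robust = worst_group_loss(losses_by_group)
print(round(standard, 3), round(robust, 3))
```

The robust objective is dominated by the small, hard group, which helps when sub-populations really differ; if the language is distributionally stable, as the abstract concludes for creoles, that extra caution may not pay off.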
