
Lonneke van der Plas


Can language models learn analogical reasoning? Investigating training objectives and comparisons to human performance

Oct 22, 2023
Molly R. Petersen, Lonneke van der Plas

While analogies are a common way to evaluate word embeddings in NLP, it is also of interest to investigate whether analogical reasoning is itself a task that can be learned. In this paper, we test several ways to learn basic analogical reasoning, specifically focusing on analogies that are more typical of what is used to evaluate analogical reasoning in humans than those in commonly used NLP benchmarks. Our experiments find that models are able to learn analogical reasoning, even with a small amount of data. We additionally evaluate our models on a dataset with a human baseline, and find that after training, models approach human performance.
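To make the setup concrete, here is a minimal sketch (our own illustration, not the paper's code) of one way such a training objective could be framed: analogy verification as binary classification over relation offsets between word embeddings. The architecture, dimensions, and data below are assumptions.

```python
# Hypothetical sketch: classify whether A:B :: C:D holds, given word embeddings.
import torch
import torch.nn as nn

class AnalogyClassifier(nn.Module):
    def __init__(self, emb_dim: int = 300, hidden: int = 256):
        super().__init__()
        # Compare the two relation vectors (B - A) and (D - C).
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, a, b, c, d):
        rel_ab = b - a
        rel_cd = d - c
        return self.net(torch.cat([rel_ab, rel_cd], dim=-1)).squeeze(-1)

model = AnalogyClassifier()
a, b, c, d = (torch.randn(8, 300) for _ in range(4))   # random stand-ins for embeddings
labels = torch.randint(0, 2, (8,)).float()             # 1 = valid analogy, 0 = invalid
loss = nn.BCEWithLogitsLoss()(model(a, b, c, d), labels)
loss.backward()
```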


Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese

May 26, 2022
Kurt Micallef, Albert Gatt, Marc Tanti, Lonneke van der Plas, Claudia Borg


Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT -- Maltese -- with a range of pre-training setups. We conduct evaluations with the newly pre-trained models on three morphosyntactic tasks -- dependency parsing, part-of-speech tagging, and named-entity recognition -- and one semantic classification task -- sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough to make significant leaps in performance over Wikipedia-trained models. We pre-train and compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu). The models achieve state-of-the-art performance on these tasks, despite the new corpus being considerably smaller than the corpora typically used for high-resource languages. On average, BERTu outperforms or performs competitively with mBERTu, and the largest gains are observed for higher-level tasks.
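As an illustration of the further pre-training route (the mBERTu direction), the sketch below continues masked language modelling of multilingual BERT on a monolingual text file with Hugging Face Transformers. It is a generic recipe under assumed hyperparameters and a placeholder corpus path, not the paper's configuration.

```python
# Sketch: continued MLM pre-training of mBERT on monolingual Maltese text.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Plain-text corpus, one document per line (hypothetical path).
dataset = load_dataset("text", data_files={"train": "maltese_corpus.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbertu-sketch",
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```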

* DeepLo 2022 camera-ready version 

Data Augmentation for Speech Recognition in Maltese: A Low-Resource Perspective

Nov 15, 2021
Carlos Mena, Andrea DeMarco, Claudia Borg, Lonneke van der Plas, Albert Gatt


Developing speech technologies is a challenge for low-resource languages for which both annotated and raw speech data are scarce. Maltese is one such language. Recent years have seen an increased interest in the computational processing of Maltese, including speech technologies, but resources for the latter remain sparse. In this paper, we consider data augmentation techniques for improving speech recognition for such languages, focusing on Maltese as a test case. We consider three different types of data augmentation: unsupervised training, multilingual training and the use of synthesized speech as training data. The goal is to determine which of these techniques, or combination of them, is the most effective at improving speech recognition for languages where the starting point is a small corpus of approximately 7 hours of transcribed speech. Our results show that combining the three data augmentation techniques studied here leads to an absolute WER improvement of 15% without the use of a language model.
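As a toy illustration of one of these strategies, the sketch below pools the small transcribed corpus with synthesized speech into a single training manifest. The JSON-lines format and file names are assumptions, not the paper's actual pipeline.

```python
# Sketch: merge real and TTS-generated utterances into one training manifest.
import json

def read_manifest(path):
    """Each line: {"audio": "clip.wav", "text": "transcript", "duration": 3.2}."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

real = read_manifest("maltese_7h_train.jsonl")        # ~7 h of real transcribed speech
synthetic = read_manifest("maltese_tts_synth.jsonl")  # synthesized speech (hypothetical)

combined = real + [dict(entry, source="tts") for entry in synthetic]
hours = sum(e["duration"] for e in combined) / 3600
print(f"{len(combined)} utterances, {hours:.1f} h of training audio")

with open("train_augmented.jsonl", "w", encoding="utf-8") as f:
    for entry in combined:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```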

* 30 pages; 9 tables 

On the Language-specificity of Multilingual BERT and the Impact of Fine-tuning

Sep 14, 2021
Marc Tanti, Lonneke van der Plas, Claudia Borg, Albert Gatt


Recent work has shown evidence that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific and a language-neutral one. This paper analyses the relationship between them, in the context of fine-tuning on two tasks -- POS tagging and natural language inference -- which require the model to bring to bear different degrees of language-specific knowledge. Visualisations reveal that mBERT loses the ability to cluster representations by language after fine-tuning, a result that is supported by evidence from language identification experiments. However, further experiments on 'unlearning' language-specific representations using gradient reversal and iterative adversarial learning are shown not to add further improvement to the language-independent component over and above the effect of fine-tuning. The results presented here suggest that the process of fine-tuning causes a reorganisation of the model's limited representational capacity, enhancing language-independent representations at the expense of language-specific ones.
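For readers unfamiliar with gradient reversal, the sketch below shows the generic mechanism: the forward pass is the identity, while the backward pass flips the gradient, so a language classifier trained on top of the encoder pushes the shared representations toward language neutrality. This is a standard illustration of the technique, not the authors' implementation; dimensions and the language count are assumptions.

```python
# Sketch of a gradient reversal layer (GRL) with a language discriminator.
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)          # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the encoder.
        return -ctx.lambd * grad_output, None

class LanguageDiscriminator(nn.Module):
    def __init__(self, hidden: int = 768, n_langs: int = 104, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Linear(hidden, n_langs)

    def forward(self, sentence_repr):
        reversed_repr = GradReverse.apply(sentence_repr, self.lambd)
        return self.classifier(reversed_repr)

# Usage: feed sentence representations (e.g. mBERT's [CLS] vectors) through the
# discriminator and add its cross-entropy loss to the main task loss.
disc = LanguageDiscriminator()
logits = disc(torch.randn(4, 768))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 3]))
loss.backward()
```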

* 22 pages, 6 figures, 5 tables, to appear in BlackBoxNLP 2021 

Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis

Aug 14, 2020
Stavros Assimakopoulos, Rebecca Vella Muskat, Lonneke van der Plas, Albert Gatt


This paper presents a novel scheme for the annotation of hate speech in corpora of Web 2.0 commentary. The proposed scheme is motivated by the critical analysis of posts made in reaction to news reports on the Mediterranean migration crisis and LGBTIQ+ matters in Malta, which was conducted under the auspices of the EU-funded C.O.N.T.A.C.T. project. Based on the realization that hate speech is not a clear-cut category to begin with, appears to belong to a continuum of discriminatory discourse and is often realized through indirect linguistic means, it is argued that annotation schemes for its detection should refrain from directly including the label 'hate speech,' as different annotators might have different thresholds as to what constitutes hate speech and what does not. In view of this, we suggest a multi-layer annotation scheme, which is pilot-tested against a binary +/- hate speech classification and appears to yield higher inter-annotator agreement. Motivating the postulation of our scheme, we then present the MaNeCo corpus on which it will eventually be used: a substantial corpus of on-line newspaper comments spanning 10 years.
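As a concrete illustration of the agreement comparison mentioned above, the following sketch computes Cohen's kappa for a coarse binary hate-speech label versus one finer layer of a multi-layer scheme. The labels are invented for illustration and are not MaNeCo annotations.

```python
# Sketch: compare inter-annotator agreement under two annotation schemes.
from sklearn.metrics import cohen_kappa_score

# Two annotators, binary scheme: 1 = hate speech, 0 = not hate speech.
ann1_binary = [1, 0, 1, 1, 0, 0, 1, 0]
ann2_binary = [0, 0, 1, 1, 1, 0, 1, 1]

# The same comments under one hypothetical layer of a multi-layer scheme,
# e.g. the target of the discriminatory remark.
ann1_layer = ["migrant", "none", "migrant", "lgbtiq", "none", "none", "migrant", "none"]
ann2_layer = ["migrant", "none", "migrant", "lgbtiq", "none", "none", "migrant", "other"]

print("binary kappa:", cohen_kappa_score(ann1_binary, ann2_binary))
print("layer kappa: ", cohen_kappa_score(ann1_layer, ann2_layer))
```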

* 10 pages, 1 table. Appears in Proceedings of the 12th edition of the Language Resources and Evaluation Conference (LREC'20) 

MASRI-HEADSET: A Maltese Corpus for Speech Recognition

Aug 13, 2020
Carlos Mena, Albert Gatt, Andrea DeMarco, Claudia Borg, Lonneke van der Plas, Amanda Muscat, Ian Padovani


Maltese, the national language of Malta, is spoken by approximately 500,000 people. Speech processing for Maltese is still in its early stages of development. In this paper, we present the first spoken Maltese corpus designed purposely for Automatic Speech Recognition (ASR). The MASRI-HEADSET corpus was developed by the MASRI project at the University of Malta. It consists of 8 hours of speech paired with text, recorded by using short text snippets in a laboratory environment. The speakers were recruited from different geographical locations all over the Maltese islands, and were roughly evenly distributed by gender. This paper also presents some initial results achieved in baseline experiments for Maltese ASR using Sphinx and Kaldi. The MASRI-HEADSET Corpus is publicly available for research/academic purposes.
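For illustration, here is a minimal sketch of reading one audio/transcript pair of the kind such a corpus provides and resampling it to 16 kHz, a common input rate for ASR toolkits. The file names are placeholders, not actual corpus entries.

```python
# Sketch: load one utterance and its transcript, resample audio to 16 kHz.
import torchaudio

waveform, sample_rate = torchaudio.load("MSR_0001.wav")   # placeholder clip
with open("MSR_0001.txt", encoding="utf-8") as f:
    transcript = f.read().strip()

if sample_rate != 16_000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16_000)(waveform)

print(waveform.shape, transcript)
```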

* 8 pages, 2 figures, 4 tables, 1 appendix. Appears in Proceedings of the 12th edition of the Language Resources and Evaluation Conference (LREC'20) 

The societal and ethical relevance of computational creativity

Jul 23, 2020
Michele Loi, Eleonora Viganò, Lonneke van der Plas


In this paper, we provide a philosophical account of the value of creative systems for individuals and society. We characterize creativity in very broad philosophical terms, encompassing natural, existential, and social creative processes, such as natural evolution and entrepreneurship, and explain why creativity understood in this way is instrumental for advancing human well-being in the long term. We then explain why current mainstream AI tends to be anti-creative, which means that there are moral costs to employing this type of AI in human endeavors, although computational systems that involve creativity are on the rise. In conclusion, we argue that ethics should be more hospitable to creativity-enabling AI, even though such AI may involve trade-offs with other values promoted in AI ethics, such as explainability and accuracy.

* 4 pages, 1 figure, Eleventh International Conference on Computational Creativity, ICCC'20 