Robert Litschko

Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive Language Detection

Nov 03, 2023
Gretel Liz De la Peña Sarracén, Paolo Rosso, Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto

Cross-lingual transfer learning from high-resource to medium- and low-resource languages has shown encouraging results. However, the scarcity of resources in target languages remains a challenge. In this work, we resort to data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection. For data augmentation, we analyze two existing techniques based on vicinal risk minimization and propose MIXAG, a novel data augmentation method that interpolates pairs of instances based on the angle of their representations. Our experiments involve seven languages typologically distinct from English and three different domains. The results reveal that the data augmentation strategies can enhance few-shot cross-lingual abusive language detection. Specifically, we observe that MIXAG yields significant improvements, consistently across all target languages, in multidomain and multilingual settings. Finally, an error analysis shows how domain adaptation can favour the class of abusive texts (reducing false negatives), but at the same time lowers the precision of the abusive language detection model.

* Accepted at EMNLP 2023 (Main Conference) 
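
As a rough illustration of the idea behind angle-based interpolation, the sketch below mixes pairs of instance representations with weights derived from the angle between them (spherical interpolation). It is a minimal, hypothetical rendering of the concept, not the paper's exact MIXAG formulation.

```python
# Minimal sketch of angle-based interpolation between instance representations.
# Hypothetical simplification of the MIXAG idea; the paper's exact method may differ.
import numpy as np

def angle_interpolate(x1: np.ndarray, x2: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Mix two representations along the arc between them, so the interpolation
    weights depend on the angle of the pair (spherical linear interpolation)."""
    u1, u2 = x1 / np.linalg.norm(x1), x2 / np.linalg.norm(x2)
    theta = np.arccos(np.clip(np.dot(u1, u2), -1.0, 1.0))   # angle between the pair
    if np.isclose(theta, 0.0):                               # near-parallel: plain linear mix
        return alpha * x1 + (1 - alpha) * x2
    w1 = np.sin(alpha * theta) / np.sin(theta)
    w2 = np.sin((1 - alpha) * theta) / np.sin(theta)
    return w1 * x1 + w2 * x2                                 # labels would be mixed analogously

# Augment a toy batch of sentence representations by mixing random pairs.
rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 768))                            # e.g. encoder outputs
partners = rng.permutation(len(batch))
augmented = np.stack([angle_interpolate(batch[i], batch[j]) for i, j in enumerate(partners)])
```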

Establishing Trustworthiness: Rethinking Tasks and Model Evaluation

Oct 23, 2023
Robert Litschko, Max Müller-Eberstein, Rob van der Goot, Leon Weber, Barbara Plank

Language understanding is a multi-faceted cognitive capability, which the Natural Language Processing (NLP) community has striven to model computationally for decades. Traditionally, facets of linguistic intelligence have been compartmentalized into tasks with specialized model architectures and corresponding evaluation protocols. With the advent of large language models (LLMs), the community has witnessed a dramatic shift towards general-purpose, task-agnostic approaches powered by generative models. As a consequence, the traditional compartmentalized notion of language tasks is breaking down, bringing increasing challenges for evaluation and analysis. At the same time, LLMs are being deployed in more real-world scenarios, including previously unforeseen zero-shot setups, increasing the need for trustworthy and reliable systems. Therefore, we argue that it is time to rethink what constitutes tasks and model evaluation in NLP, and to pursue a more holistic view of language, placing trustworthiness at the center. Towards this goal, we review existing compartmentalized approaches for understanding the origins of a model's functional capacity, and provide recommendations for more multi-faceted evaluation protocols.

* Accepted at EMNLP 2023 (Main Conference), camera-ready 

Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?

Sep 04, 2023
Leon Weber-Genzel, Robert Litschko, Ekaterina Artemova, Barbara Plank

Instruction-tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality issues in gold-standard labels. So far, however, the application of AED methods has been limited to discriminative settings. It is an open question how well AED methods generalize to generative settings, which are becoming widespread via generative LLMs. In this work, we present Donkii, a first benchmark for AED on instruction-tuning data. It encompasses three instruction-tuning datasets enriched with annotations by experts and semi-automatic methods. We find that all three datasets contain clear-cut errors, which sometimes propagate directly into instruction-tuned LLMs. We propose four AED baselines for the generative setting and evaluate them comprehensively on the newly introduced dataset. Our results show that choosing the right AED method and model size is indeed crucial, and we derive practical recommendations from them. Finally, we provide a first case study examining how the quality of instruction-tuning datasets influences downstream performance.
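
To make the generative AED setting concrete, here is a hedged sketch of a simple loss-based heuristic: rank instruction-response pairs by the language model's average token loss and flag the highest-loss examples as potential annotation errors. The model name and data are placeholders, and this is not necessarily one of the four baselines proposed in the paper.

```python
# Hedged sketch of a loss-based AED heuristic for instruction-tuning data.
# Illustrative only; not necessarily one of the paper's four baselines.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

def example_loss(instruction: str, response: str) -> float:
    """Average token-level loss over the concatenated instruction and response;
    unusually high values can hint at annotation errors (a stricter variant
    would mask out the instruction tokens)."""
    ids = tok(instruction + "\n" + response, return_tensors="pt").input_ids
    with torch.no_grad():
        return lm(ids, labels=ids).loss.item()

data = [("Translate 'Hund' to English.", "dog"),
        ("Translate 'Katze' to English.", "banana")]   # second pair is an injected error
ranked = sorted(data, key=lambda ex: example_loss(*ex), reverse=True)
print(ranked[0])   # the most suspicious (highest-loss) example
```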

A General-Purpose Multilingual Document Encoder

May 11, 2023
Onur Galoğlu, Robert Litschko, Goran Glavaš

Massively multilingual pretrained transformers (MMTs) have tremendously pushed the state of the art in multilingual NLP and, in particular, cross-lingual transfer of NLP models. While a large body of work has leveraged MMTs to mine parallel data and induce bilingual document embeddings, much less effort has been devoted to training a general-purpose (massively) multilingual document encoder that can be used for both supervised and unsupervised document-level tasks. In this work, we pretrain a massively multilingual document encoder as a hierarchical transformer model (HMDE) in which a shallow document transformer contextualizes sentence representations produced by a state-of-the-art pretrained multilingual sentence encoder. We leverage Wikipedia as a readily available source of comparable documents for creating training data, and train HMDE by means of a cross-lingual contrastive objective, further exploiting the category hierarchy of Wikipedia for the creation of difficult negatives. We evaluate the effectiveness of HMDE on the two arguably most common and prominent cross-lingual document-level tasks: (1) cross-lingual transfer for topical document classification and (2) cross-lingual document retrieval. HMDE is significantly more effective than (i) aggregations of segment-based representations and (ii) multilingual Longformer. Crucially, owing to its massively multilingual lower transformer, HMDE successfully generalizes to languages unseen in document-level pretraining. We publicly release our code and models at https://github.com/ogaloglu/pre-training-multilingual-document-encoders.
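
The sketch below illustrates the hierarchical setup described above: a shallow document transformer contextualizes precomputed sentence embeddings, and the pooled document representations are trained with an in-batch contrastive objective over comparable document pairs. Layer counts, the pooling strategy, and the loss details are assumptions for illustration, not the released HMDE configuration.

```python
# Minimal sketch of a hierarchical multilingual document encoder in the spirit of HMDE.
# Sizes, pooling, and loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalDocEncoder(nn.Module):
    def __init__(self, sent_dim: int = 768, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=sent_dim, nhead=n_heads, batch_first=True)
        self.doc_transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, sent_embs: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, n_sentences, sent_dim) from a frozen multilingual sentence encoder
        contextualised = self.doc_transformer(sent_embs)
        return F.normalize(contextualised.mean(dim=1), dim=-1)   # mean-pooled document embedding

def contrastive_loss(docs_a: torch.Tensor, docs_b: torch.Tensor, temp: float = 0.05) -> torch.Tensor:
    """InfoNCE over aligned (comparable) document pairs with in-batch negatives."""
    logits = docs_a @ docs_b.T / temp
    targets = torch.arange(docs_a.size(0))
    return F.cross_entropy(logits, targets)

encoder = HierarchicalDocEncoder()
a = encoder(torch.randn(4, 16, 768))   # e.g. sentence embeddings of Wikipedia articles
b = encoder(torch.randn(4, 16, 768))   # comparable articles in another language
loss = contrastive_loss(a, b)
```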

Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially Code-Switched Data

May 09, 2023
Robert Litschko, Ekaterina Artemova, Barbara Plank

Transferring information retrieval (IR) models from a high-resource language (typically English) to other languages in a zero-shot fashion has become a widely adopted approach. In this work, we show that the effectiveness of zero-shot rankers diminishes when queries and documents are in different languages. Motivated by this, we propose to instead train ranking models on artificially code-switched data, which we generate by utilizing bilingual lexicons. To this end, we experiment with lexicons induced from (1) cross-lingual word embeddings and (2) parallel Wikipedia page titles. We use the mMARCO dataset to extensively evaluate reranking models on 36 language pairs spanning Monolingual IR (MoIR), Cross-lingual IR (CLIR), and Multilingual IR (MLIR). Our results show that code-switching can yield consistent and substantial gains of 5.1 MRR@10 in CLIR and 3.9 MRR@10 in MLIR, while maintaining stable performance in MoIR. Encouragingly, the gains are especially pronounced for distant languages (up to 2x absolute gain). We further show that our approach is robust to the ratio of code-switched tokens and also extends to unseen languages. Our results demonstrate that training on code-switched data is a cheap and effective way of generalizing zero-shot rankers to cross-lingual and multilingual retrieval.

* Accepted to Findings of ACL 2023 
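
A hedged sketch of how artificially code-switched training data can be produced from a bilingual lexicon is given below: a fraction of the tokens in an English query is replaced with dictionary translations. The lexicon entries and the sampling ratio are toy illustrations, not the exact procedure or resources used in the paper.

```python
# Toy code-switching of English text via a bilingual lexicon (illustrative sketch).
import random

def code_switch(text: str, lexicon: dict, ratio: float = 0.5, seed: int = 0) -> str:
    """Replace roughly `ratio` of the translatable tokens with their
    target-language counterparts from a bilingual lexicon."""
    rng = random.Random(seed)
    out = []
    for tok in text.split():
        if tok.lower() in lexicon and rng.random() < ratio:
            out.append(lexicon[tok.lower()])
        else:
            out.append(tok)
    return " ".join(out)

# Toy English->German lexicon; in the paper's setting such lexicons are induced
# from cross-lingual word embeddings or parallel Wikipedia page titles.
lexicon = {"weather": "Wetter", "today": "heute", "capital": "Hauptstadt"}
print(code_switch("what is the weather in paris today", lexicon, ratio=1.0))
# -> "what is the Wetter in paris heute"
```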

ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System

May 30, 2022
Chia-Chien Hung, Tommaso Green, Robert Litschko, Tornike Tsereteli, Sotaro Takeshita, Marco Bombieri, Goran Glavaš, Simone Paolo Ponzetto

This paper introduces our proposed system for the MIA Shared Task on Cross-lingual Open-retrieval Question Answering (COQA). In this challenging scenario, given an input question, the system has to gather evidence documents from a multilingual pool and generate an answer from them in the language of the question. We devised several approaches combining different model variants for three main components: Data Augmentation, Passage Retrieval, and Answer Generation. For passage retrieval, we evaluated the monolingual BM25 ranker against an ensemble of re-rankers based on multilingual pretrained language models (PLMs), as well as variants of the shared task baseline, re-training it from scratch using a recently introduced contrastive loss that maintains a strong gradient signal throughout training by means of mixed negative samples. For answer generation, we focused on language- and domain-specialization by means of continued language model (LM) pretraining of existing multilingual encoders. Additionally, for both passage retrieval and answer generation, we augmented the training data provided by the task organizers with automatically generated question-answer pairs created from Wikipedia passages, to mitigate the issue of data scarcity, particularly for the low-resource languages for which no training data were provided. Our results show that language- and domain-specialization as well as data augmentation help, especially for low-resource languages.
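
For readers unfamiliar with the BM25 baseline mentioned above, the following sketch shows lexical passage retrieval with the rank_bm25 package on a toy corpus. It is only meant to illustrate the monolingual sparse-retrieval component, not the shared-task setup.

```python
# Minimal BM25 passage-retrieval sketch (toy corpus and query; illustrative only).
from rank_bm25 import BM25Okapi

passages = [
    "Paris is the capital of France.",
    "The Amazon is the largest rainforest on Earth.",
    "Mount Everest is the highest mountain above sea level.",
]
bm25 = BM25Okapi([p.lower().split() for p in passages])   # whitespace tokenization for brevity

query = "what is the capital of france"
scores = bm25.get_scores(query.lower().split())
best = max(range(len(passages)), key=scores.__getitem__)
print(passages[best])   # -> "Paris is the capital of France."
```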

Parameter-Efficient Neural Reranking for Cross-Lingual and Multilingual Retrieval

Apr 05, 2022
Robert Litschko, Ivan Vulić, Goran Glavaš

State-of-the-art neural (re)rankers are notoriously data-hungry, which, given the lack of large-scale training data in languages other than English, makes them rarely used in multilingual and cross-lingual retrieval settings. Current approaches therefore typically transfer rankers trained on English data to other languages and cross-lingual setups by means of multilingual encoders: they fine-tune all the parameters of a pretrained massively multilingual Transformer (MMT, e.g., multilingual BERT) on English relevance judgments and then deploy it in the target language. In this work, we show that two parameter-efficient approaches to cross-lingual transfer, namely Sparse Fine-Tuning Masks (SFTMs) and Adapters, allow for a more lightweight and more effective zero-shot transfer to multilingual and cross-lingual retrieval tasks. We first train language adapters (or SFTMs) via Masked Language Modelling and then train retrieval (i.e., reranking) adapters (or SFTMs) on top, while keeping all other parameters fixed. At inference, this modular design allows us to compose the ranker by applying the task adapter (or SFTM) trained on source-language data together with the language adapter (or SFTM) of the target language. Besides improved transfer performance, these two approaches offer faster ranker training, with only a fraction of parameters being updated compared to full MMT fine-tuning. We benchmark our models on the CLEF-2003 benchmark, showing that our parameter-efficient methods outperform standard zero-shot transfer with full MMT fine-tuning, while enabling modularity and reducing training times. Further, we show on the example of Swahili and Somali that, for low(er)-resource languages, our parameter-efficient neural re-rankers can improve over a competitive machine translation-based ranker.
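
The modular composition described above can be pictured with a minimal, framework-agnostic sketch: two bottleneck adapters, one standing in for the target-language adapter and one for the reranking (task) adapter, are stacked on top of frozen MMT hidden states. This plain-PyTorch toy is an assumption-laden illustration of the idea, not the adapter or SFTM implementation used in the paper.

```python
# Conceptual sketch of stacking a language adapter and a reranking (task) adapter
# on top of frozen multilingual transformer states (illustrative only).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck inserted after a transformer sub-layer."""
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(torch.relu(self.down(hidden)))   # residual connection

# At inference, compose the target-language adapter (trained via MLM) with the
# reranking adapter (trained on English relevance data); the MMT body stays frozen.
lang_adapter = BottleneckAdapter()    # e.g. trained on Swahili text
task_adapter = BottleneckAdapter()    # trained on English relevance judgments

hidden_states = torch.randn(2, 128, 768)              # frozen MMT layer output (toy)
reranker_states = task_adapter(lang_adapter(hidden_states))
```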

On Cross-Lingual Retrieval with Multilingual Text Encoders

Dec 21, 2021
Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, Goran Glavaš

In this work we present a systematic empirical study focused on the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a number of diverse language pairs. We first treat these models as multilingual text encoders and benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR -- a setup with no relevance judgments for IR-specific fine-tuning -- pretrained multilingual encoders on average fail to significantly outperform earlier models based on cross-lingual word embeddings (CLWEs). For sentence-level retrieval, we do obtain state-of-the-art performance; the peak scores, however, are achieved by multilingual encoders that have been further specialized, in a supervised fashion, for sentence understanding tasks, rather than by their vanilla 'off-the-shelf' variants. Following these results, we introduce localized relevance matching for document-level CLIR, where we independently score a query against document sections. In the second part, we evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., learning to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments. Our results show that supervised re-ranking rarely improves over multilingual transformers used as unsupervised base rankers. Finally, only with in-domain contrastive fine-tuning (i.e., same domain, only language transfer) do we manage to improve ranking quality. We uncover substantial empirical differences between cross-lingual retrieval results and results of (zero-shot) cross-lingual transfer for monolingual retrieval in target languages, which point to "monolingual overfitting" of retrieval models trained on monolingual data.

* To appear in the IRJ ECIR 2021 Special Issue. arXiv admin note: substantial text overlap with arXiv:2101.08370 
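
As a small illustration of localized relevance matching, the sketch below scores a query embedding against each document section independently and keeps the best-matching section's score as the document score. The sectioning, the encoder, and the max aggregation are illustrative assumptions.

```python
# Sketch of localized relevance matching: best section similarity as document score.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def localized_relevance(query_emb: np.ndarray, section_embs: list) -> float:
    """Document score = max over its sections' similarities to the query."""
    return max(cosine(query_emb, s) for s in section_embs)

rng = np.random.default_rng(0)
query = rng.normal(size=768)                              # multilingual encoder query embedding
doc_sections = [rng.normal(size=768) for _ in range(5)]   # one embedding per document section
print(localized_relevance(query, doc_sections))
```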

Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval

Jan 21, 2021
Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, Goran Glavaš

Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as a go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete. However, questions remain as to what extent this finding generalizes 1) to unsupervised settings and 2) to ad-hoc cross-lingual IR (CLIR) tasks. Therefore, in this work we present a systematic empirical study focused on the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a large number of language pairs. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR -- a setup with no relevance judgments for IR-specific fine-tuning -- pretrained encoders fail to significantly outperform models based on CLWEs. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, the peak performance is not reached using the general-purpose multilingual text encoders 'off-the-shelf', but rather by relying on their variants that have been further specialized for sentence understanding tasks.

* Accepted at ECIR'21 (preprint) 
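
A minimal sketch of the unsupervised sentence-level CLIR setup evaluated above: documents are ranked by cosine similarity between query and document embeddings from an off-the-shelf multilingual sentence encoder. The model choice (LaBSE via sentence-transformers) and the toy data are assumptions for illustration, not the paper's exact evaluation setup.

```python
# Unsupervised CLIR sketch with an off-the-shelf multilingual sentence encoder.
# Encoder choice and data are illustrative assumptions.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("sentence-transformers/LaBSE")

query = "Auswirkungen des Klimawandels auf die Landwirtschaft"          # German query
docs = [
    "Climate change is reducing crop yields in several regions.",       # relevant (English)
    "The history of the printing press in early modern Europe.",
]
q = encoder.encode(query, normalize_embeddings=True)
d = encoder.encode(docs, normalize_embeddings=True)
scores = d @ q                          # cosine similarity via dot product of unit vectors
ranking = np.argsort(-scores)           # rank documents by decreasing similarity
print(docs[ranking[0]])
```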