Xiaoyu Shen

SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects

Sep 14, 2023
David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O. Alabi, Yanke Mao, Haonan Gao, Annie En-Shiun Lee

Despite the progress recorded in multilingual natural language processing over the last few years, evaluation is typically limited to a small set of languages with available datasets, which excludes a large number of low-resource languages. In this paper, we created SIB-200 -- a large-scale, open-sourced benchmark dataset for topic classification in 200 languages and dialects -- to address the lack of evaluation datasets for Natural Language Understanding (NLU). For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for NLU. The dataset is based on the Flores-200 machine translation corpus. We annotated the English portion of the dataset and extended the sentence-level annotation to the remaining 203 languages covered in the corpus. Despite the simplicity of this task, our evaluation in the fully supervised setting, the cross-lingual transfer setting, and the large-language-model prompting setting shows that there is still a large gap between the performance of high-resource and low-resource languages when multilingual evaluation is scaled to numerous world languages. We found that languages unseen during the pre-training of multilingual language models, under-represented language families (like Nilotic and Atlantic-Congo), and languages from Africa, the Americas, Oceania and South East Asia often have the lowest performance on our topic classification dataset. We hope our dataset will encourage a more inclusive evaluation of multilingual language models on a more diverse set of languages. https://github.com/dadelani/sib-200
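
Below is a minimal sketch of how one might fine-tune a multilingual encoder such as XLM-R on a single SIB-200 language for topic classification. The file layout and column names are assumptions rather than the official release format; see https://github.com/dadelani/sib-200 for the actual data.

```python
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# File paths and column names ("text", "category") are assumptions;
# check the official repository for the released format.
train_df = pd.read_csv("sib-200/eng_Latn/train.tsv", sep="\t")
dev_df = pd.read_csv("sib-200/eng_Latn/dev.tsv", sep="\t")

labels = sorted(train_df["category"].unique())
label2id = {label: i for i, label in enumerate(labels)}

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels))

def encode(df):
    # Tokenize the sentences and map topic names to integer label ids.
    ds = Dataset.from_pandas(df)
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=128), batched=True)
    return ds.map(lambda b: {"labels": [label2id[c] for c in b["category"]]}, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments("sib200-xlmr", num_train_epochs=5,
                           per_device_train_batch_size=32),
    train_dataset=encode(train_df),
    eval_dataset=encode(dev_df),
)
trainer.train()
print(trainer.evaluate())
```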

* under submission 

Weaker Than You Think: A Critical Look at Weakly Supervised Learning

May 27, 2023
Dawei Zhu, Xiaoyu Shen, Marius Mosbach, Andreas Stephan, Dietrich Klakow

Weakly supervised learning is a popular approach for training machine learning models in low-resource settings. Instead of requesting high-quality yet costly human annotations, it allows training models with noisy annotations obtained from various weak sources. Recently, many sophisticated approaches have been proposed for robust training under label noise, reporting impressive results. In this paper, we revisit the setup of these approaches and find that their benefits are significantly overestimated. Specifically, we find that the success of existing weakly supervised learning approaches heavily relies on the availability of clean validation samples which, as we show, can be leveraged much more efficiently by simply training on them. After using these clean labels in training, the advantages of the sophisticated approaches are mostly wiped out. This remains true even when the available clean data are reduced to just five samples per class, making these approaches impractical. To understand the true value of weakly supervised learning, we thoroughly analyse diverse NLP datasets and tasks to ascertain when and why weakly supervised approaches work, and provide recommendations for future research.
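
The paper's central comparison can be sketched as follows: a model first trained on weakly labeled data is simply fine-tuned further on the small clean set that noise-robust methods normally reserve for validation. The helper `finetune` and the dataset objects `weak_ds` / `clean_ds` are illustrative assumptions, not the authors' code.

```python
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

def finetune(model, train_ds, tag, epochs=3):
    """Plain fine-tuning, no noise-robust tricks."""
    args = TrainingArguments(f"out-{tag}", num_train_epochs=epochs,
                             per_device_train_batch_size=32)
    Trainer(model=model, args=args, train_dataset=train_ds).train()
    return model

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# weak_ds: large corpus with noisy rule-based labels; clean_ds: a handful of
# gold samples per class. Both are assumed to be tokenized `datasets.Dataset`s.
model = finetune(model, weak_ds, tag="weak")          # standard WSL stage
model = finetune(model, clean_ds, tag="weak+clean")   # continued training on the clean data
```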

* ACL 2023 

Is Translation Helpful? An Empirical Analysis of Cross-Lingual Transfer in Low-Resource Dialog Generation

May 21, 2023
Lei Shen, Shuai Yu, Xiaoyu Shen

Cross-lingual transfer is important for developing high-quality chatbots in multiple languages due to the strongly imbalanced distribution of language resources. A typical approach is to leverage off-the-shelf machine translation (MT) systems to utilize either the training corpora or the developed models from high-resource languages. In this work, we investigate whether it is helpful to utilize MT at all in this task. To do so, we simulate a low-resource scenario assuming access to limited Chinese dialog data in the movie domain and large amounts of English dialog data from multiple domains. Experiments show that leveraging English dialog corpora can indeed improve the naturalness, relevance and cross-domain transferability of generation in Chinese. Surprisingly, however, directly using the English dialog corpora in their original form works better than using their translated versions. As the topics and wording habits in daily conversations are strongly culture-dependent, MT can reinforce the bias from high-resource languages, yielding unnatural generations in the target language. Considering the cost of translating large amounts of text and the strong effects of translation quality, we suggest that future research focus on utilizing the original English data for cross-lingual transfer in dialog generation. We perform extensive human evaluations and ablation studies. The analysis results, together with the collected dataset, are presented to draw attention towards this area and benefit future research.
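
A rough sketch of the setting the paper recommends: mix the original English dialog data with the small Chinese movie-domain corpus when fine-tuning a multilingual generator, instead of machine-translating the English corpus first. The file names, field names, and the mT5 checkpoint are illustrative assumptions.

```python
from datasets import load_dataset, concatenate_datasets
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def preprocess(batch):
    enc = tok(batch["context"], truncation=True, max_length=256)
    enc["labels"] = tok(text_target=batch["response"],
                        truncation=True, max_length=128)["input_ids"]
    return enc

# JSON-lines files with "context"/"response" fields are an assumed format.
en = load_dataset("json", data_files="english_dialogs.jsonl")["train"]
zh = load_dataset("json", data_files="chinese_movie_dialogs.jsonl")["train"]
mixed = concatenate_datasets([en, zh]).shuffle(seed=42).map(preprocess, batched=True)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments("dialog-mixed", num_train_epochs=3,
                                  per_device_train_batch_size=16),
    train_dataset=mixed,
)
trainer.train()
```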

xPQA: Cross-Lingual Product Question Answering across 12 Languages

May 16, 2023
Xiaoyu Shen, Akari Asai, Bill Byrne, Adrià de Gispert

Product Question Answering (PQA) systems are key in e-commerce applications, providing responses to customers' questions as they shop for products. While existing work on PQA focuses mainly on English, in practice there is a need to support multiple customer languages while leveraging product information available in English. To study this practical industrial task, we present xPQA, a large-scale annotated cross-lingual PQA dataset in 12 languages across 9 language branches, and report results on (1) candidate ranking, selecting the best English candidate containing the information to answer a non-English question; and (2) answer generation, generating a natural-sounding non-English answer based on the selected English candidate. We evaluate various approaches involving machine translation at runtime or offline, leveraging multilingual pre-trained LMs, and including or excluding xPQA training data. We find that (1) in-domain data is essential, as cross-lingual rankers trained on other domains perform poorly on the PQA task; (2) candidate ranking often prefers runtime-translation approaches while answer generation prefers multilingual approaches; and (3) translating offline to augment multilingual models helps candidate ranking mainly on languages with non-Latin scripts, and helps answer generation mainly on languages with Latin scripts. Still, there remains a significant performance gap between the English and the cross-lingual test sets.
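
A minimal sketch of the candidate-ranking step: score English candidate passages against a non-English question with a multilingual cross-encoder. The checkpoint named here is an illustrative public model, not the one trained in the paper, and the question/candidates are toy placeholders.

```python
from sentence_transformers import CrossEncoder

# Illustrative multilingual reranker checkpoint (an assumption, not the paper's model).
ranker = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

question = "¿Esta cafetera funciona con 220 voltios?"          # Spanish question
candidates = [
    "This coffee maker supports dual voltage, 110-220V.",      # English candidates
    "The carafe holds up to 12 cups of coffee.",
    "Includes a reusable filter and a measuring spoon.",
]

# Score each (question, candidate) pair and keep the best English candidate.
scores = ranker.predict([(question, c) for c in candidates])
best = candidates[int(scores.argmax())]
print(best)
```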

* ACL 2023 industry track. Dataset available in https://github.com/amazon-science/contextual-product-qa 

MDIA: A Benchmark for Multilingual Dialogue Generation in 46 Languages

Aug 27, 2022
Qingyu Zhang, Xiaoyu Shen, Ernie Chang, Jidong Ge, Pengke Chen

Owing to the lack of corpora for low-resource languages, current work on dialogue generation has mainly focused on English. In this paper, we present mDIA, the first large-scale multilingual benchmark for dialogue generation across low- to high-resource languages. It covers real-life conversations in 46 languages across 19 language families. We present baseline results obtained by fine-tuning the multilingual, non-dialogue-focused pre-trained model mT5 as well as the English-centric, dialogue-focused pre-trained chatbot DialoGPT. The results show that mT5-based models perform better on sacreBLEU and BERTScore but worse on diversity. Even though promising results are found in few-shot and zero-shot scenarios, there is a large gap between the generation quality in English and in other languages. We hope that the release of mDIA will encourage more work on multilingual dialogue generation and promote language diversity.
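
A small sketch of the automatic evaluation mentioned above, scoring generated responses with sacreBLEU and BERTScore; the hypothesis and reference strings are toy placeholders.

```python
import sacrebleu
from bert_score import score

hyps = ["Das klingt nach einem tollen Plan!"]        # model outputs
refs = ["Das klingt nach einem großartigen Plan!"]   # gold responses

# Corpus-level BLEU (sacreBLEU expects a list of reference streams).
bleu = sacrebleu.corpus_bleu(hyps, [refs])
# BERTScore with a language tag matching the evaluated language.
P, R, F1 = score(hyps, refs, lang="de")

print(f"sacreBLEU: {bleu.score:.2f}  BERTScore-F1: {F1.mean().item():.3f}")
```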

* The dataset and processing scripts are available in https://github.com/DoctorDream/mDIA 

Low-Resource Dense Retrieval for Open-Domain Question Answering: A Comprehensive Survey

Aug 05, 2022
Xiaoyu Shen, Svitlana Vakulenko, Marco del Tredici, Gianni Barlacchi, Bill Byrne, Adrià de Gispert

Dense retrieval (DR) approaches based on powerful pre-trained language models (PLMs) have achieved significant advances and become a key component of modern open-domain question-answering systems. However, they require large amounts of manual annotations to perform competitively, which is infeasible to scale. To address this, a growing body of research has recently focused on improving DR performance under low-resource scenarios. These works differ in what resources they require for training and employ a diverse set of techniques. Understanding such differences is crucial for choosing the right technique under a specific low-resource scenario. To facilitate this understanding, we provide a thorough structured overview of mainstream techniques for low-resource DR. Based on their required resources, we divide the techniques into three main categories: (1) only documents are needed; (2) documents and questions are needed; and (3) documents and question-answer pairs are needed. For every technique, we introduce its general-form algorithm, highlight the open issues, and discuss its pros and cons. Promising directions are outlined for future research.
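
For readers new to the area, here is a minimal sketch of the dense-retrieval setup the survey targets: a dual encoder embeds questions and documents into the same vector space and ranks documents by similarity. The checkpoint name and toy corpus are illustrative choices.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative off-the-shelf dual encoder trained for QA retrieval.
encoder = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")

docs = [
    "Dense retrieval encodes queries and passages into a shared vector space.",
    "BM25 is a sparse lexical retrieval baseline.",
    "Fine-tuning dense retrievers usually needs question-passage annotations.",
]
doc_embs = encoder.encode(docs, convert_to_tensor=True)
query_emb = encoder.encode("How does dense retrieval represent passages?",
                           convert_to_tensor=True)

# Rank documents by embedding similarity to the query.
hits = util.semantic_search(query_emb, doc_embs, top_k=2)[0]
for h in hits:
    print(round(h["score"], 3), docs[h["corpus_id"]])
```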

Meta Self-Refinement for Robust Learning with Weak Supervision

May 15, 2022
Dawei Zhu, Xiaoyu Shen, Michael A. Hedderich, Dietrich Klakow

Training deep neural networks (DNNs) with weak supervision has been a hot topic as it can significantly reduce the annotation cost. However, labels from weak supervision can be rather noisy, and the high capacity of DNNs makes them prone to overfitting these noisy labels. Recent methods leverage self-training techniques to train noise-robust models, where a teacher trained on noisy labels is used to teach a student. However, the teacher in such models might fit a substantial amount of noise and produce wrong pseudo-labels with high confidence, leading to error propagation. In this work, we propose Meta Self-Refinement (MSR), a noise-resistant learning framework, to effectively combat noisy labels from weak supervision sources. Instead of relying purely on a fixed teacher trained on noisy labels, we keep updating the teacher to refine its pseudo-labels. At each training step, it performs a meta gradient descent step on the current mini-batch to maximize the student's performance on a clean validation set. Extensive experimentation on eight NLP benchmarks demonstrates that MSR is robust against noise in all settings and outperforms state-of-the-art methods by up to 11.4% in accuracy and 9.26% in F1 score.
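
A heavily simplified sketch of one MSR-style update, under the assumption that teacher and student are plain PyTorch classifiers: the teacher's pseudo-labels drive a virtual student step, the virtually updated student is evaluated on clean validation data, and the resulting meta gradient updates the teacher. Optimizers, confidence handling, and scheduling used in the actual method are omitted.

```python
import torch
import torch.nn.functional as F

def msr_step(teacher, student, x_noisy, x_val, y_val,
             lr_student=1e-3, lr_teacher=1e-3):
    # 1. Teacher produces soft pseudo-labels for the noisy mini-batch
    #    (kept in the graph so the meta gradient can flow back to it).
    pseudo = F.softmax(teacher(x_noisy), dim=-1)

    # 2. Virtual student update: one SGD step on the pseudo-labels,
    #    with create_graph=True so the step stays differentiable.
    student_loss = F.cross_entropy(student(x_noisy), pseudo)
    grads = torch.autograd.grad(student_loss, tuple(student.parameters()),
                                create_graph=True)
    fast_weights = {name: p - lr_student * g
                    for (name, p), g in zip(student.named_parameters(), grads)}

    # 3. Evaluate the virtually updated student on the clean validation batch
    #    (torch.func.functional_call requires PyTorch 2.x).
    val_logits = torch.func.functional_call(student, fast_weights, (x_val,))
    val_loss = F.cross_entropy(val_logits, y_val)

    # 4. Meta gradient: move the teacher so that its pseudo-labels would have
    #    lowered the student's validation loss.
    teacher_grads = torch.autograd.grad(val_loss, tuple(teacher.parameters()))
    with torch.no_grad():
        for p, g in zip(teacher.parameters(), teacher_grads):
            p -= lr_teacher * g

    # 5. The real student update on the refined pseudo-labels would follow here.
    return val_loss.item()
```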

A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation

May 04, 2022
David Ifeoluwa Adelani, Jesujoba Oluwadara Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Chinenye Emezue, Colin Leong, Michael Beukman, Shamsuddeen Hassan Muhammad, Guyo Dub Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ayoade Ajibade, Tunde Oluwaseyi Ajayi, Yvonne Wambui Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata Ouoba Kabore, Godson Koffi Kalipe, Derguene Mbaye, Allahsera Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, Sam Manthalu

Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out of these datasets. This is primarily because many widely spoken languages are not well represented on the web and are therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a new African news corpus covering 16 languages, of which eight are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.
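
A rough sketch of the strategy the paper finds most effective: fine-tune a large pre-trained translation model on a few thousand high-quality in-domain sentence pairs. The M2M-100 checkpoint, the English-to-Hausa direction, and the file format are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (M2M100ForConditionalGeneration, M2M100Tokenizer,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tok = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tok.src_lang, tok.tgt_lang = "en", "ha"   # e.g. English -> Hausa

def preprocess(batch):
    enc = tok(batch["en"], truncation=True, max_length=128)
    enc["labels"] = tok(text_target=batch["ha"],
                        truncation=True, max_length=128)["input_ids"]
    return enc

# A JSON-lines file of news-domain sentence pairs with "en"/"ha" fields is an assumed format.
data = load_dataset("json", data_files="en_ha_news.jsonl")["train"].map(preprocess, batched=True)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments("m2m100-en-ha", num_train_epochs=10,
                                  per_device_train_batch_size=16),
    train_dataset=data,
)
trainer.train()
```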

* Accepted to NAACL 2022 

A Survey on Legal Judgment Prediction: Datasets, Metrics, Models and Challenges

Apr 11, 2022
Junyun Cui, Xiaoyu Shen, Feiping Nie, Zheng Wang, Jinglong Wang, Yulong Chen

Legal judgment prediction (LJP) applies Natural Language Processing (NLP) techniques to automatically predict judgment results from fact descriptions. Recently, large-scale public datasets and advances in NLP research have led to increasing interest in LJP. Despite a clear gap between machine and human performance, impressive results have been achieved on various benchmark datasets. In this paper, to address the current lack of a comprehensive survey of existing LJP tasks, datasets, models and evaluations, (1) we analyze 31 LJP datasets in 6 languages, present their construction process and define a classification method for LJP with 3 different attributes; (2) we summarize 14 evaluation metrics under four categories for different outputs of LJP tasks; (3) we review 12 legal-domain pretrained models in 3 languages and highlight 3 major research directions for LJP; and (4) we show the state-of-the-art results for 8 representative datasets from different court cases and discuss the open challenges. This paper provides an up-to-date and comprehensive review to help readers understand the status of LJP, and we hope to facilitate further joint efforts between NLP researchers and legal professionals on this problem.

* 25 pages, 6 figures and 12 tables 

From Rewriting to Remembering: Common Ground for Conversational QA Models

Apr 08, 2022
Marco Del Tredici, Xiaoyu Shen, Gianni Barlacchi, Bill Byrne, Adrià de Gispert

In conversational QA, models have to leverage information from previous turns to answer upcoming questions. Current approaches, such as Question Rewriting, struggle to extract relevant information as the conversation unfolds. We introduce the Common Ground (CG), an approach to accumulate conversational information as it emerges and select the relevant information at every turn. We show that CG offers a more efficient and human-like way to exploit conversational information compared to existing approaches, leading to improvements on Open Domain Conversational QA.
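
The abstract gives only a high-level description, so the following is a heavily hedged schematic of the idea: accumulate pieces of conversational information as they appear and select the most relevant ones at each turn. The embedding-similarity selector here is only an illustrative stand-in for the paper's learned selection model.

```python
from sentence_transformers import SentenceTransformer, util

class CommonGround:
    def __init__(self):
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder
        self.facts, self.embs = [], []

    def add(self, text):
        """Accumulate a new piece of conversational information."""
        self.facts.append(text)
        self.embs.append(self.encoder.encode(text, convert_to_tensor=True))

    def select(self, question, k=3):
        """Pick the stored facts most relevant to the current question."""
        q = self.encoder.encode(question, convert_to_tensor=True)
        scores = [float(util.cos_sim(q, e)) for e in self.embs]
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        return [self.facts[i] for i in top]
```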

* Accepted at NLP for ConvAI 