Alert button
Picture for Genta Indra Winata

Genta Indra Winata

Alert button

IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages

Nov 21, 2023
Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata, Pascale Fung, Ayu Purwarianti

Significant progress has been made on Indonesian NLP. Nevertheless, exploration of the code-mixing phenomenon in Indonesian is limited, despite many languages being frequently mixed with Indonesian in daily conversation. In this work, we explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay; and introduce IndoRobusta, a framework to evaluate and improve the code-mixing robustness. Our analysis shows that the pre-training corpus bias affects the model's ability to better handle Indonesian-English code-mixing when compared to other local languages, despite having higher language diversity.

Viaarxiv icon

IndoToD: A Multi-Domain Indonesian Benchmark For End-to-End Task-Oriented Dialogue Systems

Nov 02, 2023
Muhammad Dehan Al Kautsar, Rahmah Khoirussyifa' Nurdini, Samuel Cahyawijaya, Genta Indra Winata, Ayu Purwarianti

Task-oriented dialogue (ToD) systems have been mostly created for high-resource languages, such as English and Chinese. However, there is a need to develop ToD systems for other regional or local languages to broaden their ability to comprehend the dialogue contexts in various languages. This paper introduces IndoToD, an end-to-end multi domain ToD benchmark in Indonesian. We extend two English ToD datasets to Indonesian, comprising four different domains by delexicalization to efficiently reduce the size of annotations. To ensure a high-quality data collection, we hire native speakers to manually translate the dialogues. Along with the original English datasets, these new Indonesian datasets serve as an effective benchmark for evaluating Indonesian and English ToD systems as well as exploring the potential benefits of cross-lingual and bilingual transfer learning approaches.

* 2023 1st Workshop in South East Asian Language Processing (SEALP), Co-located with AACL 2023 
Viaarxiv icon

NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages

Sep 20, 2023
Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Maulana Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, Hanung Wahyuning Linuwih, Bryan Wilie, Galih Pradipta Muridan, Genta Indra Winata, David Moeljadi, Alham Fikri Aji, Ayu Purwarianti, Pascale Fung

Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we have identified limitations in the resulting corpora, including a lack of lexical diversity and cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the \datasetname{} benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our empirical experiment results using existing multilingual large language models conclude the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes.

Viaarxiv icon

Multilingual Few-Shot Learning via Language Model Retrieval

Jun 19, 2023
Genta Indra Winata, Liang-Kang Huang, Soumya Vadlamannati, Yash Chandarana

Figure 1 for Multilingual Few-Shot Learning via Language Model Retrieval
Figure 2 for Multilingual Few-Shot Learning via Language Model Retrieval
Figure 3 for Multilingual Few-Shot Learning via Language Model Retrieval
Figure 4 for Multilingual Few-Shot Learning via Language Model Retrieval

Transformer-based language models have achieved remarkable success in few-shot in-context learning and drawn a lot of research interest. However, these models' performance greatly depends on the choice of the example prompts and also has high variability depending on how samples are chosen. In this paper, we conduct a comprehensive study of retrieving semantically similar few-shot samples and using them as the context, as it helps the model decide the correct label without any gradient update in the multilingual and cross-lingual settings. We evaluate the proposed method on five natural language understanding datasets related to intent detection, question classification, sentiment analysis, and topic classification. The proposed method consistently outperforms random sampling in monolingual and cross-lingual tasks in non-English languages.

* 9 pages 
Viaarxiv icon

On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research

Jun 05, 2023
Made Nindyatama Nityasya, Haryo Akbarianto Wibowo, Alham Fikri Aji, Genta Indra Winata, Radityo Eko Prasojo, Phil Blunsom, Adhiguna Kuncoro

Figure 1 for On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research
Figure 2 for On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research
Figure 3 for On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research
Figure 4 for On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research

This evidence-based position paper critiques current research practices within the language model pre-training literature. Despite rapid recent progress afforded by increasingly better pre-trained language models (PLMs), current PLM research practices often conflate different possible sources of model improvement, without conducting proper ablation studies and principled comparisons between different models under comparable conditions. These practices (i) leave us ill-equipped to understand which pre-training approaches should be used under what circumstances; (ii) impede reproducibility and credit assignment; and (iii) render it difficult to understand: "How exactly does each factor contribute to the progress that we have today?" We provide a case in point by revisiting the success of BERT over its baselines, ELMo and GPT-1, and demonstrate how -- under comparable conditions where the baselines are tuned to a similar extent -- these baselines (and even-simpler variants thereof) can, in fact, achieve competitive or better performance than BERT. These findings demonstrate how disentangling different factors of model improvements can lead to valuable new insights. We conclude with recommendations for how to encourage and incentivize this line of work, and accelerate progress towards a better and more systematic understanding of what factors drive the progress of our foundation models today.

* Accepted at ACL 2023 
Viaarxiv icon

Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning

May 25, 2023
Genta Indra Winata, Lingjue Xie, Karthik Radhakrishnan, Shijie Wu, Xisen Jin, Pengxiang Cheng, Mayank Kulkarni, Daniel Preotiuc-Pietro

Figure 1 for Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning
Figure 2 for Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning
Figure 3 for Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning
Figure 4 for Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning

Real-life multilingual systems should be able to efficiently incorporate new languages as data distributions fed to the system evolve and shift over time. To do this, systems need to handle the issue of catastrophic forgetting, where the model performance drops for languages or tasks seen further in its past. In this paper, we study catastrophic forgetting, as well as methods to minimize this, in a massively multilingual continual learning framework involving up to 51 languages and covering both classification and sequence labeling tasks. We present LR ADJUST, a learning rate scheduling method that is simple, yet effective in preserving new information without strongly overwriting past knowledge. Furthermore, we show that this method is effective across multiple continual learning approaches. Finally, we provide further insights into the dynamics of catastrophic forgetting in this massively multilingual setup.

* ACL 2023 Findings 
Viaarxiv icon

Multi-lingual and Multi-cultural Figurative Language Understanding

May 25, 2023
Anubha Kabra, Emmy Liu, Simran Khanuja, Alham Fikri Aji, Genta Indra Winata, Samuel Cahyawijaya, Anuoluwapo Aremu, Perez Ogayo, Graham Neubig

Figure 1 for Multi-lingual and Multi-cultural Figurative Language Understanding
Figure 2 for Multi-lingual and Multi-cultural Figurative Language Understanding
Figure 3 for Multi-lingual and Multi-cultural Figurative Language Understanding
Figure 4 for Multi-lingual and Multi-cultural Figurative Language Understanding

Figurative language permeates human communication, but at the same time is relatively understudied in NLP. Datasets have been created in English to accelerate progress towards measuring and improving figurative language processing in language models (LMs). However, the use of figurative language is an expression of our cultural and societal experiences, making it difficult for these phrases to be universally applicable. In this work, we create a figurative language inference dataset, \datasetname, for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba. Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region. We assess multilingual LMs' abilities to interpret figurative language in zero-shot and few-shot settings. All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data, emphasizing the need for LMs to be exposed to a broader range of linguistic and cultural variation during training.

* ACL 2023 Findings 
Viaarxiv icon

GlobalBench: A Benchmark for Global Progress in Natural Language Processing

May 24, 2023
Yueqi Song, Catherine Cui, Simran Khanuja, Pengfei Liu, Fahim Faisal, Alissa Ostapenko, Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Yulia Tsvetkov, Antonios Anastasopoulos, Graham Neubig

Figure 1 for GlobalBench: A Benchmark for Global Progress in Natural Language Processing
Figure 2 for GlobalBench: A Benchmark for Global Progress in Natural Language Processing
Figure 3 for GlobalBench: A Benchmark for Global Progress in Natural Language Processing
Figure 4 for GlobalBench: A Benchmark for Global Progress in Natural Language Processing

Despite the major advances in NLP, significant disparities in NLP system performance across languages still exist. Arguably, these are due to uneven resource allocation and sub-optimal incentives to work on less resourced languages. To track and further incentivize the global development of equitable language technology, we introduce GlobalBench. Prior multilingual benchmarks are static and have focused on a limited number of tasks and languages. In contrast, GlobalBench is an ever-expanding collection that aims to dynamically track progress on all NLP datasets in all languages. Rather than solely measuring accuracy, GlobalBench also tracks the estimated per-speaker utility and equity of technology across all languages, providing a multi-faceted view of how language technology is serving people of the world. Furthermore, GlobalBench is designed to identify the most under-served languages, and rewards research efforts directed towards those languages. At present, the most under-served languages are the ones with a relatively high population, but nonetheless overlooked by composite multilingual benchmarks (like Punjabi, Portuguese, and Wu Chinese). Currently, GlobalBench covers 966 datasets in 190 languages, and has 1,128 system submissions spanning 62 languages.

* Preprint, 9 pages 
Viaarxiv icon

Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

Mar 30, 2023
Zheng-Xin Yong, Ruochen Zhang, Jessica Zosa Forde, Skyler Wang, Samuel Cahyawijaya, Holy Lovenia, Genta Indra Winata, Lintang Sutawika, Jan Christian Blaise Cruz, Long Phan, Yin Lin Tan, Alham Fikri Aji

Figure 1 for Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
Figure 2 for Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
Figure 3 for Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
Figure 4 for Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The proliferation of Large Language Models (LLMs) in recent times compels one to ask: can these systems be used for data generation? In this article, we explore prompting multilingual LLMs in a zero-shot manner to create code-mixed data for five languages in South East Asia (SEA) -- Indonesian, Malay, Chinese, Tagalog, Vietnamese, as well as the creole language Singlish. We find that ChatGPT shows the most potential, capable of producing code-mixed text 68% of the time when the term "code-mixing" is explicitly defined. Moreover, both ChatGPT's and InstructGPT's (davinci-003) performances in generating Singlish texts are noteworthy, averaging a 96% success rate across a variety of prompts. Their code-mixing proficiency, however, is dampened by word choice errors that lead to semantic inaccuracies. Other multilingual models such as BLOOMZ and Flan-T5-XXL are unable to produce code-mixed texts altogether. By highlighting the limited promises of LLMs in a specific form of low-resource data generation, we call for a measured approach when applying similar techniques to other data-scarce NLP contexts.

Viaarxiv icon

NusaCrowd: Open Source Initiative for Indonesian NLP Resources

Dec 20, 2022
Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Fajri Koto, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Ivan Halim Parmonangan, Ika Alfina, Muhammad Satrio Wicaksono, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Akbar Septiandri, James Jaya, Kaustubh D. Dhole, Arie Ardiyanti Suryani, Rifki Afina Putri, Dan Su, Keith Stevens, Made Nindyatama Nityasya, Muhammad Farid Adilazuarda, Ryan Ignatius, Ryandito Diandaru, Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu, Dyah Damapuspita, Cuk Tho, Ichwanul Muslim Karo Karo, Tirana Noor Fatyanosa, Ziwei Ji, Pascale Fung, Graham Neubig, Timothy Baldwin, Sebastian Ruder, Herry Sujaini, Sakriani Sakti, Ayu Purwarianti

Figure 1 for NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Figure 2 for NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Figure 3 for NusaCrowd: Open Source Initiative for Indonesian NLP Resources
Figure 4 for NusaCrowd: Open Source Initiative for Indonesian NLP Resources

We present NusaCrowd, a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have has brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and its local languages. Our work is intended to help advance natural language processing research in under-represented languages.

Viaarxiv icon