Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tiezheng Yu

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Feb 28, 2023

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung(+3 more)

Figure 1 for A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Figure 2 for A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Figure 3 for A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Figure 4 for A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Abstract:This paper proposes a framework for quantitatively evaluating interactive LLMs such as ChatGPT using publicly available data sets. We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. We find that ChatGPT outperforms LLMs with zero-shot learning on most tasks and even outperforms fine-tuned models on some tasks. We find that it is better at understanding non-Latin script languages than generating them. It is able to generate multimodal content from textual prompts, via an intermediate code generation step. Moreover, we find that ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning, hence making it an unreliable reasoner. It is, for example, better at deductive than inductive reasoning. ChatGPT suffers from hallucination problems like other LLMs and it generates more extrinsic hallucinations from its parametric memory as it does not have access to an external knowledge base. Finally, the interactive feature of ChatGPT enables human collaboration with the underlying LLM to improve its performance, i.e, 8% ROUGE-1 on summarization and 2% ChrF++ on machine translation, in a multi-turn "prompt engineering" fashion. We also release codebase for evaluation set extraction.

* 52 pages

Via

Access Paper or Ask Questions

NusaCrowd: Open Source Initiative for Indonesian NLP Resources

Dec 20, 2022

Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Indra Winata, Bryan Wilie, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Fajri Koto(+37 more)

Abstract:We present NusaCrowd, a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have has brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages. Furthermore, NusaCrowd brings the creation of the first multilingual automatic speech recognition benchmark in Indonesian and its local languages. Our work is intended to help advance natural language processing research in under-represented languages.

Via

Access Paper or Ask Questions

RHO ($ρ$): Reducing Hallucination in Open-domain Dialogues with Knowledge Grounding

Dec 03, 2022

Ziwei Ji, Zihan Liu, Nayeon Lee, Tiezheng Yu, Bryan Wilie, Min Zeng, Pascale Fung

Figure 1 for RHO ($ρ$): Reducing Hallucination in Open-domain Dialogues with Knowledge Grounding

Figure 2 for RHO ($ρ$): Reducing Hallucination in Open-domain Dialogues with Knowledge Grounding

Figure 3 for RHO ($ρ$): Reducing Hallucination in Open-domain Dialogues with Knowledge Grounding

Figure 4 for RHO ($ρ$): Reducing Hallucination in Open-domain Dialogues with Knowledge Grounding

Abstract:Dialogue systems can leverage large pre-trained language models and knowledge to generate fluent and informative responses. However, these models are still prone to produce hallucinated responses not supported by the input source, which greatly hinders their application. The heterogeneity between external knowledge and dialogue context challenges representation learning and source integration, and further contributes to unfaithfulness. To handle this challenge and generate more faithful responses, this paper presents RHO ($\rho$) utilizing the representations of linked entities and relation predicates from a knowledge graph (KG). We propose (1) local knowledge grounding to combine textual embeddings with the corresponding KG embeddings; and (2) global knowledge grounding to equip RHO with multi-hop reasoning abilities via the attention mechanism. In addition, we devise a response re-ranking technique based on walks over KG sub-graphs for better conversational reasoning. Experimental results on OpenDialKG show that our approach significantly outperforms state-of-the-art methods on both automatic and human evaluation by a large margin, especially in hallucination reduction (17.54% in FeQA).

Via

Access Paper or Ask Questions

Casual Conversations v2: Designing a large consent-driven dataset to measure algorithmic bias and robustness

Nov 10, 2022

Caner Hazirbas, Yejin Bang, Tiezheng Yu, Parisa Assar, Bilal Porgali, Vítor Albiero, Stefan Hermanek, Jacqueline Pan, Emily McReynolds, Miranda Bogen(+2 more)

Figure 1 for Casual Conversations v2: Designing a large consent-driven dataset to measure algorithmic bias and robustness

Figure 2 for Casual Conversations v2: Designing a large consent-driven dataset to measure algorithmic bias and robustness

Figure 3 for Casual Conversations v2: Designing a large consent-driven dataset to measure algorithmic bias and robustness

Figure 4 for Casual Conversations v2: Designing a large consent-driven dataset to measure algorithmic bias and robustness

Abstract:Developing robust and fair AI systems require datasets with comprehensive set of labels that can help ensure the validity and legitimacy of relevant measurements. Recent efforts, therefore, focus on collecting person-related datasets that have carefully selected labels, including sensitive characteristics, and consent forms in place to use those attributes for model testing and development. Responsible data collection involves several stages, including but not limited to determining use-case scenarios, selecting categories (annotations) such that the data are fit for the purpose of measuring algorithmic bias for subgroups and most importantly ensure that the selected categories/subcategories are robust to regional diversities and inclusive of as many subgroups as possible. Meta, in a continuation of our efforts to measure AI algorithmic bias and robustness (https://ai.facebook.com/blog/shedding-light-on-fairness-in-ai-with-a-new-data-set), is working on collecting a large consent-driven dataset with a comprehensive list of categories. This paper describes our proposed design of such categories and subcategories for Casual Conversations v2.

Via

Access Paper or Ask Questions

Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values

Oct 14, 2022

Yejin Bang, Tiezheng Yu, Andrea Madotto, Zhaojiang Lin, Mona Diab, Pascale Fung

Figure 1 for Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values

Figure 2 for Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values

Figure 3 for Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values

Figure 4 for Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values

Abstract:Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values. Yet, human values can vary under diverse cultural conditions. Therefore, we introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command. Along with the task, we propose a practical approach that distills value-aligned knowledge from large-scale language models (LLMs) to construct value-aligned classifiers in two steps. First, we generate value-aligned training data from LLMs by prompt-based few-shot learning. Next, we fine-tune smaller classification models with the generated data for the task. Empirical results show that our VA-Models surpass multiple baselines by at least 15.56% on the F1-score, including few-shot learning with OPT-175B and existing text augmentation methods. We suggest that using classifiers with explicit human value input improves both inclusivity & explainability in AI.

Via

Access Paper or Ask Questions

Kaggle Competition: Cantonese Audio-Visual Speech Recognition for In-car Commands

Jul 06, 2022

Wenliang Dai, Samuel Cahyawijaya, Tiezheng Yu, Elham J Barezi, Pascale Fung

Figure 1 for Kaggle Competition: Cantonese Audio-Visual Speech Recognition for In-car Commands

Figure 2 for Kaggle Competition: Cantonese Audio-Visual Speech Recognition for In-car Commands

Figure 3 for Kaggle Competition: Cantonese Audio-Visual Speech Recognition for In-car Commands

Figure 4 for Kaggle Competition: Cantonese Audio-Visual Speech Recognition for In-car Commands

Abstract:With the rise of deep learning and intelligent vehicles, the smart assistant has become an essential in-car component to facilitate driving and provide extra functionalities. In-car smart assistants should be able to process general as well as car-related commands and perform corresponding actions, which eases driving and improves safety. However, in this research field, most datasets are in major languages, such as English and Chinese. There is a huge data scarcity issue for low-resource languages, hindering the development of research and applications for broader communities. Therefore, it is crucial to have more benchmarks to raise awareness and motivate the research in low-resource languages. To mitigate this problem, we collect a new dataset, namely Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR), for in-car speech recognition in the Cantonese language with video and audio data. Together with it, we propose Cantonese Audio-Visual Speech Recognition for In-car Commands as a new challenge for the community to tackle low-resource speech recognition under in-car scenarios.

Via

Access Paper or Ask Questions

AiSocrates: Towards Answering Ethical Quandary Questions

May 24, 2022

Yejin Bang, Nayeon Lee, Tiezheng Yu, Leila Khalatbari, Yan Xu, Dan Su, Elham J. Barezi, Andrea Madotto, Hayden Kee, Pascale Fung

Figure 1 for AiSocrates: Towards Answering Ethical Quandary Questions

Figure 2 for AiSocrates: Towards Answering Ethical Quandary Questions

Figure 3 for AiSocrates: Towards Answering Ethical Quandary Questions

Figure 4 for AiSocrates: Towards Answering Ethical Quandary Questions

Abstract:Considerable advancements have been made in various NLP tasks based on the impressive power of large pre-trained language models (LLMs). These results have inspired efforts to understand the limits of LLMs so as to evaluate how far we are from achieving human level general natural language understanding. In this work, we challenge the capability of LLMs with the new task of Ethical Quandary Generative Question Answering. Ethical quandary questions are more challenging to address because multiple conflicting answers may exist to a single quandary. We propose a system, AiSocrates, that provides an answer with a deliberative exchange of different perspectives to an ethical quandary, in the approach of Socratic philosophy, instead of providing a closed answer like an oracle. AiSocrates searches for different ethical principles applicable to the ethical quandary and generates an answer conditioned on the chosen principles through prompt-based few-shot learning. We also address safety concerns by providing a human controllability option in choosing ethical principles. We show that AiSocrates generates promising answers to ethical quandary questions with multiple perspectives, 6.92% more often than answers written by human philosophers by one measure, but the system still needs improvement to match the coherence of human philosophers fully. We argue that AiSocrates is a promising step toward developing an NLP system that incorporates human values explicitly by prompt instructions. We are releasing the code for research purposes.

* 11 pages

Via

Access Paper or Ask Questions

NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias

Apr 17, 2022

Nayeon Lee, Yejin Bang, Tiezheng Yu, Andrea Madotto, Pascale Fung

Figure 1 for NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias

Figure 2 for NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias

Figure 3 for NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias

Figure 4 for NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias

Abstract:Media framing bias can lead to increased political polarization, and thus, the need for automatic mitigation methods is growing. We propose a new task, a neutral summary generation from multiple news headlines of the varying political leanings to facilitate balanced and unbiased news reading. In this paper, we first collect a new dataset, obtain insights about framing bias through a case study, and propose a new effective metric and models for the task. Lastly, we conduct experimental analyses to provide insights about remaining challenges and future directions. One of the most interesting observations is that generation models can hallucinate not only factually inaccurate or unverifiable content, but also politically biased content.

* NAACL2022 Long Paper

Via

Access Paper or Ask Questions

SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study

Apr 14, 2022

Samuel Cahyawijaya, Tiezheng Yu, Zihan Liu, Tiffany T. W. Mak, Xiaopu Zhou, Nancy Y. Ip, Pascale Fung

Figure 1 for SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study

Figure 2 for SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study

Figure 3 for SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study

Figure 4 for SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study

Abstract:Self-supervised pre-training methods have brought remarkable breakthroughs in the understanding of text, image, and speech. Recent developments in genomics has also adopted these pre-training methods for genome understanding. However, they focus only on understanding haploid sequences, which hinders their applicability towards understanding genetic variations, also known as single nucleotide polymorphisms (SNPs), which is crucial for genome-wide association study. In this paper, we introduce SNP2Vec, a scalable self-supervised pre-training approach for understanding SNP. We apply SNP2Vec to perform long-sequence genomics modeling, and we evaluate the effectiveness of our approach on predicting Alzheimer's disease risk in a Chinese cohort. Our approach significantly outperforms existing polygenic risk score methods and all other baselines, including the model that is trained entirely with haploid sequences. We release our code and dataset on https://github.com/HLTCHKUST/snp2vec.

Via

Access Paper or Ask Questions

Survey of Hallucination in Natural Language Generation

Feb 08, 2022

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, Pascale Fung

Figure 1 for Survey of Hallucination in Natural Language Generation

Figure 2 for Survey of Hallucination in Natural Language Generation

Figure 3 for Survey of Hallucination in Natural Language Generation

Figure 4 for Survey of Hallucination in Natural Language Generation

Abstract:Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent natural language generation, naturally leading to development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also investigated that such generation includes hallucinated texts, which makes the performances of text generation fail to meet users' expectations in many real-world scenarios. In order to address this issue, studies in evaluation and mitigation methods of hallucinations have been presented in various tasks, but have not been reviewed in a combined manner. In this survey, we provide a broad overview of the research progress and challenges in the hallucination problem of NLG. The survey is organized into two big divisions: (i) a general overview of metrics, mitigation methods, and future directions; (ii) task-specific research progress for hallucinations in a large set of downstream tasks: abstractive summarization, dialogue generation, generative question answering, data-to-text generation, and machine translation. This survey could facilitate collaborative efforts among researchers in these tasks.

Via

Access Paper or Ask Questions