Leyang Cui

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Sep 03, 2023
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi

While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this paper, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research.
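
As a rough illustration of the three-way distinction drawn in the abstract, the following Python sketch encodes the categories with hypothetical examples; the class and example strings are ours, not from the survey.

```python
from enum import Enum

class HallucinationType(Enum):
    """The three hallucination categories described in the abstract."""
    INPUT_CONFLICTING = "diverges from the user input"
    CONTEXT_CONFLICTING = "contradicts previously generated context"
    FACT_CONFLICTING = "misaligns with established world knowledge"

# Hypothetical examples illustrating each category.
examples = {
    HallucinationType.INPUT_CONFLICTING:
        "User asks to summarize document A; the model summarizes document B.",
    HallucinationType.CONTEXT_CONFLICTING:
        "The model introduces 'Alice' early on, then calls the same person 'Bob'.",
    HallucinationType.FACT_CONFLICTING:
        "The model states that the Eiffel Tower is located in Berlin.",
}

for kind, example in examples.items():
    print(f"{kind.name}: {example}")
```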

* work in progress; 32 pages 

Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling

Jul 22, 2023
Longyue Wang, Zefeng Du, Donghuai Liu, Deng Cai, Dian Yu, Haiyun Jiang, Yan Wang, Leyang Cui, Shuming Shi, Zhaopeng Tu

Modeling discourse -- the linguistic phenomena that go beyond individual sentences -- is a fundamental yet challenging aspect of natural language processing (NLP). However, existing evaluation benchmarks primarily focus on intra-sentence properties and overlook critical discourse phenomena that cross sentences. To bridge the gap, we propose Disco-Bench, a benchmark that can evaluate cross-sentence discourse properties across a diverse set of NLP tasks, covering understanding, translation, and generation. Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena (e.g., cohesion and coherence) in Chinese and/or English. For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge. In total, we evaluate 20 general-purpose, in-domain, and commercial models based on Transformers, advanced pretraining architectures, and large language models (LLMs). Our results show (1) the challenge and necessity of our evaluation benchmark and (2) that fine-grained pretraining on literary document-level training data consistently improves the modeling of discourse information. We will release the datasets, pretrained models, and leaderboard, which we hope can significantly facilitate research in this field: https://github.com/longyuewangdcu/Disco-Bench.
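
The diagnostic test suite mentioned above checks whether models have learned discourse knowledge. As an illustration only, here is a toy contrastive check in Python; the lexical-overlap scorer and all names are hypothetical stand-ins, not Disco-Bench's actual protocol or API.

```python
import re

# A toy contrastive discourse diagnostic: a model should prefer the
# discourse-consistent continuation over a distractor. The lexical-overlap
# scorer below is a stand-in for a real model score.

def words(text):
    return re.findall(r"[a-z']+", text.lower())

def toy_score(context, continuation):
    # Stand-in for a model score: lexical cohesion with the context.
    context_words = set(words(context))
    cont = words(continuation)
    return sum(w in context_words for w in cont) / max(len(cont), 1)

def passes_diagnostic(context, coherent, distractor):
    return toy_score(context, coherent) > toy_score(context, distractor)

context = "The violinist raised her bow. The hall fell silent."
print(passes_diagnostic(
    context,
    "Then she lowered the bow and the hall erupted.",    # coherent
    "The quarterly earnings report was filed on time.",  # distractor
))  # -> True
```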

* Zhaopeng Tu is the corresponding author 

Automated Action Model Acquisition from Narrative Texts

Jul 17, 2023
Ruiqi Li, Leyang Cui, Songtuan Lin, Patrik Haslum

Action models, which take the form of precondition/effect axioms, facilitate causal and motivational connections between actions for AI agents. Action model acquisition has been identified as a bottleneck in the application of planning technology, especially within narrative planning. Acquiring action models from narrative texts in an automated way is essential, but challenging because of the inherent complexities of such texts. We present NaRuto, a system that extracts structured events from narrative text and subsequently generates planning-language-style action models based on predictions of commonsense event relations, as well as textual contradictions and similarities, in an unsupervised manner. Experimental results in classical narrative planning domains show that NaRuto can generate action models of significantly better quality than existing fully automated methods, and even on par with those of semi-automated methods.
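
To make the notion of precondition/effect axioms concrete, here is a minimal Python sketch of an action model; the predicates and the example action are illustrative, not NaRuto's actual output format.

```python
from dataclasses import dataclass, field

# A minimal rendering of precondition/effect axioms as a Python structure
# rather than actual planning-language output.

@dataclass
class ActionModel:
    name: str
    parameters: list[str]
    preconditions: set[str] = field(default_factory=set)
    effects_add: set[str] = field(default_factory=set)
    effects_del: set[str] = field(default_factory=set)

# Hypothetical action model for a narrative sentence such as
# "The knight unlocked the gate with the brass key."
unlock = ActionModel(
    name="unlock",
    parameters=["?agent", "?gate", "?key"],
    preconditions={"(has ?agent ?key)", "(locked ?gate)"},
    effects_add={"(unlocked ?gate)"},
    effects_del={"(locked ?gate)"},
)
print(unlock)
```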

* 10 pages, 3 figures 

Explicit Syntactic Guidance for Neural Text Generation

Jun 25, 2023
Yafu Li, Leyang Cui, Jianhao Yan, Yongjing Yin, Wei Bi, Shuming Shi, Yue Zhang

Most existing text generation models follow the sequence-to-sequence paradigm. Generative Grammar suggests that humans generate natural language texts by learning language grammar. We propose a syntax-guided generation schema, which generates the sequence guided by a constituency parse tree in a top-down direction. The decoding process can be decomposed into two parts: (1) predicting the infilling texts for each constituent in the lexicalized syntax context given the source sentence; (2) mapping and expanding each constituent to construct the next-level syntax context. Accordingly, we propose a structural beam search method to find possible syntax structures hierarchically. Experiments on paraphrase generation and machine translation show that the proposed method outperforms autoregressive baselines, while also demonstrating effectiveness in terms of interpretability, controllability, and diversity.
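
The two-part decoding process above can be pictured with a toy top-down expansion loop. The sketch below assumes a toy grammar with per-constituent candidate expansions and log-probabilities; it illustrates the shape of structural beam search, not the paper's actual model.

```python
import heapq

# Toy "model": each nonterminal maps to candidate (expansion, log_prob) pairs.
CANDIDATES = {
    "S":  [(["NP", "VP"], -0.1)],
    "NP": [(["the cat"], -0.2), (["a dog"], -0.5)],
    "VP": [(["sleeps"], -0.3), (["runs"], -0.7)],
}

def structural_beam_search(root="S", beam_size=2):
    # Each hypothesis: (score, frontier of unexpanded constituents, leaves).
    beam = [(0.0, [root], [])]
    while any(frontier for _, frontier, _ in beam):
        expanded = []
        for score, frontier, leaves in beam:
            if not frontier:  # finished hypothesis, carry it forward
                expanded.append((score, frontier, leaves))
                continue
            head, rest = frontier[0], frontier[1:]
            for expansion, logp in CANDIDATES[head]:
                new_frontier = [s for s in expansion if s in CANDIDATES] + rest
                new_leaves = leaves + [s for s in expansion if s not in CANDIDATES]
                expanded.append((score + logp, new_frontier, new_leaves))
        beam = heapq.nlargest(beam_size, expanded, key=lambda h: h[0])
    return [(" ".join(leaves), score) for score, _, leaves in beam]

print(structural_beam_search())  # [('the cat sleeps', -0.6), ('a dog sleeps', -0.9)]
```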

* ACL 2023 

Enhancing Grammatical Error Correction Systems with Explanations

May 25, 2023
Yuejiao Fei, Leyang Cui, Sen Yang, Wai Lam, Zhenzhong Lan, Shuming Shi

Grammatical error correction (GEC) systems improve written communication by detecting and correcting language mistakes. To help language learners better understand why a GEC system makes a certain correction, the causes of errors (evidence words) and the corresponding error types are two key factors. To enhance GEC systems with explanations, we introduce EXPECT, a large dataset annotated with evidence words and grammatical error types. We propose several baselines and analyses to understand this task. Furthermore, human evaluation verifies that our explainable GEC system's explanations can assist second-language learners in determining whether to accept a correction suggestion and in understanding the associated grammar rule.
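
For intuition, a record in a dataset of this kind might pair a correction with its evidence words and error type. The field names and the example below are hypothetical, not EXPECT's actual schema.

```python
# Hypothetical annotated record: a correction, the evidence words that
# explain it, and the grammatical error type.
example = {
    "source":         "He have been to Paris twice.",
    "correction":     "He has been to Paris twice.",
    "evidence_words": ["He"],  # the subject that makes "have" ungrammatical
    "error_type":     "subject-verb agreement",
}

print(f'{example["source"]!r} -> {example["correction"]!r} '
      f'({example["error_type"]}; evidence: {example["evidence_words"]})')
```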

* 9 pages, 7 figures, accepted to the main conference of ACL 2023 

Deepfake Text Detection in the Wild

May 22, 2023
Yafu Li, Qintong Li, Leyang Cui, Wei Bi, Longyue Wang, Linyi Yang, Shuming Shi, Yue Zhang

Recent advances in large language models have enabled them to reach a level of text generation comparable to that of humans. These models show powerful capabilities across a wide range of content, including news article writing, story generation, and scientific writing. Such capability further narrows the gap between human-authored and machine-generated texts, highlighting the importance of deepfake text detection to avoid potential risks such as fake news propagation and plagiarism. However, previous work is limited in that it evaluates detection methods on testbeds covering specific domains or certain language models. In practical scenarios, the detector faces texts from various domains or LLMs without knowing their sources. To this end, we build a wild testbed by gathering texts from various human writings and deepfake texts generated by different LLMs. Human annotators are only slightly better than random guessing at identifying machine-generated texts. Empirical results on automatic detection methods further showcase the challenges of deepfake text detection in a wild testbed. In addition, out-of-distribution data poses a greater challenge for a detector to be deployed in realistic application scenarios. We release our resources at https://github.com/yafuly/DeepfakeTextDetect.
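
As a rough sketch of the kind of supervised baseline such a testbed evaluates, the snippet below trains a bag-of-words detector on a tiny inline toy dataset; it stands in for, and is not, the methods or data from the released repository.

```python
# Toy supervised deepfake-text detector: TF-IDF features + logistic
# regression. The four inline examples are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "I jotted this down on the train, sorry for the typos.",               # human
    "As an AI language model, I can provide a summary below.",             # machine
    "we argued about the recipe for hours, it was great fun",              # human
    "In conclusion, the aforementioned factors collectively indicate...",  # machine
]
labels = [0, 1, 0, 1]  # 0 = human-written, 1 = machine-generated

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)
print(detector.predict(["Overall, these results demonstrate the effectiveness of the approach."]))
```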

* Work in progress 

Multi-Task Instruction Tuning of LLaMa for Specific Scenarios: A Preliminary Study on Writing Assistance

May 22, 2023
Yue Zhang, Leyang Cui, Deng Cai, Xinting Huang, Tao Fang, Wei Bi

ChatGPT and GPT-4 have attracted substantial interest from both academic and industrial circles, owing to their remarkable few-shot (or even zero-shot) ability to handle various tasks. Recent work shows that, after being fine-tuned on a few sets of instruction-driven data, the recently proposed LLM, LLaMa, exhibits an impressive capability to address a broad range of tasks. However, the zero-shot performance of LLMs does not consistently outperform that of models fine-tuned for specific scenarios. To explore whether the capabilities of LLMs can be further enhanced for specific scenarios, we choose the writing-assistance scenario as the testbed, including seven writing tasks. We collect training data for these tasks, reframe them in an instruction-following format, and subsequently refine LLaMa via instruction tuning. Experimental results show that continually fine-tuning LLaMa on writing instruction data significantly improves its ability on writing tasks. We also conduct more experiments and analyses to offer insights for future work on effectively fine-tuning LLaMa for specific scenarios.
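
Reframing a task in an instruction-following format can be pictured as follows; the template fields and the GEC-style example are hypothetical, not the paper's actual prompt format.

```python
# Hypothetical reframing of a supervised writing task as an
# instruction-following training example.
def to_instruction_example(task_instruction: str, input_text: str, output_text: str) -> dict:
    return {
        "instruction": task_instruction,
        "input": input_text,
        "output": output_text,
    }

example = to_instruction_example(
    task_instruction="Correct any grammatical errors in the following text.",
    input_text="She go to school every days.",
    output_text="She goes to school every day.",
)
print(example)
```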

* Work in progress 

LogiCoT: Logical Chain-of-Thought Instruction-Tuning Data Collection with GPT-4

May 20, 2023
Hanmeng Liu, Zhiyang Teng, Leyang Cui, Chaoli Zhang, Qiji Zhou, Yue Zhang

Generative Pre-trained Transformer 4 (GPT-4) demonstrates impressive chain-of-thought reasoning ability. Recent work on self-instruction tuning, such as Alpaca, has focused on enhancing the general proficiency of models. These instructions enable the model to achieve performance comparable to GPT-3.5 on general tasks like open-domain text generation and paraphrasing. However, they fall short of helping the model handle complex reasoning tasks. To bridge the gap, this paper presents LogiCoT, a new instruction-tuning dataset for Logical Chain-of-Thought reasoning with GPT-4. We elaborate on the process of harvesting instructions for prompting GPT-4 to generate chain-of-thought rationales. LogiCoT serves as an instruction set for teaching models logical reasoning and eliciting general reasoning skills.
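
Harvesting a chain-of-thought rationale instruction might look roughly like this; the prompt wording and the seed example are hypothetical, and the actual GPT-4 API call is left out.

```python
# Hypothetical prompt construction for eliciting a step-by-step rationale
# from GPT-4 given a seed question/answer pair.
PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Explain, step by step, the logical reasoning that leads from the "
    "question to the answer."
)

def build_rationale_prompt(question: str, answer: str) -> str:
    return PROMPT_TEMPLATE.format(question=question, answer=answer)

seed = {
    "question": "All metals conduct electricity. Copper is a metal. "
                "Does copper conduct electricity?",
    "answer": "Yes",
}
print(build_rationale_prompt(**seed))
```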
