Xianpei Han

Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-based Retrofitting

Nov 22, 2023
Xinyan Guan, Yanjiang Liu, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun

Incorporating factual knowledge from knowledge graphs (KGs) is regarded as a promising approach for mitigating the hallucination of large language models (LLMs). Existing methods usually use only the user's input to query the knowledge graph, and thus fail to address the factual hallucinations that LLMs generate during their reasoning process. To address this problem, this paper proposes Knowledge Graph-based Retrofitting (KGR), a new framework that combines LLMs with KGs to mitigate factual hallucination during reasoning by retrofitting the LLM's initial draft responses against the factual knowledge stored in KGs. Specifically, KGR leverages LLMs to extract, select, validate, and retrofit factual statements within the model-generated responses, enabling an autonomous knowledge verification and refinement procedure without any additional manual effort. Experiments show that KGR significantly improves the performance of LLMs on factual QA benchmarks, especially on tasks involving complex reasoning, demonstrating the necessity and effectiveness of KGR in mitigating hallucination and enhancing the reliability of LLMs.
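
As a concrete picture of the pipeline the abstract describes, here is a minimal Python sketch of an extract-validate-retrofit loop. `call_llm` and `query_kg` are hypothetical stand-ins for an LLM API and a KG lookup, and the prompts are illustrative; this is not the paper's released implementation.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def query_kg(entity: str, relation: str) -> list[str]:
    raise NotImplementedError("plug in your KG lookup here")

def retrofit(question: str) -> str:
    draft = call_llm(f"Answer the question: {question}")

    # 1. Extract candidate factual statements from the draft answer.
    facts = call_llm(
        "List every factual claim in the text as 'entity | relation | object' "
        f"triples, one per line:\n{draft}"
    ).splitlines()

    # 2-3. Select well-formed claims and validate each against the KG.
    corrections = []
    for fact in facts:
        if fact.count("|") != 2:
            continue  # skip malformed extractions
        entity, relation, obj = (x.strip() for x in fact.split("|"))
        evidence = query_kg(entity, relation)
        if evidence and obj not in evidence:
            corrections.append(f"{entity} {relation}: {evidence[0]} (not {obj})")

    # 4. Retrofit: ask the LLM to revise the draft using the KG evidence.
    if not corrections:
        return draft
    prompt = (
        "Revise the answer below so it is consistent with these facts:\n"
        + "\n".join(corrections)
        + f"\n\nAnswer:\n{draft}"
    )
    return call_llm(prompt)
```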

Toward Unified Controllable Text Generation via Regular Expression Instruction

Sep 20, 2023
Xin Zheng, Hongyu Lin, Xianpei Han, Le Sun

Controllable text generation is a fundamental aspect of natural language generation, and numerous methods have been proposed for different constraint types. However, these approaches often require significant architectural or decoding modifications, making them difficult to extend to new constraints or to combinations of constraints. To address this, our paper introduces Regular Expression Instruction (REI), which uses an instruction-based mechanism to fully exploit the expressive power of regular expressions and uniformly model diverse constraints. Specifically, REI supports all popular fine-grained controllable generation constraints, i.e., lexical, positional, and length, as well as complex combinations of them, via regular expression-style instructions. Our method requires only fine-tuning on medium-scale language models or few-shot in-context learning on large language models, and needs no further adjustment when applied to various constraint combinations. Experiments demonstrate that our straightforward approach yields high success rates and adapts to various constraints while remaining competitive on automatic metrics and outperforming most previous baselines.
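
To make the constraint format concrete, here is a small illustration of how a single regex-style instruction can encode lexical, positional, and length constraints at once and be checked mechanically; the instruction wording is an assumption for illustration, not REI's exact template.

```python
import re

# One pattern encodes three constraints: the sentence must start with
# "The movie" (positional), contain "filmed in Paris" (lexical), and
# keep the middle span to at most 40 characters (length).
pattern = r"The movie .{0,40} was filmed in Paris\."
instruction = "Generate one sentence matching the regular expression: " + pattern

candidate = "The movie Amelie was filmed in Paris."  # e.g. a model output
assert re.fullmatch(pattern, candidate), "output violates the constraint"
```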

* Accepted at IJCNLP-AACL 2023 

Benchmarking Large Language Models in Retrieval-Augmented Generation

Sep 04, 2023
Jiawei Chen, Hongyu Lin, Xianpei Han, Le Sun

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models on four fundamental abilities required for RAG: noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides its instances into four separate testbeds, one for each of the aforementioned abilities. We then evaluate six representative LLMs on RGB to diagnose the challenges current LLMs face when applying RAG. The evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly with negative rejection, information integration, and dealing with false information. These results indicate that there is still a considerable journey ahead before RAG can be applied to LLMs effectively.
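
A minimal sketch of how an RGB-style evaluation could be organized, with one testbed per ability; `ask_model` and the instance fields are assumptions for illustration, not the benchmark's released harness.

```python
TESTBEDS = ("noise_robustness", "negative_rejection",
            "information_integration", "counterfactual_robustness")

def ask_model(query: str, docs: list[str]) -> str:
    raise NotImplementedError("plug in your LLM client here")

def evaluate(testbed: list[dict]) -> float:
    correct = 0
    for case in testbed:
        answer = ask_model(case["query"], case["retrieved_docs"])
        # For negative-rejection cases, the expected string would be a
        # refusal (no retrieved document actually answers the query).
        expected = case["answer"]
        correct += int(expected.lower() in answer.lower())
    return correct / len(testbed)
```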

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

Jun 08, 2023
Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Le Sun

Enabling large language models to effectively utilize real-world tools is crucial for achieving embodied intelligence. Existing approaches to tool learning have relied primarily either on extremely large language models, such as GPT-4, to attain generalized tool-use abilities in a zero-shot manner, or on supervised learning to train compact models on a limited set of tools. It remains uncertain, however, whether smaller language models can achieve generalized tool-use abilities without tool-specific training. To address this question, this paper introduces ToolAlpaca, a novel framework designed to automatically generate a tool-use corpus and learn generalized tool-use abilities on compact language models with minimal human intervention. Specifically, ToolAlpaca first collects a comprehensive dataset by building a multi-agent simulation environment, yielding 3938 tool-use instances from more than 400 real-world tool APIs spanning 50 distinct categories. The constructed corpus is then used to fine-tune compact language models, resulting in two models, ToolAlpaca-7B and ToolAlpaca-13B. Finally, we evaluate the ability of these models to utilize previously unseen tools without specific training. Experimental results demonstrate that ToolAlpaca achieves generalized tool-use capabilities comparable to those of extremely large language models like GPT-3.5, supporting the notion that learning generalized tool-use abilities is feasible for compact language models.
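
For intuition, here is one plausible shape for a simulated tool-use instance and how it might be serialized for fine-tuning; the field names are assumptions for illustration, not ToolAlpaca's released schema.

```python
# A hypothetical simulated tool-use instance: tool documentation, a user
# instruction, the agent's API call with the simulated observation, and
# the final natural-language answer.
instance = {
    "tool": "weather_api",
    "documentation": "GET /forecast?city=<name>&days=<n> -> JSON forecast",
    "instruction": "What will the weather be in Beijing tomorrow?",
    "actions": [
        {"call": "GET /forecast?city=Beijing&days=1",
         "observation": '{"tomorrow": {"condition": "sunny", "high": 28}}'},
    ],
    "final_answer": "Tomorrow in Beijing should be sunny with a high of 28°C.",
}

# Fine-tuning would serialize such instances into (prompt, target) pairs:
prompt = f"{instance['documentation']}\nUser: {instance['instruction']}\n"
target = instance["actions"][0]["call"]
```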

Learning In-context Learning for Named Entity Recognition

May 26, 2023
Jiawei Chen, Yaojie Lu, Hongyu Lin, Jie Lou, Wei Jia, Dai Dai, Hua Wu, Boxi Cao, Xianpei Han, Le Sun

Named entity recognition in real-world applications suffers from the diversity of entity types, the emergence of new entity types, and the lack of high-quality annotations. To address these problems, this paper proposes an in-context learning-based NER approach, which can effectively inject in-context NER ability into PLMs and recognize entities of novel types on the fly using only a few demonstrative instances. Specifically, we model PLMs as a meta-function $\lambda_{\text{instruction, demonstrations, text}}.\,\mathcal{M}$, so that a new entity extractor can be implicitly constructed by applying a new instruction and demonstrations to the PLM, i.e., $(\lambda.\mathcal{M})(\text{instruction, demonstrations}) \to \mathcal{F}$, where $\mathcal{F}$ is a new entity extractor $\mathcal{F}: \text{text} \to \text{entities}$. To inject this in-context NER ability into PLMs, we propose a meta-function pre-training algorithm, which pre-trains PLMs by comparing the (instruction, demonstration)-initialized extractor with a surrogate golden extractor. Experimental results on four few-shot NER datasets show that our method can effectively inject in-context NER ability into PLMs and significantly outperforms the PLMs + fine-tuning counterparts.
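
The meta-function view translates naturally into partial application: fixing the instruction and demonstrations yields a function over text. A minimal sketch, with `call_plm` as a hypothetical model call rather than the paper's implementation:

```python
from typing import Callable

def call_plm(prompt: str) -> str:
    raise NotImplementedError("plug in your model here")

def make_extractor(instruction: str,
                   demos: list[tuple[str, str]]) -> Callable[[str], str]:
    context = instruction + "\n" + "\n".join(
        f"Text: {text}\nEntities: {entities}" for text, entities in demos
    )
    # F = (lambda.M)(instruction, demonstrations): only the text remains free.
    return lambda text: call_plm(f"{context}\nText: {text}\nEntities:")

extract = make_extractor(
    "Extract all disease entities.",
    [("Aspirin treats headaches.", "headaches [disease]")],
)
# extract("Ibuprofen reduces fever.") -> model-produced entity list
```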

* Accepted to ACL 2023 Main Conference 

DLUE: Benchmarking Document Language Understanding

May 16, 2023
Ruoxi Xu, Hongyu Lin, Xinyan Guan, Xianpei Han, Yingfei Sun, Le Sun

Understanding documents is central to many real-world tasks but remains challenging. Unfortunately, there is no well-established consensus on how to comprehensively evaluate document understanding abilities, which significantly hinders fair comparison and measurement of the field's progress. To benchmark document understanding research, this paper summarizes four representative abilities: document classification, document structural analysis, document information extraction, and document transcription. Under this evaluation framework, we propose \textbf{Document Language Understanding Evaluation} -- \textbf{DLUE}, a new task suite covering a wide range of tasks in various forms, domains, and document genres. We systematically evaluate six well-established transformer models on DLUE and find that, due to lengthy content, complicated underlying structure, and dispersed knowledge, document understanding is still far from solved; moreover, no current neural architecture dominates all tasks, raising the need for a universal document understanding architecture.
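
As a small illustration, suite-level results of this kind are typically aggregated by macro-averaging within each ability group and then across groups; the task names below are placeholders, not DLUE's actual task list.

```python
# Hypothetical ability-to-task grouping and macro-average aggregation.
SUITE = {
    "classification": ["topic", "sentiment"],
    "structural_analysis": ["outline"],
    "information_extraction": ["ner", "relation"],
    "transcription": ["summarization"],
}

def macro_average(scores: dict[str, float]) -> float:
    per_ability = [
        sum(scores[t] for t in tasks) / len(tasks) for tasks in SUITE.values()
    ]
    return sum(per_ability) / len(per_ability)
```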

Retentive or Forgetful? Diving into the Knowledge Memorizing Mechanism of Language Models

May 16, 2023
Boxi Cao, Qiaoyu Tang, Hongyu Lin, Xianpei Han, Jiawei Chen, Tianshu Wang, Le Sun

Memory is one of the most essential cognitive functions, serving as a repository of world knowledge and episodes of activities. In recent years, large-scale pre-trained language models have shown remarkable memorizing ability, whereas vanilla neural networks without pre-training have long been observed to suffer from catastrophic forgetting. To investigate this retentive-forgetful contradiction and understand the memory mechanism of language models, we conduct thorough experiments that control the target knowledge types, the learning strategies, and the learning schedules. We find that: 1) vanilla language models are forgetful; 2) pre-training leads to retentive language models; and 3) knowledge relevance and diversification significantly influence memory formation. These conclusions are useful for understanding the abilities of pre-trained language models and shed light on designing and evaluating new learning and inference algorithms for language models.
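
For reference, retention comparisons of this kind rest on a standard forgetting measure: the drop in accuracy on earlier knowledge after subsequent training. A tiny sketch with placeholder numbers, not results from the paper:

```python
def forgetting(acc_before: float, acc_after: float) -> float:
    """Absolute accuracy drop on earlier knowledge caused by later training."""
    return acc_before - acc_after

# Hypothetical contrast: a vanilla LM drops sharply, a pre-trained LM barely moves.
print(forgetting(0.90, 0.35))  # 0.55, forgetful
print(forgetting(0.90, 0.85))  # 0.05, retentive
```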

Harvesting Event Schemas from Large Language Models

May 12, 2023
Jialong Tang, Hongyu Lin, Zhuoqun Li, Yaojie Lu, Xianpei Han, Le Sun

Event schemas provide a conceptual, structural, and formal language for representing events and modeling world event knowledge. Unfortunately, it is challenging to automatically induce high-quality and high-coverage event schemas due to the open nature of real-world events, the diversity of event expressions, and the sparsity of event knowledge. In this paper, we propose a new paradigm for event schema induction -- knowledge harvesting from large-scale pre-trained language models -- which can effectively resolve the above challenges by discovering, conceptualizing, and structuralizing event schemas from PLMs. We design an Event Schema Harvester (ESHer) to automatically induce high-quality event schemas via in-context generation-based conceptualization, confidence-aware schema structuralization, and graph-based schema aggregation. Empirical results show that ESHer can induce high-quality, high-coverage event schemas across varying domains.
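
To illustrate the final aggregation step, here is a minimal sketch that merges candidate schemas by argument-slot overlap; the similarity measure, threshold, and slot names are assumptions for illustration, not ESHer's exact settings.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)

def aggregate(schemas: list[set[str]], threshold: float = 0.5) -> list[set[str]]:
    merged: list[set[str]] = []
    for slots in schemas:
        for group in merged:
            if jaccard(slots, group) >= threshold:
                group |= slots  # merge into an existing schema cluster
                break
        else:
            merged.append(set(slots))
    return merged

candidates = [{"attacker", "target", "place"},
              {"attacker", "target", "instrument"},
              {"buyer", "seller", "price"}]
print(aggregate(candidates))
# -> two clusters: a merged attack schema and a separate transaction schema
```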

* 14 pages 

A Drop of Ink may Make a Million Think: The Spread of False Information in Large Language Models

May 08, 2023
Ning Bian, Peilin Liu, Xianpei Han, Hongyu Lin, Yaojie Lu, Ben He, Le Sun

Large language models (LLMs) like ChatGPT have gained increasing prominence in artificial intelligence, making a profound impact on society and on industries such as business and science. However, the presence of false information on the internet and in text corpora poses a significant risk to the reliability and safety of LLMs, underscoring the urgent need to understand how false information impacts and spreads within them. In this paper, we investigate how false information spreads in LLMs and affects related responses through a series of experiments on the effects of source authority, injection paradigm, and information relevance. Specifically, we compare four authority levels of information sources (Twitter, web blogs, news reports, and research papers), two common knowledge injection paradigms (in-context injection and learning-based injection), and three degrees of information relevance (direct, indirect, and peripheral). The experiments show that (1) false information spreads and contaminates related memories in LLMs via a semantic diffusion process, i.e., it has global detrimental effects beyond its direct impact; (2) current LLMs are susceptible to authority bias, i.e., they are more likely to follow false information presented in a trustworthy style such as news reports or research papers, which usually causes deeper and wider pollution; and (3) current LLMs are more sensitive to false information delivered via in-context injection than via learning-based injection, which challenges the reliability and safety of LLMs even when all training data are trustworthy and correct. These findings raise the need for new defense algorithms that address the global impact of false information, and for new alignment algorithms that lead LLMs, without bias, to follow internalized human values rather than superficial patterns.
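
A minimal sketch of how the in-context injection comparison could be set up: the same question is asked with the false claim styled at different authority levels. `call_llm` and the style templates are illustrative stand-ins, not the paper's code.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

# Hypothetical authority-level framings for the same false claim.
STYLES = {
    "twitter": "Someone tweeted: {claim}",
    "news": "A major newspaper reported: {claim}",
    "paper": "A peer-reviewed study concluded: {claim}",
}

def probe(question: str, false_claim: str) -> dict[str, str]:
    answers = {"baseline": call_llm(question)}  # no injection
    for name, template in STYLES.items():
        context = template.format(claim=false_claim)
        answers[name] = call_llm(f"{context}\n\n{question}")
    return answers  # compare how often each styled injection flips the answer
```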
