Houfeng Wang

Investigating the (De)Composition Capabilities of Large Language Models in Natural-to-Formal Language Conversion

Jan 24, 2025

Ziyao Xu, Houfeng Wang

Abstract:To achieve generalized and robust natural-to-formal language conversion (N2F), large language models (LLMs) need to have strong capabilities of decomposition and composition in N2F when faced with an unfamiliar formal language and be able to cope with compositional gaps and counter-intuitive symbolic names. To investigate whether LLMs have this set of basic capabilities in N2F, we propose the DEDC framework. This framework semi-automatically performs sample and task construction, allowing decoupled evaluation of the set of decomposition and composition capabilities of LLMs in N2F. Based on this framework, we evaluate and analyze the most advanced LLMs, and the main findings include that: (1) the LLMs are deficient in both decomposition and composition; (2) the LLMs show a wide coverage of error types that can be attributed to deficiencies in natural language understanding and the learning and use of symbolic systems; (3) compositional gaps and counter-intuitive symbolic names both affect the decomposition and composition of the LLMs. Our work provides a new perspective for investigating the basic capabilities of decomposition and composition of LLMs in N2F. The detailed analysis of deficiencies and attributions can help subsequent improvements of LLMs.

* Accepted at NAACL 2025 main conference

Towards a Unified View of Preference Learning for Large Language Models: A Survey

Sep 04, 2024

Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai, Zhe Yang, Liang Chen, Helan Hu, Runxin Xu, Qingxiu Dong, Ce Zheng (+14 more)

Abstract:Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. This alignment process often requires only a small amount of data to efficiently enhance the LLM's performance. While effective, research in this area spans multiple domains, and the methods involved are relatively complex to understand. The relationships between different methods have been under-explored, limiting the development of the preference alignment. In light of this, we break down the existing popular alignment strategies into different components and provide a unified framework to study the current alignment strategies, thereby establishing connections among them. In this survey, we decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm. This unified view offers an in-depth understanding of existing alignment algorithms and also opens up possibilities to synergize the strengths of different strategies. Furthermore, we present detailed working examples of prevalent existing algorithms to facilitate a comprehensive understanding for the readers. Finally, based on our unified perspective, we explore the challenges and future research directions for aligning large language models with human preferences.

* Initial Commit, 21 pages

HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation

Jun 11, 2024

Wen Luo, Tianshu Shen, Wei Li, Guangyue Peng, Richeng Xuan, Houfeng Wang, Xi Yang

Abstract:Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial encompasses both spontaneous and induced hallucination scenarios, covering factuality and faithfulness hallucinations. The benchmark includes 4,094 dialogues with a total of 146,856 samples. Leveraging HalluDial, we conduct a comprehensive meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge language model, HalluJudge. The high data quality of HalluDial enables HalluJudge to achieve superior or competitive performance in hallucination evaluation, facilitating the automatic assessment of dialogue-level hallucinations in LLMs and providing valuable insights into this phenomenon. The dataset and the code are available at https://github.com/FlagOpen/HalluDial.

Detection-Correction Structure via General Language Model for Grammatical Error Correction

May 28, 2024

Wei Li, Houfeng Wang

Abstract:Grammatical error correction (GEC) is a task dedicated to rectifying texts with minimal edits, which can be decoupled into two components: detection and correction. However, previous works have predominantly focused on direct correction, with no prior efforts to integrate both into a single model. Moreover, the exploration of the detection-correction paradigm by large language models (LLMs) remains underdeveloped. This paper introduces an integrated detection-correction structure, named DeCoGLM, based on the General Language Model (GLM). The detection phase employs a fault-tolerant detection template, while the correction phase leverages autoregressive mask infilling for localized error correction. Through the strategic organization of input tokens and modification of attention masks, we facilitate multi-task learning within a single model. Our model demonstrates competitive performance against the state-of-the-art models on English and Chinese GEC datasets. Further experiments present the effectiveness of the detection-correction structure in LLMs, suggesting a promising direction for GEC.

* Long paper. Accepted by ACL 2024 Main Conference

SPOR: A Comprehensive and Practical Evaluation Method for Compositional Generalization in Data-to-Text Generation

May 17, 2024

Ziyao Xu, Houfeng Wang

Abstract:Compositional generalization is an important ability of language models and has many different manifestations. For data-to-text generation, previous research on this ability is limited to a single manifestation called Systematicity and lacks consideration of large language models (LLMs), which cannot fully cover practical application scenarios. In this work, we propose SPOR, a comprehensive and practical evaluation method for compositional generalization in data-to-text generation. SPOR includes four aspects of manifestations (Systematicity, Productivity, Order invariance, and Rule learnability) and allows high-quality evaluation without additional manual annotations based on existing datasets. We demonstrate SPOR on two different datasets and evaluate some existing language models including LLMs. We find that the models are deficient in various aspects of the evaluation and need further improvement. Our work shows the necessity for comprehensive research on different manifestations of compositional generalization in data-to-text generation and provides a framework for evaluation.

Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment

Mar 30, 2024

Feifan Song, Bowen Yu, Hao Lang, Haiyang Yu, Fei Huang, Houfeng Wang, Yongbin Li

Abstract:Alignment with human preference prevents large language models (LLMs) from generating misleading or toxic content while requiring high-cost human feedback. Assuming resources of human annotation are limited, there are two different ways of allocating considered: more diverse PROMPTS or more diverse RESPONSES to be labeled. Nonetheless, a straightforward comparison between their impact is absent. In this work, we first control the diversity of both sides according to the number of samples for fine-tuning, which can directly reflect their influence. We find that instead of numerous prompts, more responses but fewer prompts better trigger LLMs for human alignment. Additionally, the concept of diversity for prompts can be more complex than responses that are typically quantified by single digits. Consequently, a new formulation of prompt diversity is proposed, further implying a linear correlation with the final performance of LLMs after fine-tuning. We also leverage it on data augmentation and conduct experiments to show its effect on different algorithms.

* Accepted by LREC-COLING 2024

Utilizing Local Hierarchy with Adversarial Training for Hierarchical Text Classification

Feb 29, 2024

Zihan Wang, Peiyi Wang, Houfeng Wang

Abstract:Hierarchical text classification (HTC) is a challenging subtask of multi-label classification due to its complex taxonomic structure. Nearly all recent HTC works focus on how the labels are structured but ignore the sub-structure of ground-truth labels according to each input text which contains fruitful label co-occurrence information. In this work, we introduce this local hierarchy with an adversarial framework. We propose a HiAdv framework that can fit in nearly all HTC models and optimize them with the local hierarchy as auxiliary information. We test on two typical HTC models and find that HiAdv is effective in all scenarios and is adept at dealing with complex taxonomic hierarchies. Further experiments demonstrate that the promotion of our framework indeed comes from the local hierarchy and the local hierarchy is beneficial for rare classes which have insufficient training data.

* Accepted by LREC-COLING 2024

ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization

Feb 14, 2024

Feifan Song, Yuxuan Fan, Xin Zhang, Peiyi Wang, Houfeng Wang

Abstract:Large Language Models (LLMs) rely on Human Preference Alignment (HPA) to ensure the generation of safe content. Due to the heavy cost associated with fine-tuning, fine-tuning-free methods have emerged, typically modifying LLM decoding with external auxiliary methods. However, these methods do not essentially enhance the LLM itself. In this paper, we rethink the derivation procedures of DPO, based on which we conversely build an instant scorer using the states of the LLM before and after In-context Learning (ICL). Accordingly, we propose a novel approach called In-Context Direct Preference Optimization (ICDPO). It enables LLMs to borrow the HPA capabilities from superior LLMs with ICL, generating well-aligned responses as estimated by the aforementioned instant scorer, thereby enhancing the final performance. ICDPO can be further enhanced with a two-stage retriever and an upgraded scorer, both offering benefits. Extensive experiments show its effectiveness, particularly in outperforming two fine-tuning-free baselines, and it exhibits competitiveness with SFT + LoRA. We also conduct detailed analyses to offer comprehensive insights into ICDPO.
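
One way to picture the "instant scorer" built from the states of the LLM before and after ICL is as a log-probability contrast: a candidate response scores highly when in-context demonstrations raise its likelihood the most. The sketch below is a minimal illustration under that assumption, not the paper's implementation; `pick_response`, `logp_with_demos`, and `logp_without_demos` are hypothetical names, and the two callables stand in for whatever routine returns a response's log-probability with and without demonstrations in the prompt.

```python
def icl_contrast_score(logp_after_icl, logp_before_icl):
    """Score a response by how much ICL raises its log-probability.

    Assumption: the scorer is a simple log-ratio between the model
    conditioned on demonstrations and the same model without them.
    """
    return logp_after_icl - logp_before_icl


def pick_response(candidates, logp_with_demos, logp_without_demos):
    """Return the candidate with the largest ICL contrast score.

    logp_with_demos / logp_without_demos are hypothetical callables
    mapping a candidate response to its log-probability.
    """
    return max(
        candidates,
        key=lambda y: icl_contrast_score(logp_with_demos(y), logp_without_demos(y)),
    )
```

Note that under this sketch the chosen response need not be the most likely one in absolute terms; it is the one whose likelihood benefits most from the demonstrations.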

Preference Ranking Optimization for Human Alignment

Jun 30, 2023

Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, Houfeng Wang

Abstract:Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment by combining a reward model, typically based on Bradley-Terry paired comparison, with an RL algorithm such as Proximal Policy Optimization (PPO) to optimize LLM responses. However, RLHF exhibits complexity, instability, and sensitivity to hyperparameters. In this paper, we propose Preference Ranking Optimization (PRO) as an alternative to PPO for directly aligning LLMs with the Bradley-Terry comparison. PRO extends the pairwise Bradley-Terry comparison to accommodate preference rankings of any length. By iteratively contrasting the likelihood of generating responses, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of $n$ responses generated by LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms existing alignment algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations. Furthermore, we demonstrate that longer, more diverse, and higher-quality preference ranking sequences can consistently enhance the performance of human alignment.
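
The idea of extending a pairwise Bradley-Terry comparison to a ranking of any length can be sketched as a listwise loss: at each step the current best response is contrasted against all responses ranked below it, then dropped, and the process repeats on the remainder. The code below is a minimal sketch under that reading and is not claimed to be the paper's exact objective; `ranking_loss` is a hypothetical name, and `scores` is assumed to hold one scalar per response (for example a log-likelihood under the LLM), ordered from most to least preferred.

```python
import math


def ranking_loss(scores):
    """Listwise ranking loss over scores ordered best to worst.

    For each position k (except the last), take the negative log of the
    softmax probability that response k beats every response ranked at
    or below it. With two responses this reduces to the pairwise
    Bradley-Terry negative log-likelihood.
    """
    loss = 0.0
    for k in range(len(scores) - 1):
        tail = scores[k:]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(tail)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in tail))
        loss -= scores[k] - log_denominator
    return loss
```

A ranking that the scores already agree with yields a lower loss than the reversed ranking, which is the behavior one would want from a training signal of this kind.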

Semiparametric Language Models Are Scalable Continual Learners

Mar 02, 2023

Guangyue Peng, Tao Ge, Si-Qing Chen, Furu Wei, Houfeng Wang

Abstract:Semiparametric language models (LMs) have shown promise in continuously learning from new text data by combining a parameterized neural LM with a growable non-parametric memory for memorizing new content. However, conventional semiparametric LMs will finally become prohibitive for computing and storing if they are applied to continual learning over streaming data, because the non-parametric memory grows linearly with the amount of data they learn from over time. To address the issue of scalability, we present a simple and intuitive approach called Selective Memorization (SeMem), which only memorizes difficult samples that the model is likely to struggle with. We demonstrate that SeMem improves the scalability of semiparametric LMs for continual learning over streaming data in two ways: (1) data-wise scalability: as the model becomes stronger through continual learning, it will encounter fewer difficult cases that need to be memorized, causing the growth of the non-parametric memory to slow down over time rather than growing at a linear rate with the size of training data; (2) model-wise scalability: SeMem allows a larger model to memorize fewer samples than its smaller counterpart because it is rarer for a larger model to encounter incomprehensible cases, resulting in a non-parametric memory that does not scale linearly with model size. We conduct extensive experiments in language modeling and downstream tasks to test SeMem's results, showing SeMem enables a semiparametric LM to be a scalable continual learner with little forgetting.
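
The selective-memorization idea, storing only the samples the model is likely to struggle with, can be illustrated with a simple filter on the model's own confidence. This is a sketch under stated assumptions rather than the SeMem implementation: the difficulty criterion (negative log-likelihood above a fixed `threshold`), the function name `selective_memorize`, and the `predict_prob` callable are all hypothetical.

```python
import math


def selective_memorize(samples, predict_prob, memory, threshold=1.0):
    """Append only difficult (context, target) pairs to memory.

    predict_prob(context, target) is a hypothetical callable returning
    the model's probability of the target given the context. A sample
    counts as difficult when its negative log-likelihood exceeds the
    threshold (an assumed criterion, chosen for illustration).
    """
    for context, target in samples:
        prob = max(predict_prob(context, target), 1e-12)  # guard against log(0)
        if -math.log(prob) > threshold:
            memory.append((context, target))
    return memory
```

Under this sketch, a model that assigns higher probability to more targets stores fewer entries, which matches the intuition in the abstract that memory growth slows as the model improves.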

* Work in progress
