Min-Yen Kan

Columbia University

V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization

Nov 05, 2024

Yuxi Xie, Guanzhen Li, Xiao Xu, Min-Yen Kan

Abstract: Large vision-language models (LVLMs) suffer from hallucination, resulting in misalignment between the output textual response and the input visual content. Recent research indicates that the over-reliance on the Large Language Model (LLM) backbone, as one cause of the LVLM hallucination, inherently introduces bias from language priors, leading to insufficient context attention to the visual inputs. We tackle this issue of hallucination by mitigating such over-reliance through preference learning. We propose Vision-guided Direct Preference Optimization (V-DPO) to enhance visual context learning at training time. To interpret the effectiveness and generalizability of V-DPO on different types of training data, we construct a synthetic dataset containing both response- and image-contrast preference pairs, compared against existing human-annotated hallucination samples. Our approach achieves significant improvements compared with baseline methods across various hallucination benchmarks. Our analysis indicates that V-DPO excels in learning from image-contrast preference data, demonstrating its superior ability to elicit and understand nuances of visual context. Our code is publicly available at https://github.com/YuxiXie/V-DPO.

* EMNLP 2024 Findings; 9 pages, 6 figures, 5 tables (16 pages, 8 figures, 8 tables including references and appendices)
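
For orientation, the abstract above builds on Direct Preference Optimization. The block below states the standard DPO objective only. It is not the vision-guided variant proposed in the paper, whose exact form the abstract does not give. The symbols are assumptions for illustration: $x$ is the input (for an LVLM, image plus prompt), $y_w$ and $y_l$ are the preferred and dispreferred responses, $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\sigma$ is the logistic function, and $\beta$ is a temperature-like hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of $y_w$ relative to $y_l$, measured against the reference model.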

Multi-expert Prompting Improves Reliability, Safety, and Usefulness of Large Language Models

Nov 01, 2024

Do Xuan Long, Duong Ngoc Yen, Anh Tuan Luu, Kenji Kawaguchi, Min-Yen Kan, Nancy F. Chen

Abstract: We present Multi-expert Prompting, a novel enhancement of ExpertPrompting (Xu et al., 2023), designed to improve large language model (LLM) generation. Specifically, it guides an LLM to fulfill an input instruction by simulating multiple experts, aggregating their responses, and selecting the best among individual and aggregated responses. This process is performed in a single chain of thoughts through our seven carefully designed subtasks derived from the Nominal Group Technique (Ven and Delbecq, 1974), a well-established decision-making framework. Our evaluations demonstrate that Multi-expert Prompting significantly outperforms ExpertPrompting and comparable baselines in enhancing the truthfulness, factuality, informativeness, and usefulness of responses while reducing toxicity and hurtfulness. It further achieves state-of-the-art truthfulness by outperforming the best baseline by 8.69% with ChatGPT. Multi-expert Prompting is efficient, explainable, and highly adaptable to diverse scenarios, eliminating the need for manual prompt construction.

* EMNLP 2024 Main Conference

DataTales: A Benchmark for Real-World Intelligent Data Narration

Oct 23, 2024

Yajing Yang, Qian Liu, Min-Yen Kan

Abstract: We introduce DataTales, a novel benchmark designed to assess the proficiency of language models in data narration, a task crucial for transforming complex tabular data into accessible narratives. Existing benchmarks often fall short in capturing the requisite analytical complexity for practical applications. DataTales addresses this gap by offering 4.9k financial reports paired with corresponding market data, showcasing the demand for models to create clear narratives and analyze large datasets while understanding specialized terminology in the field. Our findings highlight the significant challenge that language models face in achieving the necessary precision and analytical depth for proficient data narration, suggesting promising avenues for future model development and evaluation methodologies.

CCSBench: Evaluating Compositional Controllability in LLMs for Scientific Document Summarization

Oct 16, 2024

Yixi Ding, Jiaying Wu, Tongyao Zhu, Yanxia Qin, Qian Liu, Min-Yen Kan

Abstract: To broaden the dissemination of scientific knowledge to diverse audiences, scientific document summarization must simultaneously control multiple attributes such as length and empirical focus. However, existing research typically focuses on controlling single attributes, leaving the compositional control of multiple attributes underexplored. To address this gap, we introduce CCSBench, a benchmark for compositional controllable summarization in the scientific domain. Our benchmark enables fine-grained control over both explicit attributes (e.g., length), which are objective and straightforward, and implicit attributes (e.g., empirical focus), which are more subjective and conceptual. We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our findings reveal significant limitations in large language models' ability to balance trade-offs between control attributes, especially implicit ones that require deeper understanding and abstract reasoning.

COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement

Oct 12, 2024

Yuxi Xie, Anirudh Goyal, Xiaobao Wu, Xunjian Yin, Xiao Xu, Min-Yen Kan, Liangming Pan, William Yang Wang

Abstract: Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. However, existing approaches typically implement iterative refinement at the application or prompting level, relying on autoregressive (AR) modeling. The sequential token generation in AR models can lead to high inference latency. To overcome these challenges, we propose Context-Wise Order-Agnostic Language Modeling (COrAL), which incorporates iterative refinement directly into the LLM architecture while maintaining computational efficiency. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally during the generation process. Leveraging the order-agnostic nature of COrAL, we introduce sliding blockwise order-agnostic decoding, which performs multi-token forward prediction and backward reconstruction within context windows. This allows the model to iteratively refine its outputs in parallel in the sliding block, effectively capturing diverse dependencies without the high inference cost of sequential generation. Empirical evaluations on reasoning tasks demonstrate that COrAL improves both performance and inference speed, achieving absolute accuracy gains of 4.6% on GSM8K and 4.0% on LogiQA, along with inference speedups of up to 3.9× over next-token baselines. Preliminary results on code generation indicate a drop in pass rates due to inconsistencies in order-agnostic outputs, highlighting the inherent quality-speed trade-off. Our code is publicly available at https://github.com/YuxiXie/COrAL.

* 12 pages, 7 figures, 3 tables (23 pages, 9 figures, 4 tables including references and appendices)

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?

Oct 06, 2024

Guanzhen Li, Yuxi Xie, Min-Yen Kan

Abstract: Humans perform visual perception at multiple levels, including low-level object recognition and high-level semantic interpretation such as behavior understanding. Subtle differences in low-level details can lead to substantial changes in high-level perception. For example, substituting the shopping bag held by a person with a gun suggests violent behavior, implying criminal or violent activity. Despite significant advancements in various multimodal tasks, Large Visual-Language Models (LVLMs) remain unexplored in their capabilities to conduct such multi-level visual perceptions. To investigate the perception gap between LVLMs and humans, we introduce MVP-Bench, the first visual-language benchmark systematically evaluating both low- and high-level visual perception of LVLMs. We construct MVP-Bench across natural and synthetic images to investigate how manipulated content influences model perception. Using MVP-Bench, we diagnose the visual perception of 10 open-source and 2 closed-source LVLMs, showing that high-level perception tasks significantly challenge existing LVLMs. The state-of-the-art GPT-4o only achieves an accuracy of 56% on Yes/No questions, compared with 74% in low-level scenarios. Furthermore, the performance gap between natural and manipulated images indicates that current LVLMs do not generalize in understanding the visual semantics of synthetic images as humans do. Our data and code are publicly available at https://github.com/GuanzhenLi/MVP-Bench.

TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning

Sep 18, 2024

Xinyuan Lu, Liangming Pan, Yubo Ma, Preslav Nakov, Min-Yen Kan

Abstract: Current Large Language Models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning, which is crucial for tasks such as table question answering (TQA) and table-based fact verification (TFV). To address these challenges, we introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability. We also present the TOOLTAB dataset, a new benchmark designed specifically for training LLMs in table-tool integration. Our experiments indicate that TART achieves substantial improvements over existing methods (e.g., Chain-of-Thought) by improving both the precision of data processing and the clarity of the reasoning process. Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the closed-source LLM GPT-3.5-turbo, highlighting its robustness in diverse real-world scenarios. All the code and data are available at https://github.com/XinyuanLu00/TART.

* technical report

LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs

Aug 16, 2024

Do Xuan Long, Hai Nguyen Ngoc, Tiviatis Sim, Hieu Dao, Shafiq Joty, Kenji Kawaguchi, Nancy F. Chen, Min-Yen Kan

Abstract: We present the first systematic evaluation examining format bias in the performance of large language models (LLMs). Our approach distinguishes between two categories of an evaluation metric under format constraints to reliably and accurately assess performance: one measures performance when format constraints are adhered to, while the other evaluates performance regardless of constraint adherence. We then define a metric for measuring the format bias of LLMs and establish effective strategies to reduce it. Subsequently, we present our empirical format bias evaluation spanning four commonly used categories (multiple-choice question-answer, wrapping, list, and mapping), covering 15 widely-used formats. Our evaluation on eight generation tasks uncovers significant format bias across state-of-the-art LLMs. We further discover that improving the format-instruction following capabilities of LLMs across formats potentially reduces format bias. Based on our evaluation findings, we study prompting and fine-tuning with synthesized format data techniques to mitigate format bias. Our methods successfully reduce the variance in ChatGPT's performance among wrapping formats from 235.33 to 0.71 (%²).
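
The variance figure in the abstract above is reported in squared percentage points. As a rough illustration of how such a number could be computed, the sketch below takes per-format scores in percent and returns their variance. The function name, the format names, and the scores are hypothetical. Using the population variance (rather than the sample variance) is also an assumption, since the abstract does not say which one the paper uses.

```python
from statistics import pvariance


def format_variance(scores_by_format: dict[str, float]) -> float:
    """Return the population variance (in %^2) of per-format scores given in percent.

    Assumption: population variance is used here; the paper's exact
    definition of its format-bias metric is not stated in the abstract.
    """
    if len(scores_by_format) < 2:
        raise ValueError("need scores for at least two formats")
    return pvariance(list(scores_by_format.values()))


# Hypothetical accuracies (in percent) of one model under three wrapping formats.
example_scores = {"bold": 60.0, "italic": 62.0, "quotes": 58.0}
print(f"variance across formats: {format_variance(example_scores):.2f} %^2")
```

A smaller value means the model's score depends less on which output format was requested.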

The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Pre-trained Language Models

Jun 14, 2024

Yan Liu, Yu Liu, Xiaokang Chen, Pin-Yu Chen, Daoguang Zan, Min-Yen Kan, Tsung-Yi Ho

Abstract: Pre-trained language models (PLMs) have been acknowledged to contain harmful information, such as social biases, which may cause negative social impacts or even bring catastrophic results in application. Previous works on this problem mainly focused on using black-box methods such as probing to detect and quantify social biases in PLMs by observing model outputs. As a result, previous debiasing methods mainly finetune or even pre-train language models on newly constructed anti-stereotypical datasets, which are high-cost. In this work, we try to unveil the mystery of social bias inside language models by introducing the concept of Social Bias Neurons. Specifically, we propose Integrated Gap Gradients (IG²) to accurately pinpoint units (i.e., neurons) in a language model that can be attributed to undesirable behavior, such as social bias. By formalizing undesirable behavior as a distributional property of language, we employ sentiment-bearing prompts to elicit classes of sensitive words (demographics) correlated with such sentiments. Our IG² thus attributes the uneven distribution for different demographics to specific Social Bias Neurons, which track the trail of unwanted behavior inside PLM units to achieve interoperability. Moreover, derived from our interpretable technique, Bias Neuron Suppression (BNS) is further proposed to mitigate social biases. By studying BERT, RoBERTa, and their attributable differences from debiased FairBERTa, IG² allows us to locate and suppress identified neurons, and further mitigate undesired behaviors. As measured by prior metrics from StereoSet, our model achieves a higher degree of fairness while maintaining language modeling ability with low cost.

Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework

May 24, 2024

Minzhi Li, Zhengyuan Liu, Shumin Deng, Shafiq Joty, Nancy F. Chen, Min-Yen Kan

Abstract: The acceleration of Large Language Models (LLMs) research has opened up new possibilities for evaluating generated texts. They serve as scalable and economical evaluators, but the question of how reliable these evaluators are has emerged as a crucial research question. Prior research efforts in the meta-evaluation of LLMs as judges limit the prompting of an LLM to a single use to obtain a final evaluation decision. They then compute the agreement between LLMs' outputs and human labels. This lacks interpretability in understanding the evaluation capability of LLMs. In light of this challenge, we propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices. Our experiments illustrate that it not only provides a more interpretable window for how well LLMs evaluate, but also leads to improvements of up to 39.6% for different LLMs on a variety of meta-evaluation benchmarks.
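
The single-shot meta-evaluation that the abstract above describes as prior practice reduces to comparing an LLM judge's decisions with human labels. A minimal sketch of the simplest such measure, raw agreement, follows. The function name and the example labels are hypothetical. The abstract does not say which agreement statistic the benchmarks use, so this illustrates the general idea and is not the paper's metric or its Decompose and Aggregate framework.

```python
from collections.abc import Sequence


def agreement_rate(llm_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Return the fraction of items on which the LLM judge and the human label agree."""
    if len(llm_labels) != len(human_labels):
        raise ValueError("label sequences must have the same length")
    if not llm_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(llm == human for llm, human in zip(llm_labels, human_labels))
    return matches / len(llm_labels)


# Hypothetical pairwise-preference decisions ("A" or "B") for four items.
llm_decisions = ["A", "B", "A", "A"]
human_decisions = ["A", "B", "B", "A"]
print(agreement_rate(llm_decisions, human_decisions))
```

A single number like this says how often the judge matches humans but not why it disagrees, which is the interpretability gap the abstract points to.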
