Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yue Zhang

Renmin University of China

Tables as Images? Exploring the Strengths and Limitations of LLMs on Multimodal Representations of Tabular Data

Feb 23, 2024

Naihao Deng, Zhenjie Sun, Ruiqi He, Aman Sikka, Yulong Chen, Lin Ma, Yue Zhang, Rada Mihalcea

Figure 1 for Tables as Images? Exploring the Strengths and Limitations of LLMs on Multimodal Representations of Tabular Data

Figure 2 for Tables as Images? Exploring the Strengths and Limitations of LLMs on Multimodal Representations of Tabular Data

Figure 3 for Tables as Images? Exploring the Strengths and Limitations of LLMs on Multimodal Representations of Tabular Data

Figure 4 for Tables as Images? Exploring the Strengths and Limitations of LLMs on Multimodal Representations of Tabular Data

Abstract:In this paper, we investigate the effectiveness of various LLMs in interpreting tabular data through different prompting strategies and data formats. Our analysis extends across six benchmarks for table-related tasks such as question-answering and fact-checking. We introduce for the first time the assessment of LLMs' performance on image-based table representations. Specifically, we compare five text-based and three image-based table representations, demonstrating the influence of representation and prompting on LLM performance. Our study provides insights into the effective use of LLMs on table-related tasks.

Via

Access Paper or Ask Questions

RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models

Feb 22, 2024

Jianhao Yan, Yun Luo, Yue Zhang

Figure 1 for RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models

Figure 2 for RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models

Figure 3 for RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models

Figure 4 for RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models

Abstract:The application scope of large language models (LLMs) is increasingly expanding. In practical use, users might provide feedback based on the model's output, hoping for a responsive model that can complete responses according to their feedback. Whether the model can appropriately respond to users' refuting feedback and consistently follow through with execution has not been thoroughly analyzed. In light of this, this paper proposes a comprehensive benchmark, RefuteBench, covering tasks such as question answering, machine translation, and email writing. The evaluation aims to assess whether models can positively accept feedback in form of refuting instructions and whether they can consistently adhere to user demands throughout the conversation. We conduct evaluations on numerous LLMs and find that LLMs are stubborn, i.e. exhibit inclination to their internal knowledge, often failing to comply with user feedback. Additionally, as the length of the conversation increases, models gradually forget the user's stated feedback and roll back to their own responses. We further propose a recall-and-repeat prompts as a simple and effective way to enhance the model's responsiveness to feedback.

* Work in progress

Via

Access Paper or Ask Questions

Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond

Feb 22, 2024

Zhiyuan Wang, Jinhao Duan, Chenxi Yuan, Qingyu Chen, Tianlong Chen, Huaxiu Yao, Yue Zhang, Ren Wang, Kaidi Xu, Xiaoshuang Shi

Figure 1 for Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond

Figure 2 for Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond

Figure 3 for Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond

Figure 4 for Word-Sequence Entropy: Towards Uncertainty Estimation in Free-Form Medical Question Answering Applications and Beyond

Abstract:Uncertainty estimation plays a pivotal role in ensuring the reliability of safety-critical human-AI interaction systems, particularly in the medical domain. However, a general method for quantifying the uncertainty of free-form answers has yet to be established in open-ended medical question-answering (QA) tasks, where irrelevant words and sequences with limited semantic information can be the primary source of uncertainty due to the presence of generative inequality. In this paper, we propose the Word-Sequence Entropy (WSE), which calibrates the uncertainty proportion at both the word and sequence levels according to the semantic relevance, with greater emphasis placed on keywords and more relevant sequences when performing uncertainty quantification. We compare WSE with 6 baseline methods on 5 free-form medical QA datasets, utilizing 7 "off-the-shelf" large language models (LLMs), and show that WSE exhibits superior performance on accurate uncertainty measurement under two standard criteria for correctness evaluation (e.g., WSE outperforms existing state-of-the-art method by 3.23% AUROC on the MedQA dataset). Additionally, in terms of the potential for real-world medical QA applications, we achieve a significant enhancement in the performance of LLMs when employing sequences with lower uncertainty, identified by WSE, as final answers (e.g., +6.36% accuracy improvement on the COVID-QA dataset), without requiring any additional task-specific fine-tuning or architectural modifications.

* 18 pages

Via

Access Paper or Ask Questions

Potential and Challenges of Model Editing for Social Debiasing

Feb 21, 2024

Jianhao Yan, Futing Wang, Yafu Li, Yue Zhang

Abstract:Large language models (LLMs) trained on vast corpora suffer from inevitable stereotype biases. Mitigating these biases with fine-tuning could be both costly and data-hungry. Model editing methods, which focus on modifying LLMs in a post-hoc manner, are of great potential to address debiasing. However, it lacks a comprehensive study that facilitates both internal and external model editing methods, supports various bias types, as well as understands the pros and cons of applying editing methods to stereotypical debiasing. To mitigate this gap, we carefully formulate social debiasing into an editing problem and benchmark seven existing model editing algorithms on stereotypical debiasing, i.e., debias editing. Our findings in three scenarios reveal both the potential and challenges of debias editing: (1) Existing model editing methods can effectively preserve knowledge and mitigate biases, while the generalization of debias effect from edited sentences to semantically equivalent sentences is limited.(2) Sequential editing highlights the robustness of SERAC (Mitchell et al. 2022b), while internal editing methods degenerate with the number of edits. (3) Model editing algorithms achieve generalization towards unseen biases both within the same type and from different types. In light of these findings, we further propose two simple but effective methods to improve debias editing, and experimentally show the effectiveness of the proposed methods.

* Work in progress

Via

Access Paper or Ask Questions

SQL-CRAFT: Text-to-SQL through Interactive Refinement and Enhanced Reasoning

Feb 20, 2024

Hanchen Xia, Feng Jiang, Naihao Deng, Cunxiang Wang, Guojiang Zhao, Rada Mihalcea, Yue Zhang

Figure 1 for SQL-CRAFT: Text-to-SQL through Interactive Refinement and Enhanced Reasoning

Figure 2 for SQL-CRAFT: Text-to-SQL through Interactive Refinement and Enhanced Reasoning

Figure 3 for SQL-CRAFT: Text-to-SQL through Interactive Refinement and Enhanced Reasoning

Figure 4 for SQL-CRAFT: Text-to-SQL through Interactive Refinement and Enhanced Reasoning

Abstract:Modern LLMs have become increasingly powerful, but they are still facing challenges in specialized tasks such as Text-to-SQL. We propose SQL-CRAFT, a framework to advance LLMs' SQL generation Capabilities through inteRActive reFinemenT and enhanced reasoning. We leverage an Interactive Correction Loop (IC-Loop) for LLMs to interact with databases automatically, as well as Python-enhanced reasoning. We conduct experiments on two Text-to-SQL datasets, Spider and Bird, with performance improvements of up to 5.7% compared to the naive prompting method. Moreover, our method surpasses the current state-of-the-art on the Spider Leaderboard, demonstrating the effectiveness of our framework.

* 11 pages, 3 figures, 6 tables

Via

Access Paper or Ask Questions

MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition

Feb 19, 2024

Jian Wu, Linyi Yang, Manabu Okumura, Yue Zhang

Figure 1 for MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition

Figure 2 for MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition

Figure 3 for MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition

Figure 4 for MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition

Abstract:Although Large Language Models (LLMs) have shown strong performance in Multi-hop Question Answering (MHQA) tasks, their real reasoning ability remains exploration. Current LLM QA evaluation benchmarks have shown limitations, including 1) data contamination, the evaluation data are potentially exposed to LLMs during the pretraining stage; and 2) ignoration of the reasoning chain evaluation. Thus we introduce an LLM MHQA evaluation benchmark, the first QA benchmark based on the new, unprecedented knowledge by editing the off-the-shelf HotpotQA dataset; Besides, we also annotate and evaluate the reasoning chain in the form of sub-questions and intermediate answers corresponding to the multi-hop questions. Specifically, based on the observation, 1) LLMs show a performance gap between the original HotpotQA and our edited data, deeming that current MHQA benchmarks have the potential risk of data contamination that hard to evaluate LLMs' performance objectively and scientifically; 2) LLMs only get a small percentage of the right reasoning chain, e.g. GPT-4 only gets 36.3\% right reasoning chain. We believe this new Multi-hop QA evaluation benchmark and novel evaluation methods will facilitate the development of trustworthy LLM evaluation on the MHQA task.

Via

Access Paper or Ask Questions

Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization

Feb 18, 2024

Liqiang Jing, Jingxuan Zuo, Yue Zhang

Abstract:Multimodal summarization aims to generate a concise summary based on the input text and image. However, the existing methods potentially suffer from unfactual output. To evaluate the factuality of multimodal summarization models, we propose two fine-grained and explainable evaluation frameworks (FALLACIOUS) for different application scenarios, i.e. reference-based factuality evaluation framework and reference-free factuality evaluation framework. Notably, the reference-free factuality evaluation framework doesn't need ground truth and hence it has a wider application scenario. To evaluate the effectiveness of the proposed frameworks, we compute the correlation between our frameworks and the other metrics. The experimental results show the effectiveness of our proposed method. We will release our code and dataset via github.

Via

Access Paper or Ask Questions

Detecting Multimedia Generated by Large AI Models: A Survey

Feb 07, 2024

Li Lin, Neeraj Gupta, Yue Zhang, Hainan Ren, Chun-Hao Liu, Feng Ding, Xin Wang, Xin Li, Luisa Verdoliva, Shu Hu

Figure 1 for Detecting Multimedia Generated by Large AI Models: A Survey

Figure 2 for Detecting Multimedia Generated by Large AI Models: A Survey

Figure 3 for Detecting Multimedia Generated by Large AI Models: A Survey

Figure 4 for Detecting Multimedia Generated by Large AI Models: A Survey

Abstract:The rapid advancement of Large AI Models (LAIMs), particularly diffusion models and large language models, has marked a new era where AI-generated multimedia is increasingly integrated into various aspects of daily life. Although beneficial in numerous fields, this content presents significant risks, including potential misuse, societal disruptions, and ethical concerns. Consequently, detecting multimedia generated by LAIMs has become crucial, with a marked rise in related research. Despite this, there remains a notable gap in systematic surveys that focus specifically on detecting LAIM-generated multimedia. Addressing this, we provide the first survey to comprehensively cover existing research on detecting multimedia (such as text, images, videos, audio, and multimodal content) created by LAIMs. Specifically, we introduce a novel taxonomy for detection methods, categorized by media modality, and aligned with two perspectives: pure detection (aiming to enhance detection performance) and beyond detection (adding attributes like generalizability, robustness, and interpretability to detectors). Additionally, we have presented a brief overview of generation mechanisms, public datasets, and online detection tools to provide a valuable resource for researchers and practitioners in this field. Furthermore, we identify current challenges in detection and propose directions for future research that address unexplored, ongoing, and emerging issues in detecting multimedia generated by LAIMs. Our aim for this survey is to fill an academic gap and contribute to global AI security efforts, helping to ensure the integrity of information in the digital realm. The project link is https://github.com/Purdue-M2/Detect-LAIM-generated-Multimedia-Survey.

Via

Access Paper or Ask Questions

NavHint: Vision and Language Navigation Agent with a Hint Generator

Feb 04, 2024

Yue Zhang, Quan Guo, Parisa Kordjamshidi

Abstract:Existing work on vision and language navigation mainly relies on navigation-related losses to establish the connection between vision and language modalities, neglecting aspects of helping the navigation agent build a deep understanding of the visual environment. In our work, we provide indirect supervision to the navigation agent through a hint generator that provides detailed visual descriptions. The hint generator assists the navigation agent in developing a global understanding of the visual environment. It directs the agent's attention toward related navigation details, including the relevant sub-instruction, potential challenges in recognition and ambiguities in grounding, and the targeted viewpoint description. To train the hint generator, we construct a synthetic dataset based on landmarks in the instructions and visible and distinctive objects in the visual environment. We evaluate our method on the R2R and R4R datasets and achieve state-of-the-art on several metrics. The experimental results demonstrate that generating hints not only enhances the navigation performance but also helps improve the interpretability of the agent's actions.

Via

Access Paper or Ask Questions

Common Sense Reasoning for Deep Fake Detection

Jan 31, 2024

Yue Zhang, Ben Colman, Ali Shahriyari, Gaurav Bharaj

Figure 1 for Common Sense Reasoning for Deep Fake Detection

Figure 2 for Common Sense Reasoning for Deep Fake Detection

Figure 3 for Common Sense Reasoning for Deep Fake Detection

Figure 4 for Common Sense Reasoning for Deep Fake Detection

Abstract:State-of-the-art approaches rely on image-based features extracted via neural networks for the deepfake detection binary classification. While these approaches trained in the supervised sense extract likely fake features, they may fall short in representing unnatural `non-physical' semantic facial attributes -- blurry hairlines, double eyebrows, rigid eye pupils, or unnatural skin shading. However, such facial attributes are generally easily perceived by humans via common sense reasoning. Furthermore, image-based feature extraction methods that provide visual explanation via saliency maps can be hard to be interpreted by humans. To address these challenges, we propose the use of common sense reasoning to model deepfake detection, and extend it to the Deepfake Detection VQA (DD-VQA) task with the aim to model human intuition in explaining the reason behind labeling an image as either real or fake. To this end, we introduce a new dataset that provides answers to the questions related to the authenticity of an image, along with its corresponding explanations. We also propose a Vision and Language Transformer-based framework for the DD-VQA task, incorporating text and image aware feature alignment formulations. Finally, we evaluate our method on both the performance of deepfake detection and the quality of the generated explanations. We hope that this task inspires researchers to explore new avenues for enhancing language-based interpretability and cross-modality applications in the realm of deepfake detection.

Via

Access Paper or Ask Questions