Shi Feng

Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong

Oct 19, 2023
Chenglei Si, Navita Goyal, Sherry Tongshuang Wu, Chen Zhao, Shi Feng, Hal Daumé III, Jordan Boyd-Graber

Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they are getting, LLMs should not only provide information but also help users fact-check it. In this paper, we conduct experiments with 80 crowdworkers in total to compare language models with search engines (information retrieval systems) at facilitating fact-checking by human users. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than users of search engines, while achieving similar accuracy. However, they tend to over-rely on the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information - explaining both why the claim could be true and why it could be false - and then present both sides of the explanation to users. This contrastive explanation mitigates users' over-reliance on LLMs, but cannot significantly outperform search engines. Furthermore, showing both search engine results and LLM explanations offers no complementary benefits compared to search engines alone. Taken together, natural language explanations by LLMs may not yet be a reliable replacement for reading the retrieved passages, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences.
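
Below is a minimal illustrative sketch (not the authors' released code) of how one might prompt an LLM for the contrastive explanation described above, asking it to argue both why a claim could be true and why it could be false; the `call_llm` callable is a hypothetical placeholder for any LLM backend.

```python
def build_contrastive_prompt(claim: str) -> str:
    """Ask the model to argue both sides of a claim so users see contrastive evidence."""
    return (
        f"Claim: {claim}\n\n"
        "First, explain why this claim could be TRUE, citing the strongest supporting evidence.\n"
        "Then, explain why this claim could be FALSE, citing the strongest refuting evidence.\n"
        "Label the two parts 'Support:' and 'Refute:'."
    )

def contrastive_explanation(claim: str, call_llm) -> str:
    # call_llm: any function str -> str backed by an LLM (hypothetical placeholder).
    return call_llm(build_contrastive_prompt(claim))
```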

* preprint 

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

Oct 13, 2023
Xiaocui Yang, Wenfang Wu, Shi Feng, Ming Wang, Daling Wang, Yang Li, Qi Sun, Yifei Zhang, Xiaoming Fu, Soujanya Poria

The popularity of multimodal large language models (MLLMs) has triggered a recent surge in research efforts dedicated to evaluating these models. Nevertheless, existing evaluation studies of MLLMs primarily focus on the comprehension and reasoning of unimodal (vision) content, neglecting performance evaluations in the domain of multimodal (vision-language) content understanding. Beyond multimodal reasoning, tasks related to multimodal content comprehension necessitate a profound understanding of multimodal contexts, achieved through multimodal interaction, to obtain a final answer. In this paper, we introduce a comprehensive assessment framework called MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions across a wide spectrum of multimodal content comprehension tasks. Consequently, our work complements research on the performance of MLLMs in multimodal comprehension tasks, achieving a more comprehensive and holistic evaluation of MLLMs. To begin, we employ the Best Performance metric to ascertain each model's performance upper bound on different datasets. Subsequently, the Mean Relative Gain metric offers an assessment of the overall performance of various models and instructions, while the Stability metric measures their sensitivity. Furthermore, previous research centers on evaluating models independently or solely assessing instructions, neglecting the adaptability between models and instructions. We propose the Adaptability metric to quantify the adaptability between models and instructions. Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights. Our code will be released at https://github.com/declare-lab/MM-BigBench.
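
As an illustration of the kind of aggregation such a benchmark performs, the sketch below computes per-model summaries over a toy (models x instructions) accuracy matrix; the exact formulas for Best Performance, Mean Relative Gain, and Stability are defined in the paper, and the computations here are simplified assumptions for illustration only.

```python
import numpy as np

# Toy (models x instructions) accuracy matrix for one dataset; values are made up.
scores = np.array([
    [0.62, 0.58, 0.64],   # model A under 3 instructions
    [0.55, 0.57, 0.53],   # model B
])

best_performance = scores.max(axis=1)       # per-model upper bound over instructions
mean_per_model = scores.mean(axis=1)
# simplified stand-in for a relative-gain style comparison against the global mean
relative_gain = (mean_per_model - scores.mean()) / scores.mean()
stability = scores.std(axis=1)              # sensitivity to the choice of instruction

print(best_performance, relative_gain, stability)
```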

* Under review 

Machine Learning Assists NYC Subway Navigation to Be Safer and Faster

Oct 03, 2023
Wencheng Bao, Shi Feng

Mainstream navigation software, like Google and Apple Maps, often lacks the ability to provide routes that prioritize safety. However, safety remains a paramount concern for many. Our aim is to strike a balance between safety and efficiency. To achieve this, we are devising an Integer Programming model that takes into account both the shortest path and the safest route. We will harness machine learning to derive safety coefficients, employing methodologies such as generalized linear models, linear regression, and recurrent neural networks. Our evaluation will be based on the Root Mean Square Error (RMSE) across various subway stations, helping us identify the most accurate model for safety coefficient estimation. Furthermore, we will conduct a comprehensive review of different shortest-path algorithms, assessing them based on time complexity and real-world data to determine their appropriateness in merging both safety and time efficiency.
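
A minimal sketch of the underlying routing idea (not the paper's Integer Programming model): blend per-edge travel time with a learned safety coefficient into a single cost and run Dijkstra over the resulting weighted graph. The toy network, risk scores, and the trade-off parameter `lam` are invented for illustration.

```python
import heapq

def dijkstra(graph, source, target, lam=0.5):
    # graph: {node: [(neighbor, time_minutes, risk_score), ...]}
    # cost of an edge = (1 - lam) * travel time + lam * risk
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, time_min, risk in graph[u]:
            cost = d + (1 - lam) * time_min + lam * risk
            if cost < dist.get(v, float("inf")):
                dist[v] = cost
                heapq.heappush(heap, (cost, v))
    return float("inf")

# Toy subway graph: station -> [(next station, minutes, learned risk score)]
subway = {
    "A": [("B", 4, 0.2), ("C", 6, 0.05)],
    "B": [("D", 5, 0.4)],
    "C": [("D", 7, 0.1)],
    "D": [],
}
print(dijkstra(subway, "A", "D", lam=0.7))
```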

* 7 pages, 3 figures 

T-COL: Generating Counterfactual Explanations for General User Preferences on Variable Machine Learning Systems

Sep 28, 2023
Ming Wang, Daling Wang, Wenfang Wu, Shi Feng, Yifei Zhang

Machine learning (ML) based systems have been suffering from a lack of interpretability. To address this problem, counterfactual explanations (CEs) have been proposed. CEs are unique as they provide workable suggestions to users, in addition to explaining why a certain outcome was predicted. However, the application of CEs has been hindered by two main challenges, namely general user preferences and variable ML systems. User preferences, in particular, tend to be general rather than specific feature values. Additionally, CEs need to be customized to suit the variability of ML models, while also maintaining robustness even when these validation models change. To overcome these challenges, we propose several possible general user preferences that have been validated by user research and map them to the properties of CEs. We also introduce a new method called Tree-based Conditions Optional Links (T-COL), which has two optional structures and several groups of conditions for generating CEs that can be adapted to general user preferences. Meanwhile, a specific group of conditions leads T-COL to generate more robust CEs that have higher validity when the ML model is replaced. We experimentally compared the properties of CEs generated by T-COL under different user preferences and demonstrated that T-COL is better suited for accommodating user preferences and variable ML systems than baseline methods, including Large Language Models.
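
For intuition, the sketch below shows a generic counterfactual explanation search (not the T-COL method itself): it greedily perturbs the mutable features of an instance, respecting simple user-style conditions such as "this feature may only increase", until the model's prediction flips. The model and data are toy placeholders.

```python
import numpy as np

def greedy_counterfactual(model, x, mutable, step=0.1, max_iters=100):
    """Return a minimally perturbed copy of x whose predicted label is flipped."""
    x_cf = x.copy()
    target = 1 - model(x)                      # flip the binary prediction
    for _ in range(max_iters):
        if model(x_cf) == target:
            return x_cf
        # nudge each mutable feature in its allowed direction
        for i, direction in mutable.items():   # direction: +1 "may increase", -1 "may decrease"
            x_cf[i] += direction * step
    return None

toy_model = lambda x: int(x[0] + x[1] > 1.0)   # stand-in classifier
x = np.array([0.2, 0.3])
print(greedy_counterfactual(toy_model, x, mutable={0: +1}))
```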

RoCar: A Relationship Network-based Evaluation Method to Large Language Models

Jul 29, 2023
Ming Wang, Wenfang Wu, Chongyun Gao, Daling Wang, Shi Feng, Yifei Zhang

Large language models (LLMs) have received increasing attention. However, due to the complexity of their capabilities, how to rationally evaluate the capabilities of LLMs remains an open problem. We propose the RoCar method, which utilizes defined basic schemas to randomly construct a task graph and generates natural language evaluation tasks based on the task graph to evaluate the reasoning and memory abilities of LLMs, respectively. Because the task construction process is highly randomized, it is possible to ensure that none of the LLMs under test has directly learned the evaluation tasks, guaranteeing the fairness of the evaluation method.
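
The sketch below illustrates the general recipe of sampling a random relationship graph and turning it into a natural-language evaluation task; the entities, relation schemas, and question template are invented for illustration and are not the paper's actual schema set.

```python
import random

PEOPLE = ["Alice", "Bob", "Carol", "Dave", "Eve"]
RELATIONS = ["is the parent of", "is a colleague of", "is a neighbor of"]

def random_task_graph(n_edges=4, seed=None):
    """Randomly sample (head, relation, tail) edges over a fixed set of entities."""
    rng = random.Random(seed)
    return [(a, rng.choice(RELATIONS), b)
            for a, b in (rng.sample(PEOPLE, 2) for _ in range(n_edges))]

def to_evaluation_task(edges):
    """Turn the sampled graph into a memory-style question about one of its facts."""
    facts = ". ".join(f"{a} {r} {b}" for a, r, b in edges) + "."
    a, r, b = edges[-1]
    question = f"According to the facts above, who {r} {b}?"
    return facts, question

facts, question = to_evaluation_task(random_task_graph(seed=0))
print(facts)
print(question)
```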

Machine learning feature discovery of spinon Fermi surface

Jun 05, 2023
Kevin Zhang, Shi Feng, Yuri D. Lensky, Nandini Trivedi, Eun-Ah Kim

With rapid progress in simulation of strongly interacting quantum Hamiltonians, the challenge in characterizing unknown phases becomes a bottleneck for scientific progress. We demonstrate that a Quantum-Classical hybrid approach (QuCl) of mining the projective snapshots with interpretable classical machine learning can unveil new signatures of seemingly featureless quantum states. The Kitaev-Heisenberg model on a honeycomb lattice with bond-dependent frustrated interactions presents an ideal system to test QuCl. The model hosts a wealth of quantum spin liquid states: gapped and gapless $\mathbb{Z}_2$ spin liquids, and a chiral spin liquid (CSL) phase in a small external magnetic field. Recently, various simulations have found a new intermediate gapless phase (IGP), sandwiched between the CSL and a partially polarized phase, launching a debate over its elusive nature. We reveal signatures of phases in the model by contrasting two phases pairwise using an interpretable neural network, the correlator convolutional neural network (CCNN). We train the CCNN with a labeled collection of sampled projective measurements and reveal signatures of each phase through regularization path analysis. We show that QuCl reproduces known features of established spin liquid phases and ordered phases. Most significantly, we identify a signature motif of the field-induced IGP in the spin channel perpendicular to the field direction, which we interpret as a signature of Friedel oscillations of gapless spinons forming a Fermi surface. Our predictions can guide future experimental searches for $U(1)$ spin liquids.
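
As a loose, simplified sketch of this workflow (not the correlator convolutional neural network itself), one can build basic two-point correlator features from labeled snapshots of two phases and sweep an L1 penalty to see which features survive, a crude stand-in for regularization-path analysis; the snapshots below are random toy data rather than simulation output.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_sites, n_samples = 16, 200
snapshots = rng.choice([-1, 1], size=(2 * n_samples, n_sites))   # fake +-1 measurements
labels = np.array([0] * n_samples + [1] * n_samples)             # phase A vs. phase B

# nearest-neighbor two-point correlators as interpretable features
features = snapshots[:, :-1] * snapshots[:, 1:]

for C in [0.01, 0.1, 1.0]:                                       # inverse L1 strength
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(features, labels)
    n_active = int((np.abs(clf.coef_) > 1e-6).sum())
    print(f"C={C}: {n_active} active correlator features")
```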

* 8 pages + 8 pages supplemental 

Evaluate What You Can't Evaluate: Unassessable Generated Responses Quality

May 24, 2023
Yongkang Liu, Shi Feng, Daling Wang, Yifei Zhang, Hinrich Schütze

LLMs (large language models) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs show better human alignment than traditional reference-based evaluators, there are many challenges in using them. Reference-free evaluators are more suitable for open-ended examples that admit responses with different semantics. But not all examples are open-ended. For closed-ended examples with a unique correct semantic response, reference-free evaluators will still consider a response high quality even when it is inconsistent with the facts and the semantics of the reference. In order to comprehensively evaluate the reliability of evaluators based on LLMs, we construct two adversarial meta-evaluation dialogue generation datasets, KdConv-ADV and DSTC7-ADV, based on KdConv and DSTC7-AVSD, respectively. Compared to previous meta-evaluation benchmarks, KdConv-ADV and DSTC7-ADV are much more challenging since they require evaluators to reasonably evaluate closed-ended examples with the help of external knowledge or even their own knowledge. Empirical results show that the ability of LLMs to identify unreasonable responses is insufficient. There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
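
To make the failure mode concrete, here is a hand-written illustration (not an item from KdConv-ADV or DSTC7-ADV) of a closed-ended example whose fluent but factually wrong response should receive a low score, together with a reference-free scoring prompt; `score_with_llm` is a hypothetical placeholder for any LLM-backed evaluator.

```python
example = {
    "context": "Who wrote the novel 'Dream of the Red Chamber'?",
    "reference": "Cao Xueqin wrote it during the 18th century.",
    "adversarial_response": "It was written by Lu Xun in the 20th century.",  # fluent but false
}

def reference_free_prompt(context: str, response: str) -> str:
    """Score a response without showing the reference, as a reference-free evaluator would."""
    return (
        f"Dialogue context: {context}\n"
        f"Candidate response: {response}\n"
        "Rate the response quality from 1 (poor) to 5 (excellent). Answer with a number."
    )

def evaluate(example, score_with_llm):
    # score_with_llm: any function str -> str backed by an LLM (hypothetical placeholder).
    return score_with_llm(reference_free_prompt(example["context"],
                                                example["adversarial_response"]))
```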

* preprint 

Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations

May 22, 2023
Chenglei Si, Dan Friedman, Nitish Joshi, Shi Feng, Danqi Chen, He He

In-context learning (ICL) is an important paradigm for adapting large language models (LLMs) to new tasks, but the generalization behavior of ICL remains poorly understood. We investigate the inductive biases of ICL from the perspective of feature bias: which feature ICL is more likely to use given a set of underspecified demonstrations in which two features are equally predictive of the labels. First, we characterize the feature biases of GPT-3 models by constructing underspecified demonstrations from a range of NLP datasets and feature combinations. We find that LLMs exhibit clear feature biases - for example, demonstrating a strong bias to predict labels according to sentiment rather than shallow lexical features, like punctuation. Second, we evaluate the effect of different interventions that are designed to impose an inductive bias in favor of a particular feature, such as adding a natural language instruction or using semantically relevant label words. We find that, while many interventions can influence the learner to prefer a particular feature, it can be difficult to overcome strong prior biases. Overall, our results provide a broader picture of the types of features that ICL may be more likely to exploit and how to impose inductive biases that are better aligned with the intended task.
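
The snippet below sketches what an underspecified prompt of this kind can look like: in every demonstration, sentiment and a shallow lexical feature (an exclamation mark) are perfectly correlated with the label, and the test input disentangles the two. The review texts are invented examples, not items from the paper's datasets.

```python
# Demonstrations where sentiment and '!' are equally predictive of the label.
demos = [
    ("The movie was wonderful!", "positive"),     # positive sentiment + '!'
    ("Absolutely loved the acting!", "positive"),
    ("The plot was dull and slow.", "negative"),  # negative sentiment, no '!'
    ("A boring, forgettable film.", "negative"),
]
test_input = "What a terrible ending!"            # negative sentiment but has '!'

prompt = "\n".join(f"Review: {t}\nLabel: {y}" for t, y in demos)
prompt += f"\nReview: {test_input}\nLabel:"
print(prompt)   # feed to an LLM; the predicted label reveals which feature it relied on
```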

* ACL 2023 

Learning Human-Compatible Representations for Case-Based Decision Support

Mar 06, 2023
Han Liu, Yizhou Tian, Chacha Chen, Shi Feng, Yuxin Chen, Chenhao Tan

Algorithmic case-based decision support provides examples to help humans make sense of predicted labels and to aid them in decision-making tasks. Despite the promising performance of supervised learning, representations learned by supervised models may not align well with human intuitions: what models consider similar examples can be perceived as distinct by humans. As a result, they have limited effectiveness in case-based decision support. In this work, we incorporate ideas from metric learning into supervised learning to examine the importance of alignment for effective decision support. In addition to instance-level labels, we use human-provided triplet judgments to learn human-compatible decision-focused representations. Using both synthetic data and human subject experiments on multiple classification tasks, we demonstrate that such representations are better aligned with human perception than representations solely optimized for classification. Human-compatible representations identify nearest neighbors that are perceived as more similar by humans and allow humans to make more accurate predictions, leading to substantial improvements in human decision accuracy (17.8% in butterfly vs. moth classification and 13.2% in pneumonia classification).
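
A hedged sketch of the general recipe (not the paper's exact architecture, loss weighting, or data): jointly train an encoder with a classification loss on instance labels and a triplet margin loss on human triplet judgments, so the learned representation respects both signals.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
classifier = nn.Linear(16, 2)
triplet_loss = nn.TripletMarginLoss(margin=1.0)
ce_loss = nn.CrossEntropyLoss()
opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)

# toy batch: features, class labels, and human (anchor, positive, negative) triplets
x, y = torch.randn(8, 32), torch.randint(0, 2, (8,))
anchor, pos, neg = torch.randn(4, 32), torch.randn(4, 32), torch.randn(4, 32)

z = encoder(x)
# joint objective: classification term + weighted metric-learning term
loss = ce_loss(classifier(z), y) + 0.5 * triplet_loss(encoder(anchor), encoder(pos), encoder(neg))
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```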

* Accepted at ICLR 2023 

Combinatorial Causal Bandits without Graph Skeleton

Jan 31, 2023
Shi Feng, Nuoya Xiong, Wei Chen

In combinatorial causal bandits (CCB), the learning agent chooses a subset of variables in each round to intervene on and collects feedback from the observed variables to minimize expected regret or sample complexity. Previous works study this problem in both general causal models and binary generalized linear models (BGLMs). However, all of them require prior knowledge of the causal graph structure. This paper studies the CCB problem without the graph structure on binary general causal models and BGLMs. We first provide an exponential lower bound on cumulative regret for the CCB problem on general causal models. To overcome the exponentially large space of parameters, we then consider the CCB problem on BGLMs. We design a regret minimization algorithm for BGLMs even without the graph skeleton and show that it still achieves $O(\sqrt{T}\ln T)$ expected regret. This asymptotic regret matches that of state-of-the-art algorithms that rely on the graph structure. Moreover, at the cost of increasing the regret to $O(T^{\frac{2}{3}}\ln T)$, we remove the dependence on the weight gap that is hidden by the asymptotic notation. Finally, we give some discussions and algorithms for pure exploration in the CCB problem without the graph structure.
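
To illustrate only the interaction protocol (not the paper's BGLM-based algorithm), the toy loop below lets an agent pick a subset of binary variables to intervene on each round, observe a stochastic reward, and apply a simple UCB rule over subsets; note that it enumerates all subsets, exactly the exponential blow-up the paper's algorithms are designed to avoid, and the reward function is invented.

```python
import itertools, random, math

variables = ["X1", "X2", "X3"]
actions = [frozenset(s) for r in range(len(variables) + 1)
           for s in itertools.combinations(variables, r)]

def environment(action):
    # Unknown to the agent: probability of reward 1 depends on the intervened set.
    base = 0.2 + 0.3 * ("X1" in action) + 0.4 * ("X2" in action and "X3" in action)
    return float(random.random() < min(base, 1.0))

counts = {a: 0 for a in actions}
sums = {a: 0.0 for a in actions}
for t in range(1, 2001):
    def ucb(a):
        if counts[a] == 0:
            return float("inf")
        return sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
    a = max(actions, key=ucb)      # choose the intervention set with the highest UCB
    r = environment(a)             # observe stochastic reward feedback
    counts[a] += 1
    sums[a] += r

best = max(actions, key=lambda a: sums[a] / counts[a] if counts[a] else 0.0)
print(sorted(best), sums[best] / counts[best])
```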

* 36 pages, 2 figures 