Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

James Zou

Shammie

Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits

May 06, 2024

Jiachen T. Wang, Tianji Yang, James Zou, Yongchan Kwon, Ruoxi Jia

Abstract:Data Shapley provides a principled approach to data valuation and plays a crucial role in data-centric machine learning (ML) research. Data selection is considered a standard application of Data Shapley. However, its data selection performance has shown to be inconsistent across settings in the literature. This study aims to deepen our understanding of this phenomenon. We introduce a hypothesis testing framework and show that Data Shapley's performance can be no better than random selection without specific constraints on utility functions. We identify a class of utility functions, monotonically transformed modular functions, within which Data Shapley optimally selects data. Based on this insight, we propose a heuristic for predicting Data Shapley's effectiveness in data selection tasks. Our experiments corroborate these findings, adding new insights into when Data Shapley may or may not succeed.

* ICML 2024

Via

Access Paper or Ask Questions

Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs

Apr 29, 2024

Valeriia Cherepanova, James Zou

Figure 1 for Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs

Figure 2 for Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs

Figure 3 for Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs

Figure 4 for Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs

Abstract:Large language models (LLMs) exhibit excellent ability to understand human languages, but do they also understand their own language that appears gibberish to us? In this work we delve into this question, aiming to uncover the mechanisms underlying such behavior in LLMs. We employ the Greedy Coordinate Gradient optimizer to craft prompts that compel LLMs to generate coherent responses from seemingly nonsensical inputs. We call these inputs LM Babel and this work systematically studies the behavior of LLMs manipulated by these prompts. We find that the manipulation efficiency depends on the target text's length and perplexity, with the Babel prompts often located in lower loss minima compared to natural prompts. We further examine the structure of the Babel prompts and evaluate their robustness. Notably, we find that guiding the model to generate harmful texts is not more difficult than into generating benign texts, suggesting lack of alignment for out-of-distribution prompts.

Via

Access Paper or Ask Questions

Optimizing Calibration by Gaining Aware of Prediction Correctness

Apr 19, 2024

Yuchi Liu, Lei Wang, Yuli Zou, James Zou, Liang Zheng

Figure 1 for Optimizing Calibration by Gaining Aware of Prediction Correctness

Figure 2 for Optimizing Calibration by Gaining Aware of Prediction Correctness

Figure 3 for Optimizing Calibration by Gaining Aware of Prediction Correctness

Figure 4 for Optimizing Calibration by Gaining Aware of Prediction Correctness

Abstract:Model calibration aims to align confidence with prediction correctness. The Cross-Entropy CE) loss is widely used for calibrator training, which enforces the model to increase confidence on the ground truth class. However, we find the CE loss has intrinsic limitations. For example, for a narrow misclassification, a calibrator trained by the CE loss often produces high confidence on the wrongly predicted class (e.g., a test sample is wrongly classified and its softmax score on the ground truth class is around 0.4), which is undesirable. In this paper, we propose a new post-hoc calibration objective derived from the aim of calibration. Intuitively, the proposed objective function asks that the calibrator decrease model confidence on wrongly predicted samples and increase confidence on correctly predicted samples. Because a sample itself has insufficient ability to indicate correctness, we use its transformed versions (e.g., rotated, greyscaled and color-jittered) during calibrator training. Trained on an in-distribution validation set and tested with isolated, individual test samples, our method achieves competitive calibration performance on both in-distribution and out-of-distribution test sets compared with the state of the art. Further, our analysis points out the difference between our method and commonly used objectives such as CE loss and mean square error loss, where the latters sometimes deviates from the calibration aim.

Via

Access Paper or Ask Questions

STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

Apr 19, 2024

Shirley Wu, Shiyu Zhao, Michihiro Yasunaga, Kexin Huang, Kaidi Cao, Qian Huang, Vassilis N. Ioannidis, Karthik Subbian, James Zou, Jure Leskovec

Figure 1 for STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

Figure 2 for STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

Figure 3 for STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

Figure 4 for STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases

Abstract:Answering real-world user queries, such as product search, often requires accurate retrieval of information from semi-structured knowledge bases or databases that involve blend of unstructured (e.g., textual descriptions of products) and structured (e.g., entity relations of products) information. However, previous works have mostly studied textual and relational retrieval tasks as separate topics. To address the gap, we develop STARK, a large-scale Semi-structure retrieval benchmark on Textual and Relational Knowledge Bases. We design a novel pipeline to synthesize natural and realistic user queries that integrate diverse relational information and complex textual properties, as well as their ground-truth answers. Moreover, we rigorously conduct human evaluation to validate the quality of our benchmark, which covers a variety of practical applications, including product recommendations, academic paper searches, and precision medicine inquiries. Our benchmark serves as a comprehensive testbed for evaluating the performance of retrieval systems, with an emphasis on retrieval approaches driven by large language models (LLMs). Our experiments suggest that the STARK datasets present significant challenges to the current retrieval and LLM systems, indicating the demand for building more capable retrieval systems that can handle both textual and relational aspects.

* 25 pages, 7 figures

Via

Access Paper or Ask Questions

How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior

Apr 16, 2024

Kevin Wu, Eric Wu, James Zou

Figure 1 for How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior

Figure 2 for How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior

Figure 3 for How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior

Figure 4 for How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior

Abstract:Retrieval augmented generation (RAG) is often used to fix hallucinations and provide up-to-date knowledge for large language models (LLMs). However, in cases when the LLM alone incorrectly answers a question, does providing the correct retrieved content always fix the error? Conversely, in cases where the retrieved content is incorrect, does the LLM know to ignore the wrong information, or does it recapitulate the error? To answer these questions, we systematically analyze the tug-of-war between a LLM's internal knowledge (i.e. its prior) and the retrieved information in settings when they disagree. We test GPT-4 and other LLMs on question-answering abilities across datasets with and without reference documents. As expected, providing the correct retrieved information fixes most model mistakes (94% accuracy). However, when the reference document is perturbed with increasing levels of wrong values, the LLM is more likely to recite the incorrect, modified information when its internal prior is weaker but is more resistant when its prior is stronger. Similarly, we also find that the more the modified information deviates from the model's prior, the less likely the model is to prefer it. These results highlight an underlying tension between a model's prior knowledge and the information presented in reference documents.

Via

Access Paper or Ask Questions

Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

Mar 04, 2024

Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, James Zou

Figure 1 for Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

Figure 2 for Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

Figure 3 for Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

Figure 4 for Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

Abstract:Many recent state-of-the-art results in language tasks were achieved using compound systems that perform multiple Large Language Model (LLM) calls and aggregate their responses. However, there is little understanding of how the number of LLM calls -- e.g., when asking the LLM to answer each question multiple times and taking a consensus -- affects such a compound system's performance. In this paper, we initiate the study of scaling laws of compound inference systems. We analyze, theoretically and empirically, how the number of LLM calls affects the performance of one-layer Voting Inference Systems -- one of the simplest compound systems, which aggregates LLM responses via majority voting. We find empirically that across multiple language tasks, surprisingly, Voting Inference Systems' performance first increases but then decreases as a function of the number of LLM calls. Our theoretical results suggest that this non-monotonicity is due to the diversity of query difficulties within a task: more LLM calls lead to higher performance on "easy" queries, but lower performance on "hard" queries, and non-monotone behavior emerges when a task contains both types of queries. This insight then allows us to compute, from a small number of samples, the number of LLM calls that maximizes system performance, and define a scaling law of Voting Inference Systems. Experiments show that our scaling law can predict the performance of Voting Inference Systems and find the optimal number of LLM calls to make.

Via

Access Paper or Ask Questions

Simple linear attention language models balance the recall-throughput tradeoff

Feb 28, 2024

Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré

Figure 1 for Simple linear attention language models balance the recall-throughput tradeoff

Figure 2 for Simple linear attention language models balance the recall-throughput tradeoff

Figure 3 for Simple linear attention language models balance the recall-throughput tradeoff

Figure 4 for Simple linear attention language models balance the recall-throughput tradeoff

Abstract:Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottle-necked during inference by the KV-cache's aggressive memory consumption. In this work, we explore whether we can improve language model efficiency (e.g. by reducing memory consumption) without compromising on recall. By applying experiments and theory to a broad set of architectures, we identify a key tradeoff between a model's state size and recall ability. We show that efficient alternatives to attention (e.g. H3, Mamba, RWKV) maintain a fixed-size recurrent state, but struggle at recall. We propose BASED a simple architecture combining linear and sliding window attention. By varying BASED window size and linear attention feature dimension, we can dial the state size and traverse the pareto frontier of the recall-memory tradeoff curve, recovering the full quality of attention on one end and the small state size of attention-alternatives on the other. We train language models up to 1.3b parameters and show that BASED matches the strongest sub-quadratic models (e.g. Mamba) in perplexity and outperforms them on real-world recall-intensive tasks by 6.22 accuracy points. Implementations of linear attention are often less efficient than optimized standard attention implementations. To make BASED competitive, we develop IO-aware algorithms that enable 24x higher throughput on language generation than FlashAttention-2, when generating 1024 tokens using 1.3b parameter models. Code for this work is provided at: https://github.com/HazyResearch/based.

Via

Access Paper or Ask Questions

Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content

Feb 21, 2024

Federico Bianchi, James Zou

Abstract:The risks derived from large language models (LLMs) generating deceptive and damaging content have been the subject of considerable research, but even safe generations can lead to problematic downstream impacts. In our study, we shift the focus to how even safe text coming from LLMs can be easily turned into potentially dangerous content through Bait-and-Switch attacks. In such attacks, the user first prompts LLMs with safe questions and then employs a simple find-and-replace post-hoc technique to manipulate the outputs into harmful narratives. The alarming efficacy of this approach in generating toxic content highlights a significant challenge in developing reliable safety guardrails for LLMs. In particular, we stress that focusing on the safety of the verbatim LLM outputs is insufficient and that we also need to consider post-hoc transformations.

Via

Access Paper or Ask Questions

Prospector Heads: Generalized Feature Attribution for Large Models & Data

Feb 18, 2024

Gautam Machiraju, Alexander Derry, Arjun Desai, Neel Guha, Amir-Hossein Karimi, James Zou, Russ Altman, Christopher Ré, Parag Mallick

Abstract:Feature attribution, the ability to localize regions of the input data that are relevant for classification, is an important capability for machine learning models in scientific and biomedical domains. Current methods for feature attribution, which rely on "explaining" the predictions of end-to-end classifiers, suffer from imprecise feature localization and are inadequate for use with small sample sizes and high-dimensional datasets due to computational challenges. We introduce prospector heads, an efficient and interpretable alternative to explanation-based methods for feature attribution that can be applied to any encoder and any data modality. Prospector heads generalize across modalities through experiments on sequences (text), images (pathology), and graphs (protein structures), outperforming baseline attribution methods by up to 49 points in mean localization AUPRC. We also demonstrate how prospector heads enable improved interpretation and discovery of class-specific patterns in the input data. Through their high performance, flexibility, and generalizability, prospectors provide a framework for improving trust and transparency for machine learning models in complex domains.

* 22 pages, 12 figures, 6 tables

Via

Access Paper or Ask Questions

How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis

Feb 08, 2024

Federico Bianchi, Patrick John Chia, Mert Yuksekgonul, Jacopo Tagliabue, Dan Jurafsky, James Zou

Figure 1 for How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis

Figure 2 for How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis

Figure 3 for How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis

Figure 4 for How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis

Abstract:Negotiation is the basis of social interactions; humans negotiate everything from the price of cars to how to share common resources. With rapidly growing interest in using large language models (LLMs) to act as agents on behalf of human users, such LLM agents would also need to be able to negotiate. In this paper, we study how well LLMs can negotiate with each other. We develop NegotiationArena: a flexible framework for evaluating and probing the negotiation abilities of LLM agents. We implemented three types of scenarios in NegotiationArena to assess LLM's behaviors in allocating shared resources (ultimatum games), aggregate resources (trading games) and buy/sell goods (price negotiations). Each scenario allows for multiple turns of flexible dialogues between LLM agents to allow for more complex negotiations. Interestingly, LLM agents can significantly boost their negotiation outcomes by employing certain behavioral tactics. For example, by pretending to be desolate and desperate, LLMs can improve their payoffs by 20\% when negotiating against the standard GPT-4. We also quantify irrational negotiation behaviors exhibited by the LLM agents, many of which also appear in humans. Together, \NegotiationArena offers a new environment to investigate LLM interactions, enabling new insights into LLM's theory of mind, irrationality, and reasoning abilities.

Via

Access Paper or Ask Questions