Robin Jia

Mechanistic Interpretability of Emotion Inference in Large Language Models

Feb 08, 2025

Ala N. Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, Jonathan Gratch

Abstract:Large language models (LLMs) show promising capabilities in predicting human emotions from text. However, the mechanisms through which these models process emotional stimuli remain largely unexplored. Our study addresses this gap by investigating how autoregressive LLMs infer emotions, showing that emotion representations are functionally localized to specific regions in the model. Our evaluation includes diverse model families and sizes and is supported by robustness checks. We then show that the identified representations are psychologically plausible by drawing on cognitive appraisal theory, a well-established psychological framework positing that emotions emerge from evaluations (appraisals) of environmental stimuli. By causally intervening on construed appraisal concepts, we steer the generation and show that the outputs align with theoretical and intuitive expectations. This work highlights a novel way to causally intervene and precisely shape emotional text generation, potentially benefiting safety and alignment in sensitive affective domains.

* To be submitted to the Association for Computational Linguistics (ACL 2025)

Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics

Jan 24, 2025

Ameya Godbole, Robin Jia

Abstract:Improvements in large language models have led to increasing optimism that they can serve as reliable evaluators of natural language generation outputs. In this paper, we challenge this optimism by thoroughly re-evaluating five state-of-the-art factuality metrics on a collection of 11 datasets for summarization, retrieval-augmented generation, and question answering. We find that these evaluators are inconsistent with each other and often misestimate system-level performance, both of which can lead to a variety of pitfalls. We further show that these metrics exhibit biases against highly paraphrased outputs and outputs that draw upon faraway parts of the source documents. We urge users of these factuality metrics to proceed with caution and manually validate the reliability of these metrics in their domain of interest before proceeding.

TLDR: Token-Level Detective Reward Model for Large Vision Language Models

Oct 07, 2024

Deqing Fu, Tong Xiao, Rui Wang, Wang Zhu, Pengchuan Zhang, Guan Pang, Robin Jia, Lawrence Chen

Abstract:Although reward models have been successful in improving multimodal large language models, the reward models themselves remain brutal and contain minimal information. Notably, existing reward models mimic human annotations by assigning only a single binary feedback signal to any text, no matter how long the text is. In the realm of multimodal language models, where models are required to process both images and texts, a naive reward model may learn implicit biases toward texts and become less grounded in images. In this paper, we propose a $\textbf{T}$oken-$\textbf{L}$evel $\textbf{D}$etective $\textbf{R}$eward Model ($\textbf{TLDR}$) to provide fine-grained annotations to each text token. We first introduce a perturbation-based method to generate synthetic hard negatives and their token-level labels to train TLDR models. Then we show the rich usefulness of TLDR models both in assisting off-the-shelf models to self-correct their generations, and in serving as a hallucination evaluation tool. Finally, we show that TLDR models can significantly speed up human annotation by 3 times to acquire a broader range of high-quality vision language data.

* Work done at Meta

Rethinking Backdoor Detection Evaluation for Language Models

Aug 31, 2024

Jun Yan, Wenjie Jacky Mo, Xiang Ren, Robin Jia

Abstract:Backdoor attacks, in which a model behaves maliciously when given an attacker-specified trigger, pose a major security risk for practitioners who depend on publicly released language models. Backdoor detection methods aim to detect whether a released model contains a backdoor, so that practitioners can avoid such vulnerabilities. While existing backdoor detection methods have high accuracy in detecting backdoored models on standard benchmarks, it is unclear whether they can robustly identify backdoors in the wild. In this paper, we examine the robustness of backdoor detectors by manipulating different factors during backdoor planting. We find that the success of existing methods highly depends on how intensely the model is trained on poisoned data during backdoor planting. Specifically, backdoors planted with either more aggressive or more conservative training are significantly more difficult to detect than the default ones. Our results highlight a lack of robustness of existing backdoor detectors and the limitations in current benchmark construction.

When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models

Jun 19, 2024

Ting-Yun Chang, Jesse Thomason, Robin Jia

Abstract:This paper studies in-context learning (ICL) by decomposing the output of large language models into the individual contributions of attention heads and MLPs (components). We observe curious components: good-performing ones that individually do well on a classification task, even when the model performs poorly; bad-performing ones that do much worse than chance; and label-biased components that always predict the same label. We find that component accuracies are well-correlated across different demonstration sets and perturbations of prompt templates, even when the full-model accuracy varies greatly. Based on our findings, we propose component reweighting, which learns to linearly re-scale the component activations from a few labeled examples. Given 24 labeled examples, our method improves by an average of 6.0% accuracy points over 24-shot ICL across 8 tasks on Llama-2-7B. Overall, this paper both enriches our understanding of ICL and provides a practical method for improvement by examining model internals.
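
The idea of linearly re-scaling component contributions can be illustrated with a minimal sketch. It assumes that the full model's logits are a sum of per-component logit contributions and learns one scalar weight per component by gradient descent on cross-entropy. The function names, the toy data, and the training settings (`steps`, `lr`) are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine(example, weights):
    """Weighted sum of per-component logits for one example."""
    n_classes = len(example[0])
    return [sum(w * comp[k] for w, comp in zip(weights, example))
            for k in range(n_classes)]

def fit_component_weights(examples, labels, steps=100, lr=0.5):
    """Learn one scalar weight per component with plain gradient descent."""
    n_components = len(examples[0])
    weights = [1.0] * n_components  # start from the unweighted sum
    for _ in range(steps):
        grads = [0.0] * n_components
        for example, label in zip(examples, labels):
            probs = softmax(combine(example, weights))
            for c, comp in enumerate(example):
                for k, p in enumerate(probs):
                    target = 1.0 if k == label else 0.0
                    # Cross-entropy gradient with respect to weight c.
                    grads[c] += (p - target) * comp[k] / len(examples)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

def accuracy(examples, labels, weights):
    """Fraction of examples whose highest combined logit matches the label."""
    correct = 0
    for example, label in zip(examples, labels):
        logits = combine(example, weights)
        correct += int(logits.index(max(logits)) == label)
    return correct / len(examples)

# Toy data (made up): component 0 votes for the right label,
# component 1 votes more strongly for the wrong label.
labels = [0, 1, 0, 1]
examples = [
    [[1.0, 0.0] if y == 0 else [0.0, 1.0],
     [0.0, 2.0] if y == 0 else [2.0, 0.0]]
    for y in labels
]
learned = fit_component_weights(examples, labels)
```

In this toy setup the unweighted sum is dominated by the misleading component, while the learned weights shift mass toward the helpful one. Real components would be attention heads and MLPs whose contributions are extracted from a model, which this sketch does not attempt.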

Pre-trained Large Language Models Use Fourier Features to Compute Addition

Jun 05, 2024

Tianyi Zhou, Deqing Fu, Vatsal Sharan, Robin Jia

Abstract:Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-trained LLMs add numbers using Fourier features -- dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. Within the model, MLP and attention layers use Fourier features in complementary ways: MLP layers primarily approximate the magnitude of the answer using low-frequency features, while attention layers primarily perform modular addition (e.g., computing whether the answer is even or odd) using high-frequency features. Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy. Introducing pre-trained token embeddings to a randomly initialized model rescues its performance. Overall, our analysis demonstrates that appropriate pre-trained representations (e.g., Fourier features) can unlock the ability of Transformers to learn precise mechanisms for algorithmic tasks.
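
To see why periodic features suit modular addition, consider a small sketch based on a standard trigonometric fact: if an integer is represented as a point on the unit circle with a given period, multiplying two such points adds their angles, so the product encodes the sum modulo that period. A period of 2 recovers parity (even or odd). The helper names below are hypothetical, and the code illustrates the underlying identity only; it is not a reconstruction of the mechanism inside any model.

```python
import cmath
import math

def encode(n, period):
    """Represent integer n as a point on the unit circle with the given period."""
    return cmath.exp(2j * math.pi * n / period)

def decode(z, period):
    """Read the residue (mod period) back from a unit-circle point."""
    angle = cmath.phase(z)  # angle in [-pi, pi]
    return round(angle * period / (2 * math.pi)) % period

def add_mod(a, b, period):
    """Compute (a + b) mod period using only the periodic encodings."""
    # Multiplying unit-circle points adds their angles.
    return decode(encode(a, period) * encode(b, period), period)
```

For example, `add_mod(7, 8, 10)` recovers the last digit of 15, and `add_mod(3, 4, 2)` recovers its parity. A single period cannot recover the magnitude of the sum, which is consistent with the abstract's description of low-frequency features handling the approximate magnitude.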

Language Models can Infer Action Semantics for Classical Planners from Environment Feedback

Jun 04, 2024

Wang Zhu, Ishika Singh, Robin Jia, Jesse Thomason

Abstract:Classical planning approaches guarantee finding a set of actions that can achieve a given goal state when possible, but require an expert to specify logical action semantics that govern the dynamics of the environment. Researchers have shown that Large Language Models (LLMs) can be used to directly infer planning steps based on commonsense knowledge and minimal domain information alone, but such plans often fail on execution. We bring together the strengths of classical planning and LLM commonsense inference to perform domain induction, learning and validating action pre- and post-conditions based on closed-loop interactions with the environment itself. We propose PSALM, which leverages LLM inference to heuristically complete partial plans emitted by a classical planner given partial domain knowledge, as well as to infer the semantic rules of the domain in a logical language based on environment feedback after execution. Our analysis on 7 environments shows that with just one expert-curated example plan, using LLMs as heuristic planners and rule predictors achieves lower environment execution steps and environment resets than random exploration while simultaneously recovering the underlying ground truth action semantics of the domain.

IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations

Apr 02, 2024

Deqing Fu, Ghazal Khalighinejad, Ollie Liu, Bhuwan Dhingra, Dani Yogatama, Robin Jia, Willie Neiswanger

Abstract:Current foundation models exhibit impressive capabilities when prompted either with text only or with both image and text inputs. But do their capabilities change depending on the input modality? In this work, we propose $\textbf{IsoBench}$, a benchmark dataset containing problems from four major areas: math, science, algorithms, and games. Each example is presented with multiple $\textbf{isomorphic representations}$ of inputs, such as visual, textual, and mathematical presentations. IsoBench provides fine-grained feedback to diagnose performance gaps caused by the form of the representation. Across various foundation models, we observe that on the same problem, models have a consistent preference towards textual representations. Most prominently, when evaluated on all IsoBench problems, Claude-3 Opus performs 28.7 points worse when provided with images instead of text; similarly, GPT-4 Turbo is 18.7 points worse and Gemini Pro is 14.9 points worse. Finally, we present two prompting techniques, $\textit{IsoCombination}$ and $\textit{IsoScratchPad}$, which improve model performance by considering combinations of, and translations between, different input representations.

Proving membership in LLM pretraining data via data watermarks

Feb 16, 2024

Johnny Tian-Zheng Wei, Ryan Yixiang Wang, Robin Jia

Abstract:Detecting whether copyright holders' works were used in LLM pretraining is poised to be an important problem. This work proposes using data watermarks to enable principled detection with only black-box model access, provided that the rightholder contributed multiple training documents and watermarked them before public release. By applying a randomly sampled data watermark, detection can be framed as hypothesis testing, which provides guarantees on the false detection rate. We study two watermarks: one that inserts random sequences, and another that randomly substitutes characters with Unicode lookalikes. We first show how three aspects of watermark design -- watermark length, number of duplications, and interference -- affect the power of the hypothesis test. Next, we study how a watermark's detection strength changes under model and dataset scaling: while increasing the dataset size decreases the strength of the watermark, watermarks remain strong if the model size also increases. Finally, we view SHA hashes as natural watermarks and show that we can robustly detect hashes from BLOOM-176B's training data, as long as they occurred at least 90 times. Together, our results point towards a promising future for data watermarks in real world use.
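
The hypothesis-testing framing can be sketched as follows, under stated assumptions. Suppose a black-box model yields a loss for any candidate watermark string, and the rightholder compares the loss on the watermark actually inserted against losses on other watermarks drawn from the same random distribution. Under the null hypothesis (the model never saw the watermarked documents), the inserted watermark is exchangeable with the others, so a rank-based p-value bounds the false detection rate. The function names are hypothetical and the loss values would come from a real model; none of this is the paper's exact procedure.

```python
import random
import string

def sample_watermark(length, rng):
    """Sample a random lowercase character sequence to use as a watermark."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def empirical_p_value(observed_loss, null_losses):
    """Rank-based p-value: share of null watermarks with loss at least as low.

    The +1 terms count the observed watermark itself, so the p-value is
    never exactly zero with a finite number of null samples.
    """
    at_least_as_low = sum(1 for loss in null_losses if loss <= observed_loss)
    return (1 + at_least_as_low) / (1 + len(null_losses))

def detect(observed_loss, null_losses, alpha=0.05):
    """Flag the model as trained on the watermark if the p-value is below alpha."""
    return empirical_p_value(observed_loss, null_losses) < alpha

# Candidate watermarks a rightholder might score against the model.
rng = random.Random(0)
candidates = [sample_watermark(10, rng) for _ in range(5)]
```

With this rank-based test, the smallest achievable p-value is 1 / (1 + number of null watermarks), so a significance level of 0.05 needs at least 20 null samples before any detection is possible.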

Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?

Dec 02, 2023

Wang Zhu, Ishika Singh, Yuan Huang, Robin Jia, Jesse Thomason

Abstract:Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy. But: does that noise matter? We find that nonsensical or irrelevant language instructions during pretraining can have little effect on downstream performance for both HAMT and VLN-BERT on R2R, and are still better than only using clean, human data. To underscore these results, we concoct an efficient augmentation method, Unigram + Object, which generates nonsensical instructions that nonetheless improve downstream performance. Our findings suggest that what matters for VLN R2R pretraining is the quantity of visual trajectories, not the quality of instructions.

* Accepted by O-DRUM @ CVPR 2023
