Adam Pearce

Auditing language models for hidden objectives

Mar 14, 2025

Samuel Marks, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, Emmanuel Ameisen, Joshua Batson, Tim Belonax(+25 more)

Abstract: We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model's hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model's hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.

M-SET: Multi-Drone Swarm Intelligence Experimentation with Collision Avoidance Realism

Jun 16, 2024

Chuhao Qin, Alexander Robins, Callum Lillywhite-Roake, Adam Pearce, Hritik Mehta, Scott James, Tsz Ho Wong, Evangelos Pournaras

Abstract: Distributed sensing by cooperative drone swarms is crucial for several Smart City applications, such as traffic monitoring and disaster response. Using an indoor lab with inexpensive drones, a testbed supports complex and ambitious studies on these systems while maintaining low cost, rigor, and external validity. This paper introduces the Multi-drone Sensing Experimentation Testbed (M-SET), a novel platform designed to prototype, develop, test, and evaluate distributed sensing with swarm intelligence. M-SET addresses the limitations of existing testbeds that fail to emulate collisions, thus lacking realism in outdoor environments. By integrating a collision avoidance method based on a potential field algorithm, M-SET ensures collision-free navigation and sensing, further optimized via a multi-agent collective learning algorithm. Extensive evaluation demonstrates accurate energy consumption estimation and a low risk of collisions, providing a robust proof-of-concept. New insights show that M-SET has significant potential to support ambitious research with minimal cost, simplicity, and high sensing quality.

* 7 pages, 7 figures. This work has been submitted to the IEEE conference

Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models

Jan 12, 2024

Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva

Abstract: Inspecting the information encoded in hidden representations of large language models (LLMs) can explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of questions about an LLM's computation. We show that prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as instances of this framework. Moreover, several of their shortcomings such as failure in inspecting early layers or lack of expressivity can be mitigated by Patchscopes. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and unlocks new applications such as self-correction in multi-hop reasoning.

Acquisition of Chess Knowledge in AlphaZero

Nov 27, 2021

Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, Adam Pearce, Demis Hassabis, Been Kim, Ulrich Paquet, Vladimir Kramnik

Abstract: What is learned by sophisticated neural network agents such as AlphaZero? This question is of both scientific and practical interest. If the representations of strong neural networks bear no resemblance to human concepts, our ability to understand faithful explanations of their decisions will be restricted, ultimately limiting what we can achieve with neural network interpretability. In this work we provide evidence that human knowledge is acquired by the AlphaZero neural network as it trains on the game of chess. By probing for a broad range of human chess concepts we show when and where these concepts are represented in the AlphaZero network. We also provide a behavioural analysis focusing on opening play, including qualitative analysis from chess Grandmaster Vladimir Kramnik. Finally, we carry out a preliminary investigation looking at the low-level details of AlphaZero's representations, and make the resulting behavioural and representational analyses available online.

* 69 pages, 44 figures

An Interpretability Illusion for BERT

Apr 14, 2021

Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, Martin Wattenberg

Abstract: We describe an "interpretability illusion" that arises when analyzing the BERT model. Activations of individual neurons in the network may spuriously appear to encode a single, simple concept, when in fact they are encoding something far more complex. The same effect holds for linear combinations of activations. We trace the source of this illusion to geometric properties of BERT's embedding space as well as the fact that common text corpora represent only narrow slices of possible English sentences. We provide a taxonomy of model-learned concepts and discuss methodological implications for interpretability research, especially the importance of testing hypotheses on multiple data sets.

Visualizing and Measuring the Geometry of BERT

Jun 06, 2019

Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, Martin Wattenberg

Abstract: Transformer architectures show significant promise for natural language processing. Given that a single pretrained model can be fine-tuned to perform well on many different tasks, these networks appear to extract generally useful linguistic features. A natural question is how such networks represent this information internally. This paper describes qualitative and quantitative investigations of one particularly effective model, BERT. At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces. We find evidence of a fine-grained geometric representation of word senses. We also present empirical descriptions of syntactic representations in both attention matrices and individual word embeddings, as well as a mathematical argument to explain the geometry of these representations.

* 8 pages, 5 figures
