Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Martin Wattenberg

Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner

Jun 17, 2024

Kenneth Li, Yiming Wang, Fernanda Viégas, Martin Wattenberg

Figure 1 for Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner

Figure 2 for Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner

Figure 3 for Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner

Figure 4 for Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner

Abstract:We present an approach called Dialogue Action Tokens (DAT) that adapts language model agents to plan goal-directed dialogues. The core idea is to treat each utterance as an action, thereby converting dialogues into games where existing approaches such as reinforcement learning can be applied. Specifically, we freeze a pretrained language model and train a small planner model that predicts a continuous action vector, used for controlled generation in each round. This design avoids the problem of language degradation under reward optimization. When evaluated on the Sotopia platform for social simulations, the DAT-steered LLaMA model surpasses GPT-4's performance. We also apply DAT to steer an attacker language model in a novel multi-turn red-teaming setting, revealing a potential new attack surface.

* Code: https://github.com/likenneth/dialogue_action_token

Via

Access Paper or Ask Questions

Designing a Dashboard for Transparency and Control of Conversational AI

Jun 12, 2024

Yida Chen, Aoyu Wu, Trevor DePodesta, Catherine Yeh, Kenneth Li, Nicholas Castillo Marin, Oam Patel, Jan Riecke, Shivam Raval, Olivia Seow(+2 more)

Figure 1 for Designing a Dashboard for Transparency and Control of Conversational AI

Figure 2 for Designing a Dashboard for Transparency and Control of Conversational AI

Figure 3 for Designing a Dashboard for Transparency and Control of Conversational AI

Figure 4 for Designing a Dashboard for Transparency and Control of Conversational AI

Abstract:Conversational LLMs function as black box systems, leaving users guessing about why they see the output they do. This lack of transparency is potentially problematic, especially given concerns around bias and truthfulness. To address this issue, we present an end-to-end prototype-connecting interpretability techniques with user experience design-that seeks to make chatbots more transparent. We begin by showing evidence that a prominent open-source LLM has a "user model": examining the internal state of the system, we can extract data related to a user's age, gender, educational level, and socioeconomic status. Next, we describe the design of a dashboard that accompanies the chatbot interface, displaying this user model in real time. The dashboard can also be used to control the user model and the system's behavior. Finally, we discuss a study in which users conversed with the instrumented system. Our results suggest that users appreciate seeing internal states, which helped them expose biased behavior and increased their sense of control. Participants also made valuable suggestions that point to future directions for both design and machine learning research. The project page and video demo of our TalkTuner system are available at https://bit.ly/talktuner-project-page

* Project page: https://bit.ly/talktuner-project-page 38 pages, 23 figures

Via

Access Paper or Ask Questions

Q-Probe: A Lightweight Approach to Reward Maximization for Language Models

Feb 22, 2024

Kenneth Li, Samy Jelassi, Hugh Zhang, Sham Kakade, Martin Wattenberg, David Brandfonbrener

Figure 1 for Q-Probe: A Lightweight Approach to Reward Maximization for Language Models

Figure 2 for Q-Probe: A Lightweight Approach to Reward Maximization for Language Models

Figure 3 for Q-Probe: A Lightweight Approach to Reward Maximization for Language Models

Figure 4 for Q-Probe: A Lightweight Approach to Reward Maximization for Language Models

Abstract:We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function. At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few shot prompting, but can also be combined with either. The idea is to learn a simple linear function on a model's embedding space that can be used to reweight candidate completions. We theoretically show that this sampling procedure is equivalent to a KL-constrained maximization of the Q-probe as the number of samples increases. To train the Q-probes we consider either reward modeling or a class of novel direct policy learning objectives based on importance weighted policy gradients. With this technique, we see gains in domains with ground-truth rewards (code generation) as well as implicit rewards defined by preference data, even outperforming finetuning in data-limited regimes. Moreover, a Q-probe can be trained on top of an API since it only assumes access to sampling and embeddings. Code: https://github.com/likenneth/q_probe .

Via

Access Paper or Ask Questions

Measuring and Controlling Persona Drift in Language Model Dialogs

Feb 13, 2024

Kenneth Li, Tianle Liu, Naomi Bashkansky, David Bau, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg

Figure 1 for Measuring and Controlling Persona Drift in Language Model Dialogs

Figure 2 for Measuring and Controlling Persona Drift in Language Model Dialogs

Figure 3 for Measuring and Controlling Persona Drift in Language Model Dialogs

Figure 4 for Measuring and Controlling Persona Drift in Language Model Dialogs

Abstract:Prompting is a standard tool for customizing language-model chatbots, enabling them to take on a specific "persona". An implicit assumption in the use of prompts is that they will be stable, so the chatbot will continue to generate text according to the stipulated persona for the duration of a conversation. We propose a quantitative benchmark to test this assumption, evaluating persona stability via self-chats between two personalized chatbots. Testing popular models like LLaMA2-chat-70B, we reveal a significant persona drift within eight rounds of conversations. An empirical and theoretical analysis of this phenomenon suggests the transformer attention mechanism plays a role, due to attention decay over long exchanges. To combat attention decay and persona drift, we propose a lightweight method called split-softmax, which compares favorably against two strong baselines.

* Code: https://github.com/likenneth/persona_drift

Via

Access Paper or Ask Questions

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Jan 03, 2024

Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea

Figure 1 for A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Figure 2 for A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Figure 3 for A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Figure 4 for A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity

Abstract:While alignment algorithms are now commonly used to tune pre-trained language models towards a user's preferences, we lack explanations for the underlying mechanisms in which models become ``aligned'', thus making it difficult to explain phenomena like jailbreaks. In this work we study a popular algorithm, direct preference optimization (DPO), and the mechanisms by which it reduces toxicity. Namely, we first study how toxicity is represented and elicited in a pre-trained language model, GPT2-medium. We then apply DPO with a carefully crafted pairwise dataset to reduce toxicity. We examine how the resulting model averts toxic outputs, and find that capabilities learned from pre-training are not removed, but rather bypassed. We use this insight to demonstrate a simple method to un-align the model, reverting it back to its toxic behavior.

Via

Access Paper or Ask Questions

AI Alignment in the Design of Interactive AI: Specification Alignment, Process Alignment, and Evaluation Support

Oct 23, 2023

Michael Terry, Chinmay Kulkarni, Martin Wattenberg, Lucas Dixon, Meredith Ringel Morris

Figure 1 for AI Alignment in the Design of Interactive AI: Specification Alignment, Process Alignment, and Evaluation Support

Figure 2 for AI Alignment in the Design of Interactive AI: Specification Alignment, Process Alignment, and Evaluation Support

Figure 3 for AI Alignment in the Design of Interactive AI: Specification Alignment, Process Alignment, and Evaluation Support

Abstract:AI alignment considers the overall problem of ensuring an AI produces desired outcomes, without undesirable side effects. While often considered from the perspectives of safety and human values, AI alignment can also be considered in the context of designing and evaluating interfaces for interactive AI systems. This paper maps concepts from AI alignment onto a basic, three step interaction cycle, yielding a corresponding set of alignment objectives: 1) specification alignment: ensuring the user can efficiently and reliably communicate objectives to the AI, 2) process alignment: providing the ability to verify and optionally control the AI's execution process, and 3) evaluation support: ensuring the user can verify and understand the AI's output. We also introduce the concepts of a surrogate process, defined as a simplified, separately derived, but controllable representation of the AI's actual process; and the notion of a Process Gulf, which highlights how differences between human and AI processes can lead to challenges in AI control. To illustrate the value of this framework, we describe commercial and research systems along each of the three alignment dimensions, and show how interfaces that provide interactive alignment mechanisms can lead to qualitatively different and improved user experiences.

Via

Access Paper or Ask Questions

ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing

Sep 17, 2023

Ian Arawjo, Chelse Swoopes, Priyan Vaithilingam, Martin Wattenberg, Elena Glassman

Figure 1 for ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing

Figure 2 for ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing

Figure 3 for ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing

Figure 4 for ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing

Abstract:Evaluating outputs of large language models (LLMs) is challenging, requiring making -- and making sense of -- many responses. Yet tools that go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or are closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text generation LLMs. ChainForge provides a graphical interface for comparison of responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in real-world settings. We identify three modes of prompt engineering and LLM hypothesis testing: opportunistic exploration, limited evaluation, and iterative refinement.

* 23 pages, 7 figures, in submission

Via

Access Paper or Ask Questions

Emergent Linear Representations in World Models of Self-Supervised Sequence Models

Sep 07, 2023

Neel Nanda, Andrew Lee, Martin Wattenberg

Abstract:How do sequence models represent their decision-making process? Prior work suggests that Othello-playing neural network learned nonlinear models of the board state (Li et al., 2023). In this work, we provide evidence of a closely related linear representation of the board. In particular, we show that probing for "my colour" vs. "opponent's colour" may be a simple yet powerful way to interpret the model's internal state. This precise understanding of the internal representations allows us to control the model's behaviour with simple vector arithmetic. Linear representations enable significant interpretability progress, which we demonstrate with further exploration of how the world model is computed.

Via

Access Paper or Ask Questions

Linearity of Relation Decoding in Transformer Language Models

Aug 17, 2023

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, David Bau

Abstract:Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations: relations between words and their synonyms, entities and their attributes, etc. We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation. Linear relation representations may be obtained by constructing a first-order approximation to the LM from a single prompt, and they exist for a variety of factual, commonsense, and linguistic relations. However, we also identify many cases in which LM predictions capture relational knowledge accurately, but this knowledge is not linearly encoded in their representations. Our results thus reveal a simple, interpretable, but heterogeneously deployed knowledge representation strategy in transformer LMs.

Via

Access Paper or Ask Questions

Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model

Jun 09, 2023

Yida Chen, Fernanda Viégas, Martin Wattenberg

Figure 1 for Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model

Figure 2 for Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model

Figure 3 for Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model

Figure 4 for Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model

Abstract:Latent diffusion models (LDMs) exhibit an impressive ability to produce realistic images, yet the inner workings of these models remain mysterious. Even when trained purely on images without explicit depth information, they typically output coherent pictures of 3D scenes. In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry? Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction. These representations appear surprisingly early in the denoising process$-$well before a human can easily make sense of the noisy images. Intervention experiments further indicate these representations play a causal role in image synthesis, and may be used for simple high-level editing of an LDM's output.

* 17 pages, 13 figures

Via

Access Paper or Ask Questions