
Stuart Russell

Berkeley

Social Choice for AI Alignment: Dealing with Diverse Human Feedback

Apr 16, 2024

When Your AIs Deceive You: Challenges with Partial Observability of Human Evaluators in Reward Learning

Mar 03, 2024

Avoiding Catastrophe in Continuous Spaces by Asking for Help

Feb 12, 2024

ALMANACS: A Simulatability Benchmark for Language Model Explainability

Dec 20, 2023

The Effective Horizon Explains Deep RL Performance in Stochastic Environments

Dec 13, 2023

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

Nov 02, 2023

Managing AI Risks in an Era of Rapid Progress

Oct 26, 2023

Active teacher selection for reinforcement learning from human feedback

Oct 23, 2023

On Representation Complexity of Model-based and Model-free Reinforcement Learning

Oct 03, 2023

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Sep 18, 2023