Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jan Leike

Tony

Hidden Incentives for Auto-Induced Distributional Shift

Sep 19, 2020

David Krueger, Tegan Maharaj, Jan Leike

Figure 1 for Hidden Incentives for Auto-Induced Distributional Shift

Figure 2 for Hidden Incentives for Auto-Induced Distributional Shift

Figure 3 for Hidden Incentives for Auto-Induced Distributional Shift

Figure 4 for Hidden Incentives for Auto-Induced Distributional Shift

Abstract:Decisions made by machine learning systems have increasing influence on the world, yet it is common for machine learning algorithms to assume that no such influence exists. An example is the use of the i.i.d. assumption in content recommendation. In fact, the (choice of) content displayed can change users' perceptions and preferences, or even drive them away, causing a shift in the distribution of users. We introduce the term auto-induced distributional shift (ADS) to describe the phenomenon of an algorithm causing a change in the distribution of its own inputs. Our goal is to ensure that machine learning systems do not leverage ADS to increase performance when doing so could be undesirable. We demonstrate that changes to the learning algorithm, such as the introduction of meta-learning, can cause hidden incentives for auto-induced distributional shift (HI-ADS) to be revealed. To address this issue, we introduce `unit tests' and a mitigation strategy for HI-ADS, as well as a toy environment for modelling real-world issues with HI-ADS in content recommendation, where we demonstrate that strong meta-learners achieve gains in performance via ADS. We show meta-learning and Q-learning both sometimes fail unit tests, but pass when using our mitigation strategy.

Via

Access Paper or Ask Questions

Quantifying Differences in Reward Functions

Jun 24, 2020

Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell, Jan Leike

Figure 1 for Quantifying Differences in Reward Functions

Figure 2 for Quantifying Differences in Reward Functions

Figure 3 for Quantifying Differences in Reward Functions

Figure 4 for Quantifying Differences in Reward Functions

Abstract:For many tasks, the reward function is too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by examining rollouts from a policy optimized for the learned reward. However, this method cannot distinguish between the learned reward function failing to reflect user preferences, and the reinforcement learning algorithm failing to optimize the learned reward. Moreover, the rollout method is highly sensitive to details of the environment the learned reward is evaluated in, which often differ in the deployment environment. To address these problems, we introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly, without training a policy. We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy. Furthermore, we find EPIC can be precisely approximated and is more robust than baselines to the choice of visitation distribution. Finally, we find that the EPIC distance of learned reward functions to the ground-truth reward is predictive of the success of training a policy, even in different transition dynamics.

* 8 pages main paper, 29 pages total

Via

Access Paper or Ask Questions

Pitfalls of learning a reward function online

Apr 28, 2020

Stuart Armstrong, Jan Leike, Laurent Orseau, Shane Legg

Figure 1 for Pitfalls of learning a reward function online

Figure 2 for Pitfalls of learning a reward function online

Figure 3 for Pitfalls of learning a reward function online

Figure 4 for Pitfalls of learning a reward function online

Abstract:In some agent designs like inverse reinforcement learning an agent needs to learn its own reward function. Learning the reward function and optimising for it are typically two different processes, usually performed at different stages. We consider a continual (``one life'') learning approach where the agent both learns the reward function and optimises for it at the same time. We show that this comes with a number of pitfalls, such as deliberately manipulating the learning process in one direction, refusing to learn, ``learning'' facts already known to the agent, and making decisions that are strictly dominated (for all relevant reward functions). We formally introduce two desirable properties: the first is `unriggability', which prevents the agent from steering the learning process in the direction of a reward function that is easier to optimise. The second is `uninfluenceability', whereby the reward-function learning process operates by learning facts about the environment. We show that an uninfluenceable process is automatically unriggable, and if the set of possible environments is sufficiently rich, the converse is true too.

Via

Access Paper or Ask Questions

Learning Human Objectives by Evaluating Hypothetical Behavior

Dec 05, 2019

Siddharth Reddy, Anca D. Dragan, Sergey Levine, Shane Legg, Jan Leike

Figure 1 for Learning Human Objectives by Evaluating Hypothetical Behavior

Figure 2 for Learning Human Objectives by Evaluating Hypothetical Behavior

Figure 3 for Learning Human Objectives by Evaluating Hypothetical Behavior

Figure 4 for Learning Human Objectives by Evaluating Hypothetical Behavior

Abstract:We seek to align agent behavior with a user's objectives in a reinforcement learning setting with unknown dynamics, an unknown reward function, and unknown unsafe states. The user knows the rewards and unsafe states, but querying the user is expensive. To address this challenge, we propose an algorithm that safely and interactively learns a model of the user's reward function. We start with a generative model of initial states and a forward dynamics model trained on off-policy data. Our method uses these models to synthesize hypothetical behaviors, asks the user to label the behaviors with rewards, and trains a neural network to predict the rewards. The key idea is to actively synthesize the hypothetical behaviors from scratch by maximizing tractable proxies for the value of information, without interacting with the environment. We call this method reward query synthesis via trajectory optimization (ReQueST). We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. The results show that ReQueST significantly outperforms prior methods in learning reward models that transfer to new environments with different initial state distributions. Moreover, ReQueST safely trains the reward model to detect unsafe states, and corrects reward hacking before deploying the agent.

Via

Access Paper or Ask Questions

Scaling shared model governance via model splitting

Dec 14, 2018

Miljan Martic, Jan Leike, Andrew Trask, Matteo Hessel, Shane Legg, Pushmeet Kohli

Figure 1 for Scaling shared model governance via model splitting

Figure 2 for Scaling shared model governance via model splitting

Figure 3 for Scaling shared model governance via model splitting

Figure 4 for Scaling shared model governance via model splitting

Abstract:Currently the only techniques for sharing governance of a deep learning model are homomorphic encryption and secure multiparty computation. Unfortunately, neither of these techniques is applicable to the training of large neural networks due to their large computational and communication overheads. As a scalable technique for shared model governance, we propose splitting deep learning model between multiple parties. This paper empirically investigates the security guarantee of this technique, which is introduced as the problem of model completion: Given the entire training data set or an environment simulator, and a subset of the parameters of a trained deep learning model, how much training is required to recover the model's original performance? We define a metric for evaluating the hardness of the model completion problem and study it empirically in both supervised learning on ImageNet and reinforcement learning on Atari and DeepMind~Lab. Our experiments show that (1) the model completion problem is harder in reinforcement learning than in supervised learning because of the unavailability of the trained agent's trajectories, and (2) its hardness depends not primarily on the number of parameters of the missing part, but more so on their type and location. Our results suggest that model splitting might be a feasible technique for shared model governance in some settings where training is very expensive.

* 9 pages

Via

Access Paper or Ask Questions

Scalable agent alignment via reward modeling: a research direction

Nov 19, 2018

Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg

Figure 1 for Scalable agent alignment via reward modeling: a research direction

Figure 2 for Scalable agent alignment via reward modeling: a research direction

Figure 3 for Scalable agent alignment via reward modeling: a research direction

Figure 4 for Scalable agent alignment via reward modeling: a research direction

Abstract:One obstacle to applying reinforcement learning algorithms to real-world problems is the lack of suitable reward functions. Designing such reward functions is difficult in part because the user only has an implicit understanding of the task objective. This gives rise to the agent alignment problem: how do we create agents that behave in accordance with the user's intentions? We outline a high-level research direction to solve the agent alignment problem centered around reward modeling: learning a reward function from interaction with the user and optimizing the learned reward function with reinforcement learning. We discuss the key challenges we expect to face when scaling reward modeling to complex and general domains, concrete approaches to mitigate these challenges, and ways to establish trust in the resulting agents.

Via

Access Paper or Ask Questions

Reward learning from human preferences and demonstrations in Atari

Nov 15, 2018

Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, Dario Amodei

Figure 1 for Reward learning from human preferences and demonstrations in Atari

Figure 2 for Reward learning from human preferences and demonstrations in Atari

Figure 3 for Reward learning from human preferences and demonstrations in Atari

Figure 4 for Reward learning from human preferences and demonstrations in Atari

Abstract:To solve complex real-world problems with reinforcement learning, we cannot rely on manually specified reward functions. Instead, we can have humans communicate an objective to the agent directly. In this work, we combine two approaches to learning from human feedback: expert demonstrations and trajectory preferences. We train a deep neural network to model the reward function and use its predicted reward to train an DQN-based deep reinforcement learning agent on 9 Atari games. Our approach beats the imitation learning baseline in 7 games and achieves strictly superhuman performance on 2 games without using game rewards. Additionally, we investigate the goodness of fit of the reward model, present some reward hacking problems, and study the effects of noise in the human labels.

* NIPS 2018

Via

Access Paper or Ask Questions

Learning to Understand Goal Specifications by Modelling Reward

Oct 02, 2018

Dzmitry Bahdanau, Felix Hill, Jan Leike, Edward Hughes, Pushmeet Kohli, Edward Grefenstette

Figure 1 for Learning to Understand Goal Specifications by Modelling Reward

Figure 2 for Learning to Understand Goal Specifications by Modelling Reward

Figure 3 for Learning to Understand Goal Specifications by Modelling Reward

Figure 4 for Learning to Understand Goal Specifications by Modelling Reward

Abstract:Recent work has shown that deep reinforcement-learning agents can learn to follow language-like instructions from infrequent environment rewards. However, this places on environment designers the onus of designing language-conditional reward functions which may not be easily or tractably implemented as the complexity of the environment and the language scales. To overcome this limitation, we present a framework within which instruction-conditional RL agents are trained using rewards obtained not from the environment, but from reward models which are jointly trained from expert examples. As reward models improve, they learn to accurately reward agents for completing tasks for environment configurations---and for instructions---not present amongst the expert data. This framework effectively separates the representation of what instructions require from how they can be executed. In a simple grid world, it enables an agent to learn a range of commands requiring interaction with blocks and understanding of spatial relations and underspecified abstract arrangements. We further show the method allows our agent to adapt to changes in the environment without requiring new expert examples.

* 18 pages, 8 figures

Via

Access Paper or Ask Questions

AI Safety Gridworlds

Nov 28, 2017

Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, Shane Legg

Abstract:We present a suite of reinforcement learning environments illustrating various safety properties of intelligent agents. These problems include safe interruptibility, avoiding side effects, absent supervisor, reward gaming, safe exploration, as well as robustness to self-modification, distributional shift, and adversaries. To measure compliance with the intended safe behavior, we equip each environment with a performance function that is hidden from the agent. This allows us to categorize AI safety problems into robustness and specification problems, depending on whether the performance function corresponds to the observed reward function. We evaluate A2C and Rainbow, two recent deep reinforcement learning agents, on our environments and show that they are not able to solve them satisfactorily.

Via

Access Paper or Ask Questions

Deep reinforcement learning from human preferences

Jul 13, 2017

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei

Figure 1 for Deep reinforcement learning from human preferences

Figure 2 for Deep reinforcement learning from human preferences

Figure 3 for Deep reinforcement learning from human preferences

Figure 4 for Deep reinforcement learning from human preferences

Abstract:For sophisticated reinforcement learning (RL) systems to interact usefully with real-world environments, we need to communicate complex goals to these systems. In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments. We show that this approach can effectively solve complex RL tasks without access to the reward function, including Atari games and simulated robot locomotion, while providing feedback on less than one percent of our agent's interactions with the environment. This reduces the cost of human oversight far enough that it can be practically applied to state-of-the-art RL systems. To demonstrate the flexibility of our approach, we show that we can successfully train complex novel behaviors with about an hour of human time. These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.

Via

Access Paper or Ask Questions