Joshua Achiam

A Hazard Analysis Framework for Code Synthesis Large Language Models

Jul 25, 2022
Heidy Khlaaf, Pamela Mishkin, Joshua Achiam, Gretchen Krueger, Miles Brundage

Codex, a large language model (LLM) trained on a variety of codebases, exceeds the previous state of the art in its capacity to synthesize and generate code. Although Codex provides a plethora of benefits, models that may generate code at such scale have significant limitations, alignment problems, the potential to be misused, and the potential to increase the rate of progress in technical fields that may themselves have destabilizing impacts or misuse potential. Yet these safety impacts are not yet known or remain to be explored. In this paper, we outline a hazard analysis framework constructed at OpenAI to uncover hazards or safety risks that the deployment of models like Codex may impose technically, socially, politically, and economically. The analysis is informed by a novel evaluation framework that measures the capacity of advanced code generation techniques against the complexity and expressivity of specification prompts, and their capability to understand and execute them relative to human ability.

Responsive Safety in Reinforcement Learning by PID Lagrangian Methods

Jul 08, 2020
Adam Stooke, Joshua Achiam, Pieter Abbeel

Lagrangian methods are widely used algorithms for constrained optimization problems, but their learning dynamics exhibit oscillations and overshoot which, when applied to safe reinforcement learning, lead to constraint-violating behavior during agent training. We address this shortcoming by proposing a novel Lagrange multiplier update method that utilizes derivatives of the constraint function. We take a controls perspective, wherein the traditional Lagrange multiplier update behaves as integral control; our terms introduce proportional and derivative control, achieving favorable learning dynamics through damping and predictive measures. We apply our PID Lagrangian methods in deep RL, setting a new state of the art in Safety Gym, a safe RL benchmark. Lastly, we introduce a new method to ease controller tuning by providing invariance to the relative numerical scales of reward and cost. Our extensive experiments demonstrate improved performance and hyperparameter robustness, while our algorithms remain nearly as simple to derive and implement as the traditional Lagrangian approach.

* ICML 2020 
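
As a rough illustration of the multiplier update described above, here is a minimal sketch that adds proportional and derivative terms to the usual integral-style Lagrange multiplier update. The class name, gains, smoothing constant, and clamping details are illustrative assumptions, not the paper's tuned settings.

```python
class PIDLagrangeMultiplier:
    """Sketch of a PID-style Lagrange multiplier for safe RL.

    The classic Lagrangian update integrates the constraint violation
    (episode cost minus cost limit); the proportional and derivative
    terms below are meant to damp the oscillation and overshoot that
    pure integral control exhibits.
    """

    def __init__(self, cost_limit, kp=0.1, ki=0.01, kd=0.1, ema=0.9):
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.ema = ema           # smoothing for the derivative estimate
        self.integral = 0.0      # accumulated violation (the classic term)
        self.prev_cost = 0.0     # smoothed cost from the previous update

    def update(self, episode_cost):
        error = episode_cost - self.cost_limit                 # constraint violation
        self.integral = max(0.0, self.integral + self.ki * error)
        derivative = max(0.0, episode_cost - self.prev_cost)   # penalize rising cost only
        self.prev_cost = self.ema * self.prev_cost + (1.0 - self.ema) * episode_cost
        # Multiplier is kept non-negative, as in projected dual ascent.
        return max(0.0, self.kp * error + self.integral + self.kd * derivative)


# Schematic use in a policy-gradient loop (surrogate losses are hypothetical names):
# lam = pid.update(mean_episode_cost)
# loss = -reward_surrogate + lam * cost_surrogate
```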

Towards Characterizing Divergence in Deep Q-Learning

Mar 21, 2019
Joshua Achiam, Ethan Knight, Pieter Abbeel

Deep Q-Learning (DQL), a family of temporal difference algorithms for control, employs three techniques collectively known as the 'deadly triad' in reinforcement learning: bootstrapping, off-policy learning, and function approximation. Prior work has demonstrated that together these can lead to divergence in Q-learning algorithms, but the conditions under which divergence occurs are not well-understood. In this note, we give a simple analysis based on a linear approximation to the Q-value updates, which we believe provides insight into divergence under the deadly triad. The central point in our analysis is to consider when the leading order approximation to the deep-Q update is or is not a contraction in the sup norm. Based on this analysis, we develop an algorithm which permits stable deep Q-learning for continuous control without any of the tricks conventionally used (such as target networks, adaptive gradient optimizers, or using multiple Q functions). We demonstrate that our algorithm performs above or near state-of-the-art on standard MuJoCo benchmarks from the OpenAI Gym.
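
For orientation, the display below reconstructs the kind of leading-order (first-order in the step size) expansion such an analysis works with; the symbols (step size alpha, sampling distribution rho, Bellman backup, and the gradient-inner-product kernel k) are a reconstruction from the abstract, not quoted from the paper.

```latex
\[
Q_{\theta'}(\bar{s},\bar{a}) \;\approx\; Q_{\theta}(\bar{s},\bar{a})
  + \alpha \sum_{(s,a)} \rho(s,a)\,
    k_{\theta}\big((\bar{s},\bar{a}),(s,a)\big)\,
    \big( \mathcal{T}Q_{\theta}(s,a) - Q_{\theta}(s,a) \big),
\qquad
k_{\theta}(x,\bar{x}) = \nabla_{\theta} Q_{\theta}(x)^{\top}\, \nabla_{\theta} Q_{\theta}(\bar{x}).
\]
```

To this order the update acts linearly on the Q-values through the kernel and the sampling distribution, and the divergence question becomes whether that linear operator is a contraction in the sup norm.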

On First-Order Meta-Learning Algorithms

Oct 22, 2018
Alex Nichol, Joshua Achiam, John Schulman

This paper considers meta-learning problems, where there is a distribution of tasks, and we would like to obtain an agent that performs well (i.e., learns quickly) when presented with a previously unseen task sampled from this distribution. We analyze a family of algorithms for learning a parameter initialization that can be fine-tuned quickly on a new task, using only first-order derivatives for the meta-learning updates. This family includes and generalizes first-order MAML, an approximation to MAML obtained by ignoring second-order derivatives. It also includes Reptile, a new algorithm that we introduce here, which works by repeatedly sampling a task, training on it, and moving the initialization towards the trained weights on that task. We expand on the results from Finn et al. showing that first-order meta-learning algorithms perform well on some well-established benchmarks for few-shot classification, and we provide theoretical analysis aimed at understanding why these algorithms work.
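
A minimal sketch of the Reptile outer loop described above, using a toy 1-D sine-regression task family, a fixed feature map, and arbitrary learning rates purely for illustration (the paper's experiments are few-shot classification):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    # Toy task family: 1-D sine regression with per-task amplitude and phase.
    amp, phase = rng.uniform(0.1, 5.0), rng.uniform(0.0, np.pi)
    return lambda x: amp * np.sin(x + phase)

def features(x, n=20):
    # Fixed radial-basis features so a linear model can fit the sines.
    centers = np.linspace(-5, 5, n)
    return np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

def inner_train(w, task, steps=32, lr=0.05, batch=10):
    # A few steps of SGD on one task, starting from the shared initialization w.
    w = w.copy()
    for _ in range(steps):
        x = rng.uniform(-5, 5, size=batch)
        phi_x, y = features(x), task(x)
        grad = phi_x.T @ (phi_x @ w - y) / batch   # squared-error gradient
        w -= lr * grad
    return w

def reptile(meta_iters=2000, outer_lr=0.1, n_features=20):
    # Reptile outer loop: sample a task, adapt to it, then move the
    # initialization a fraction of the way toward the adapted weights.
    w0 = np.zeros(n_features)                      # must match the feature map size
    for _ in range(meta_iters):
        task = sample_task()
        w_adapted = inner_train(w0, task)
        w0 += outer_lr * (w_adapted - w0)          # the Reptile step
    return w0
```

The meta-update is just a step from the current initialization toward the task-adapted weights, so only first-order gradients are ever required.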

Variational Option Discovery Algorithms

Jul 26, 2018
Joshua Achiam, Harrison Edwards, Dario Amodei, Pieter Abbeel

We explore methods for option discovery based on variational inference and make two algorithmic contributions. First: we highlight a tight connection between variational option discovery methods and variational autoencoders, and introduce Variational Autoencoding Learning of Options by Reinforcement (VALOR), a new method derived from the connection. In VALOR, the policy encodes contexts from a noise distribution into trajectories, and the decoder recovers the contexts from the complete trajectories. Second: we propose a curriculum learning approach where the number of contexts seen by the agent increases whenever the agent's performance is strong enough (as measured by the decoder) on the current set of contexts. We show that this simple trick stabilizes training for VALOR and prior variational option discovery methods, allowing a single agent to learn many more modes of behavior than it could with a fixed context distribution. Finally, we investigate other topics related to variational option discovery, including fundamental limitations of the general approach and the applicability of learned options to downstream tasks.
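
The curriculum trick is simple enough to sketch; the threshold, growth factor, cap, and the helper names in the commented loop are illustrative assumptions rather than the paper's settings:

```python
def update_context_count(k, k_max, decoder_accuracy, threshold=0.9, growth=1.5):
    """Curriculum rule sketched from the abstract: once the decoder can
    reliably recover the current contexts from trajectories, expand the
    context distribution (up to a cap)."""
    if decoder_accuracy >= threshold:
        k = min(int(growth * k) + 1, k_max)
    return k


# Schematic use inside training (trainer/evaluator are hypothetical names):
# k = 2
# for epoch in range(num_epochs):
#     train_policy_and_decoder(num_contexts=k)
#     acc = evaluate_decoder(num_contexts=k)
#     k = update_context_count(k, k_max=64, decoder_accuracy=acc)
```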

Constrained Policy Optimization

May 30, 2017
Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel

For many applications of reinforcement learning it can be more convenient to specify both a reward function and constraints, rather than trying to design behavior through the reward function. For example, systems that physically interact with or around humans should satisfy safety constraints. Recent advances in policy search algorithms (Mnih et al., 2016, Schulman et al., 2015, Lillicrap et al., 2016, Levine et al., 2016) have enabled new capabilities in high-dimensional control, but do not consider the constrained setting. We propose Constrained Policy Optimization (CPO), the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration. Our method allows us to train neural network policies for high-dimensional control while making guarantees about policy behavior all throughout training. Our guarantees are based on a new theoretical result, which is of independent interest: we prove a bound relating the expected returns of two policies to an average divergence between them. We demonstrate the effectiveness of our approach on simulated robot locomotion tasks where the agent must satisfy constraints motivated by safety.

* Accepted to ICML 2017 
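
Schematically, each CPO iteration solves a constrained, trust-region-style update of roughly the following form (notation reconstructed from the abstract and standard trust-region formulations; d is the cost limit and delta the trust-region radius):

```latex
\[
\pi_{k+1} = \arg\max_{\pi \in \Pi_\theta}
    \; \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi}\!\left[ A^{\pi_k}(s,a) \right]
\quad \text{s.t.} \quad
J_{C}(\pi_k) + \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d^{\pi_k},\, a \sim \pi}\!\left[ A^{\pi_k}_{C}(s,a) \right] \le d,
\qquad
\bar{D}_{\mathrm{KL}}(\pi \,\|\, \pi_k) \le \delta .
\]
```

Both the reward objective and the cost constraint are surrogate estimates built from advantages under the current policy, which is what supports the near-constraint-satisfaction guarantee at each iteration.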

Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning

Mar 06, 2017
Joshua Achiam, Shankar Sastry

Exploration in complex domains is a key challenge in reinforcement learning, especially for tasks with very sparse rewards. Recent successes in deep reinforcement learning have been achieved mostly using simple heuristic exploration strategies such as $\epsilon$-greedy action selection or Gaussian control noise, but there are many tasks where these methods are insufficient to make any learning progress. Here, we consider more complex heuristics: efficient and scalable exploration strategies that maximize a notion of an agent's surprise about its experiences via intrinsic motivation. We propose to learn a model of the MDP transition probabilities concurrently with the policy, and to form intrinsic rewards that approximate the KL-divergence of the true transition probabilities from the learned model. One of our approximations results in using surprisal as intrinsic motivation, while the other gives the $k$-step learning progress. We show that our incentives enable agents to succeed in a wide range of environments with high-dimensional state spaces and very sparse rewards, including continuous control tasks and games in the Atari RAM domain, outperforming several other heuristic exploration techniques.

* Appeared in Deep RL Workshop at NIPS 2016 
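
As a concrete reading of the surprisal variant, the sketch below scores a transition under a learned diagonal-Gaussian dynamics model and adds the scaled surprisal to the environment reward; the Gaussian parameterization and the scaling coefficient are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def gaussian_surprisal(next_state, pred_mean, pred_logvar):
    """Surprisal -log p_phi(s' | s, a) under a learned diagonal-Gaussian
    dynamics model whose head outputs a mean and log-variance per state dim."""
    var = np.exp(pred_logvar)
    log_prob = -0.5 * np.sum(
        pred_logvar + np.log(2.0 * np.pi) + (next_state - pred_mean) ** 2 / var
    )
    return -log_prob

def shaped_reward(extrinsic_reward, next_state, pred_mean, pred_logvar, eta=0.1):
    # Environment reward plus a scaled surprisal bonus as intrinsic motivation.
    return extrinsic_reward + eta * gaussian_surprisal(next_state, pred_mean, pred_logvar)

# pred_mean and pred_logvar would come from a dynamics model trained on
# (s, a, s') transitions concurrently with the policy, as described above.
```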

Easy Monotonic Policy Iteration

Feb 29, 2016
Joshua Achiam

A key problem in reinforcement learning for control with general function approximators (such as deep neural networks and other nonlinear functions) is that, for many algorithms employed in practice, updates to the policy or $Q$-function may fail to improve performance, or worse, actually cause the policy performance to degrade. Prior work has addressed this for policy iteration by deriving tight policy improvement bounds; by optimizing the lower bound on policy improvement, a better policy is guaranteed. However, existing approaches suffer from bounds that are hard to optimize in practice because they include sup norm terms which cannot be efficiently estimated or differentiated. In this work, we derive a better policy improvement bound where the sup norm of the policy divergence has been replaced with an average divergence; this leads to an algorithm, Easy Monotonic Policy Iteration, that generates sequences of policies with guaranteed non-decreasing returns and is easy to implement in a sample-based framework.
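
The bound being optimized is, schematically, of the following form, where the penalty coefficient C depends on the discount factor and the scale of the advantages (the exact constants are in the paper; this display is a reconstruction, not a quotation):

```latex
\[
J(\pi') - J(\pi) \;\ge\; \frac{1}{1-\gamma}\,
  \mathbb{E}_{s \sim d^{\pi},\, a \sim \pi'}\!\left[ A^{\pi}(s,a) \right]
  \;-\; C \,\mathbb{E}_{s \sim d^{\pi}}\!\left[
      D_{\mathrm{TV}}\big(\pi'(\cdot \mid s) \,\|\, \pi(\cdot \mid s)\big) \right].
\]
```

Because the divergence penalty is an expectation over states visited by the current policy rather than a supremum over all states, it can be estimated and differentiated from samples, which is what makes the monotonic improvement procedure practical.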
