Pieter Abbeel

UC Berkeley

Soft Actor-Critic Algorithms and Applications

Jan 29, 2019

Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel (+1 more)

Abstract:Model-free deep reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision making and control tasks. However, these methods typically suffer from two major challenges: high sample complexity and brittleness to hyperparameters. Both of these challenges limit the applicability of such methods to real-world domains. In this paper, we describe Soft Actor-Critic (SAC), our recently introduced off-policy actor-critic algorithm based on the maximum entropy RL framework. In this framework, the actor aims to simultaneously maximize expected return and entropy. That is, to succeed at the task while acting as randomly as possible. We extend SAC to incorporate a number of modifications that accelerate training and improve stability with respect to the hyperparameters, including a constrained formulation that automatically tunes the temperature hyperparameter. We systematically evaluate SAC on a range of benchmark tasks, as well as real-world challenging tasks such as locomotion for a quadrupedal robot and robotic manipulation with a dexterous hand. With these improvements, SAC achieves state-of-the-art performance, outperforming prior on-policy and off-policy methods in sample-efficiency and asymptotic performance. Furthermore, we demonstrate that, in contrast to other off-policy algorithms, our approach is very stable, achieving similar performance across different random seeds. These results suggest that SAC is a promising candidate for learning in real-world robotics tasks.

* arXiv admin note: substantial text overlap with arXiv:1801.01290
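The maximum entropy objective described in the abstract (maximize expected return and entropy together) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a discrete action space for simplicity, and the names `entropy`, `max_entropy_return`, `alpha` (the temperature), and `gamma` (the discount) are hypothetical choices made for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_return(rewards, action_probs, alpha=0.2, gamma=0.99):
    """Discounted return with an entropy bonus added at every step.

    rewards:      list of per-step rewards r_t
    action_probs: list of per-step action distributions pi(.|s_t)
    alpha:        temperature weighting the entropy bonus (assumed value)
    gamma:        discount factor (assumed value)
    """
    total = 0.0
    for t, (r, probs) in enumerate(zip(rewards, action_probs)):
        total += (gamma ** t) * (r + alpha * entropy(probs))
    return total


# A uniform policy earns an entropy bonus; a deterministic one does not.
rewards = [1.0, 1.0]
policies = [[0.5, 0.5], [1.0, 0.0]]
print(max_entropy_return(rewards, policies, alpha=1.0, gamma=1.0))
```

With `alpha=0` the function reduces to the ordinary discounted return, which is why the temperature controls the trade-off between reward and randomness. The abstract's constrained formulation tunes this temperature automatically; that step is not shown here.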

Guiding Policies with Language via Meta-Learning

Nov 19, 2018

John D. Co-Reyes, Abhishek Gupta, Suvansh Sanjeev, Nick Altieri, John DeNero, Pieter Abbeel, Sergey Levine

Abstract:Behavioral skills or policies for autonomous agents are conventionally learned from reward functions, via reinforcement learning, or from demonstrations, via imitation learning. However, both modes of task specification have their disadvantages: reward functions require manual engineering, while demonstrations require a human expert to be able to actually perform the task in order to generate the demonstration. Instruction following from natural language instructions provides an appealing alternative: in the same way that we can specify goals to other humans simply by speaking or writing, we would like to be able to specify tasks for our machines. However, a single instruction may be insufficient to fully communicate our intent or, even if it is, may be insufficient for an autonomous agent to actually understand how to perform the desired task. In this work, we propose an interactive formulation of the task specification problem, where iterative language corrections are provided to an autonomous agent, guiding it in acquiring the desired skill. Our proposed language-guided policy learning algorithm can integrate an instruction and a sequence of corrections to acquire new skills very quickly. In our experiments, we show that this method can enable a policy to follow instructions and corrections for simulated navigation and manipulation tasks, substantially outperforming direct, non-interactive instruction following.

An Algorithmic Perspective on Imitation Learning

Nov 16, 2018

Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, Jan Peters

Abstract:As robots and other intelligent agents move from simple environments and problems to more complex, unstructured settings, manually programming their behavior has become increasingly challenging and expensive. Often, it is easier for a teacher to demonstrate a desired behavior rather than attempt to manually engineer it. This process of learning from demonstrations, and the study of algorithms to do so, is called imitation learning. This work provides an introduction to imitation learning. It covers the underlying assumptions, approaches, and how they relate; the rich set of algorithms developed to tackle the problem; and advice on effective tools and implementation. We intend this paper to serve two audiences. First, we want to familiarize machine learning experts with the challenges of imitation learning, particularly those arising in robotics, and the interesting theoretical and practical distinctions between it and more familiar frameworks like statistical supervised learning theory and reinforcement learning. Second, we want to give roboticists and experts in applied artificial intelligence a broader appreciation for the frameworks and tools available for imitation learning.

* 187 pages. Published in Foundations and Trends in Robotics

Modular Architecture for StarCraft II with Deep Reinforcement Learning

Nov 08, 2018

Dennis Lee, Haoran Tang, Jeffrey O Zhang, Huazhe Xu, Trevor Darrell, Pieter Abbeel

Abstract:We present a novel modular architecture for StarCraft II AI. The architecture splits responsibilities between multiple modules that each control one aspect of the game, such as build-order selection or tactics. A centralized scheduler reviews macros suggested by all modules and decides their order of execution. An updater keeps track of environment changes and instantiates macros into series of executable actions. Modules in this framework can be optimized independently or jointly via human design, planning, or reinforcement learning. We apply deep reinforcement learning techniques to training two out of six modules of a modular agent with self-play, achieving 94% or 87% win rates against the "Harder" (level 5) built-in Blizzard bot in Zerg vs. Zerg matches, with or without fog-of-war.

* Accepted to The 14th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE'18)

One-Shot Hierarchical Imitation Learning of Compound Visuomotor Tasks

Oct 25, 2018

Tianhe Yu, Pieter Abbeel, Sergey Levine, Chelsea Finn

Abstract:We consider the problem of learning multi-stage vision-based tasks on a real robot from a single video of a human performing the task, while leveraging demonstration data of subtasks with other objects. This problem presents a number of major challenges. Video demonstrations without teleoperation are easy for humans to provide, but do not provide any direct supervision. Learning policies from raw pixels enables full generality but calls for large function approximators with many parameters to be learned. Finally, compound tasks can require impractical amounts of demonstration data, when treated as a monolithic skill. To address these challenges, we propose a method that learns both how to learn primitive behaviors from video demonstrations and how to dynamically compose these behaviors to perform multi-stage tasks by "watching" a human demonstrator. Our results on a simulated Sawyer robot and real PR2 robot illustrate our method for learning a variety of order fulfillment and kitchen serving tasks with novel objects and raw pixel inputs.

* Video results available at https://sites.google.com/view/one-shot-hil

High-Dimensional Continuous Control Using Generalized Advantage Estimation

Oct 20, 2018

John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel

Abstract:Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(lambda). We address the second challenge by using a trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques. Our algorithm is fully model-free, and the amount of simulated experience required for the learning tasks on 3D bipeds corresponds to 1-2 weeks of real time.
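The exponentially-weighted advantage estimator mentioned in the abstract can be sketched as follows. This is a minimal illustration of the commonly used recursive form of generalized advantage estimation, not code from the paper; the function name `gae_advantages`, the default values of `gamma` and `lam`, and the convention that `values` carries one extra bootstrap entry are assumptions made for this example.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially-weighted advantage estimates for one trajectory.

    rewards: list of T rewards r_0 .. r_{T-1}
    values:  list of T+1 value estimates V(s_0) .. V(s_T); the last entry
             is the bootstrap value (use 0.0 for a terminal state)
    gamma:   discount factor (assumed default)
    lam:     weighting parameter, analogous to lambda in TD(lambda)
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

Setting `lam=0` keeps only the one-step TD residual (lower variance, more bias), while `lam=1` sums all future residuals (higher variance, less bias), which is the bias-variance trade-off the abstract refers to.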

Enabling Robots to Communicate their Objectives

Oct 18, 2018

Sandy H. Huang, David Held, Pieter Abbeel, Anca D. Dragan

Abstract:The overarching goal of this work is to efficiently enable end-users to correctly anticipate a robot's behavior in novel situations. Since a robot's behavior is often a direct result of its underlying objective function, our insight is that end-users need to have an accurate mental model of this objective function in order to understand and predict what the robot will do. While people naturally develop such a mental model over time through observing the robot act, this familiarization process may be lengthy. Our approach reduces this time by having the robot model how people infer objectives from observed behavior, and then it selects those behaviors that are maximally informative. The problem of computing a posterior over objectives from observed behavior is known as Inverse Reinforcement Learning (IRL), and has been applied to robots learning human objectives. We consider the problem where the roles of human and robot are swapped. Our main contribution is to recognize that unlike robots, humans will not be exact in their IRL inference. We thus introduce two factors to define candidate approximate-inference models for human learning in this setting, and analyze them in a user study in the autonomous driving domain. We show that certain approximate-inference models lead to the robot generating example behaviors that better enable users to anticipate what it will do in novel situations. Our results also suggest, however, that additional research is needed in modeling how humans extrapolate from examples of robot behavior.

* RSS 2017

Establishing Appropriate Trust via Critical States

Oct 18, 2018

Sandy H. Huang, Kush Bhatia, Pieter Abbeel, Anca D. Dragan

Abstract:In order to effectively interact with or supervise a robot, humans need to have an accurate mental model of its capabilities and how it acts. Learned neural network policies make that particularly challenging. We propose an approach for helping end-users build a mental model of such policies. Our key observation is that for most tasks, the essence of the policy is captured in a few critical states: states in which it is very important to take a certain action. Our user studies show that if the robot shows a human what its understanding of the task's critical states is, then the human can make a more informed decision about whether to deploy the policy, and if she does deploy it, when she needs to take control from it at execution time.

* IROS 2018

ProMP: Proximal Meta-Policy Search

Oct 17, 2018

Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, Pieter Abbeel

Abstract:Credit assignment in Meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies. This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on the gained insights we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm endows efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.

* The first three authors contributed equally. The paper is under review for ICLR 2019

Composable Action-Conditioned Predictors: Flexible Off-Policy Learning for Robot Navigation

Oct 16, 2018

Gregory Kahn, Adam Villaflor, Pieter Abbeel, Sergey Levine

Abstract:A general-purpose intelligent robot must be able to learn autonomously and be able to accomplish multiple tasks in order to be deployed in the real world. However, standard reinforcement learning approaches learn separate task-specific policies and assume the reward function for each task is known a priori. We propose a framework that learns event cues from off-policy data, and can flexibly combine these event cues at test time to accomplish different tasks. These event cue labels are not assumed to be known a priori, but are instead labeled using learned models, such as computer vision detectors, and then "backed up" in time using an action-conditioned predictive model. We show that a simulated robotic car and a real-world RC car can gather data and train fully autonomously without any human-provided labels beyond those needed to train the detectors, and then at test-time be able to accomplish a variety of different tasks. Videos of the experiments and code can be found at https://github.com/gkahn13/CAPs

* Accepted to the Conference on Robot Learning (CoRL) 2018. Video at https://youtu.be/lOLT7zifEkg
