Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Pieter Abbeel

UC Berkeley

Establishing Appropriate Trust via Critical States

Oct 18, 2018

Sandy H. Huang, Kush Bhatia, Pieter Abbeel, Anca D. Dragan

Figure 1 for Establishing Appropriate Trust via Critical States

Figure 2 for Establishing Appropriate Trust via Critical States

Figure 3 for Establishing Appropriate Trust via Critical States

Figure 4 for Establishing Appropriate Trust via Critical States

Abstract:In order to effectively interact with or supervise a robot, humans need to have an accurate mental model of its capabilities and how it acts. Learned neural network policies make that particularly challenging. We propose an approach for helping end-users build a mental model of such policies. Our key observation is that for most tasks, the essence of the policy is captured in a few critical states: states in which it is very important to take a certain action. Our user studies show that if the robot shows a human what its understanding of the task's critical states is, then the human can make a more informed decision about whether to deploy the policy, and if she does deploy it, when she needs to take control from it at execution time.

* IROS 2018

Via

Access Paper or Ask Questions

ProMP: Proximal Meta-Policy Search

Oct 17, 2018

Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, Pieter Abbeel

Figure 1 for ProMP: Proximal Meta-Policy Search

Figure 2 for ProMP: Proximal Meta-Policy Search

Figure 3 for ProMP: Proximal Meta-Policy Search

Abstract:Credit assignment in Meta-reinforcement learning (Meta-RL) is still poorly understood. Existing methods either neglect credit assignment to pre-adaptation behavior or implement it naively. This leads to poor sample-efficiency during meta-training as well as ineffective task identification strategies. This paper provides a theoretical analysis of credit assignment in gradient-based Meta-RL. Building on the gained insights we develop a novel meta-learning algorithm that overcomes both the issue of poor credit assignment and previous difficulties in estimating meta-policy gradients. By controlling the statistical distance of both pre-adaptation and adapted policies during meta-policy search, the proposed algorithm endows efficient and stable meta-learning. Our approach leads to superior pre-adaptation policy behavior and consistently outperforms previous Meta-RL algorithms in sample-efficiency, wall-clock time, and asymptotic performance.

* The first three authors contributed equally. The paper is under review for ICLR 2019

Via

Access Paper or Ask Questions

Composable Action-Conditioned Predictors: Flexible Off-Policy Learning for Robot Navigation

Oct 16, 2018

Gregory Kahn, Adam Villaflor, Pieter Abbeel, Sergey Levine

Figure 1 for Composable Action-Conditioned Predictors: Flexible Off-Policy Learning for Robot Navigation

Figure 2 for Composable Action-Conditioned Predictors: Flexible Off-Policy Learning for Robot Navigation

Figure 3 for Composable Action-Conditioned Predictors: Flexible Off-Policy Learning for Robot Navigation

Figure 4 for Composable Action-Conditioned Predictors: Flexible Off-Policy Learning for Robot Navigation

Abstract:A general-purpose intelligent robot must be able to learn autonomously and be able to accomplish multiple tasks in order to be deployed in the real world. However, standard reinforcement learning approaches learn separate task-specific policies and assume the reward function for each task is known a priori. We propose a framework that learns event cues from off-policy data, and can flexibly combine these event cues at test time to accomplish different tasks. These event cue labels are not assumed to be known a priori, but are instead labeled using learned models, such as computer vision detectors, and then `backed up' in time using an action-conditioned predictive model. We show that a simulated robotic car and a real-world RC car can gather data and train fully autonomously without any human-provided labels beyond those needed to train the detectors, and then at test-time be able to accomplish a variety of different tasks. Videos of the experiments and code can be found at https://github.com/gkahn13/CAPs

* Accepted to the Conference on Robot Learning (CoRL) 2018. Video at https://youtu.be/lOLT7zifEkg

Via

Access Paper or Ask Questions

SFV: Reinforcement Learning of Physical Skills from Videos

Oct 15, 2018

Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, Sergey Levine

Figure 1 for SFV: Reinforcement Learning of Physical Skills from Videos

Figure 2 for SFV: Reinforcement Learning of Physical Skills from Videos

Figure 3 for SFV: Reinforcement Learning of Physical Skills from Videos

Figure 4 for SFV: Reinforcement Learning of Physical Skills from Videos

Abstract:Data-driven character animation based on motion capture can produce highly naturalistic behaviors and, when combined with physics simulation, can provide for natural procedural responses to physical perturbations, environmental changes, and morphological discrepancies. Motion capture remains the most popular source of motion data, but collecting mocap data typically requires heavily instrumented environments and actors. In this paper, we propose a method that enables physically simulated characters to learn skills from videos (SFV). Our approach, based on deep pose estimation and deep reinforcement learning, allows data-driven animation to leverage the abundance of publicly available video clips from the web, such as those from YouTube. This has the potential to enable fast and easy design of character controllers simply by querying for video recordings of the desired behavior. The resulting controllers are robust to perturbations, can be adapted to new settings, can perform basic object interactions, and can be retargeted to new morphologies via reinforcement learning. We further demonstrate that our method can predict potential human motions from still images, by forward simulation of learned controllers initialized from the observed pose. Our framework is able to learn a broad range of dynamic skills, including locomotion, acrobatics, and martial arts.

Via

Access Paper or Ask Questions

Equivalence Between Policy Gradients and Soft Q-Learning

Oct 14, 2018

John Schulman, Xi Chen, Pieter Abbeel

Figure 1 for Equivalence Between Policy Gradients and Soft Q-Learning

Figure 2 for Equivalence Between Policy Gradients and Soft Q-Learning

Figure 3 for Equivalence Between Policy Gradients and Soft Q-Learning

Abstract:Two of the leading approaches for model-free reinforcement learning are policy gradient methods and $Q$-learning methods. $Q$-learning methods can be effective and sample-efficient when they work, however, it is not well-understood why they work, since empirically, the $Q$-values they estimate are very inaccurate. A partial explanation may be that $Q$-learning methods are secretly implementing policy gradient updates: we show that there is a precise equivalence between $Q$-learning and policy gradient methods in the setting of entropy-regularized reinforcement learning, that "soft" (entropy-regularized) $Q$-learning is exactly equivalent to a policy gradient method. We also point out a connection between $Q$-learning methods and natural policy gradient methods. Experimentally, we explore the entropy-regularized versions of $Q$-learning and policy gradients, and we find them to perform as well as (or slightly better than) the standard variants on the Atari benchmark. We also show that the equivalence holds in practical settings by constructing a $Q$-learning method that closely matches the learning dynamics of A3C without using a target network or $\epsilon$-greedy exploration schedule.

Via

Access Paper or Ask Questions

Model-Ensemble Trust-Region Policy Optimization

Oct 05, 2018

Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, Pieter Abbeel

Figure 1 for Model-Ensemble Trust-Region Policy Optimization

Figure 2 for Model-Ensemble Trust-Region Policy Optimization

Figure 3 for Model-Ensemble Trust-Region Policy Optimization

Figure 4 for Model-Ensemble Trust-Region Policy Optimization

Abstract:Model-free reinforcement learning (RL) methods are succeeding in a growing number of tasks, aided by recent advances in deep learning. However, they tend to suffer from high sample complexity, which hinders their use in real-world domains. Alternatively, model-based reinforcement learning promises to reduce sample complexity, but tends to require careful tuning and to date have succeeded mainly in restrictive domains where simple models are sufficient for learning. In this paper, we analyze the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and show that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training. To overcome this issue, we propose to use an ensemble of models to maintain the model uncertainty and regularize the learning process. We further show that the use of likelihood ratio derivatives yields much more stable learning than backpropagation through time. Altogether, our approach Model-Ensemble Trust-Region Policy Optimization (ME-TRPO) significantly reduces the sample complexity compared to model-free deep RL methods on challenging continuous control benchmark tasks.

Via

Access Paper or Ask Questions

Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow

Oct 01, 2018

Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, Sergey Levine

Figure 1 for Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow

Figure 2 for Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow

Figure 3 for Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow

Figure 4 for Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow

Abstract:Adversarial learning methods have been proposed for a wide range of applications, but the training of adversarial models can be notoriously unstable. Effectively balancing the performance of the generator and discriminator is critical, since a discriminator that achieves very high accuracy will produce relatively uninformative gradients. In this work, we propose a simple and general technique to constrain information flow in the discriminator by means of an information bottleneck. By enforcing a constraint on the mutual information between the observations and the discriminator's internal representation, we can effectively modulate the discriminator's accuracy and maintain useful and informative gradients. We demonstrate that our proposed variational discriminator bottleneck (VDB) leads to significant improvements across three distinct application areas for adversarial learning algorithms. Our primary evaluation studies the applicability of the VDB to imitation learning of dynamic continuous control skills, such as running. We show that our method can learn such skills directly from \emph{raw} video demonstrations, substantially outperforming prior adversarial imitation learning methods. The VDB can also be combined with adversarial inverse reinforcement learning to learn parsimonious reward functions that can be transferred and re-optimized in new settings. Finally, we demonstrate that VDB can train GANs more effectively for image generation, improving upon a number of prior stabilization methods.

Via

Access Paper or Ask Questions

Learning with Opponent-Learning Awareness

Sep 19, 2018

Jakob N. Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, Igor Mordatch

Figure 1 for Learning with Opponent-Learning Awareness

Figure 2 for Learning with Opponent-Learning Awareness

Figure 3 for Learning with Opponent-Learning Awareness

Figure 4 for Learning with Opponent-Learning Awareness

Abstract:Multi-agent settings are quickly gathering importance in machine learning. This includes a plethora of recent work on deep multi-agent reinforcement learning, but also can be extended to hierarchical RL, generative adversarial networks and decentralised optimisation. In all these settings the presence of multiple learning agents renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment. The LOLA learning rule includes a term that accounts for the impact of one agent's policy on the anticipated parameter update of the other agents. Results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat and therefore cooperation in the iterated prisoners' dilemma, while independent learning does not. In this domain, LOLA also receives higher payouts compared to a naive learner, and is robust against exploitation by higher order gradient-based methods. Applied to repeated matching pennies, LOLA agents converge to the Nash equilibrium. In a round robin tournament we show that LOLA agents successfully shape the learning of a range of multi-agent learning algorithms from literature, resulting in the highest average returns on the IPD. We also show that the LOLA update rule can be efficiently calculated using an extension of the policy gradient estimator, making the method suitable for model-free RL. The method thus scales to large parameter and input spaces and nonlinear function approximators. We apply LOLA to a grid world task with an embedded social dilemma using recurrent policies and opponent modelling. By explicitly considering the learning of the other agent, LOLA agents learn to cooperate out of self-interest. The code is at github.com/alshedivat/lola.

Via

Access Paper or Ask Questions

Model-Based Reinforcement Learning via Meta-Policy Optimization

Sep 14, 2018

Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, Pieter Abbeel

Figure 1 for Model-Based Reinforcement Learning via Meta-Policy Optimization

Figure 2 for Model-Based Reinforcement Learning via Meta-Policy Optimization

Figure 3 for Model-Based Reinforcement Learning via Meta-Policy Optimization

Figure 4 for Model-Based Reinforcement Learning via Meta-Policy Optimization

Abstract:Model-based reinforcement learning approaches carry the promise of being data efficient. However, due to challenges in learning dynamics models that sufficiently match the real-world dynamics, they struggle to achieve the same asymptotic performance as model-free methods. We propose Model-Based Meta-Policy-Optimization (MB-MPO), an approach that foregoes the strong reliance on accurate learned dynamics models. Using an ensemble of learned dynamic models, MB-MPO meta-learns a policy that can quickly adapt to any model in the ensemble with one policy gradient step. This steers the meta-policy towards internalizing consistent dynamics predictions among the ensemble while shifting the burden of behaving optimally w.r.t. the model discrepancies towards the adaptation step. Our experiments show that MB-MPO is more robust to model imperfections than previous model-based approaches. Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience.

* First 2 authors contributed equally. Accepted for Conference on Robot Learning (CoRL)

Via

Access Paper or Ask Questions

Latent Space Policies for Hierarchical Reinforcement Learning

Sep 03, 2018

Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, Sergey Levine

Figure 1 for Latent Space Policies for Hierarchical Reinforcement Learning

Figure 2 for Latent Space Policies for Hierarchical Reinforcement Learning

Figure 3 for Latent Space Policies for Hierarchical Reinforcement Learning

Figure 4 for Latent Space Policies for Hierarchical Reinforcement Learning

Abstract:We address the problem of learning hierarchical deep neural network policies for reinforcement learning. In contrast to methods that explicitly restrict or cripple lower layers of a hierarchy to force them to use higher-level modulating signals, each layer in our framework is trained to directly solve the task, but acquires a range of diverse strategies via a maximum entropy reinforcement learning objective. Each layer is also augmented with latent random variables, which are sampled from a prior distribution during the training of that layer. The maximum entropy objective causes these latent variables to be incorporated into the layer's policy, and the higher level layer can directly control the behavior of the lower layer through this latent space. Furthermore, by constraining the mapping from latent variables to actions to be invertible, higher layers retain full expressivity: neither the higher layers nor the lower layers are constrained in their behavior. Our experimental evaluation demonstrates that we can improve on the performance of single-layer policies on standard benchmark tasks simply by adding additional layers, and that our method can solve more complex sparse-reward tasks by learning higher-level policies on top of high-entropy skills optimized for simple low-level objectives.

* ICML 2018; Videos: https://sites.google.com/view/latent-space-deep-rl Code: https://github.com/haarnoja/sac

Via

Access Paper or Ask Questions