Sergey Levine

Stanford University

Dynamical Distance Learning for Unsupervised and Semi-Supervised Skill Discovery

Jul 18, 2019

Kristian Hartikainen, Xinyang Geng, Tuomas Haarnoja, Sergey Levine

Abstract: Reinforcement learning requires manual specification of a reward function to learn a task. While in principle this reward function only needs to specify the task goal, in practice reinforcement learning can be very time-consuming or even infeasible unless the reward function is shaped so as to provide a smooth gradient towards a successful outcome. This shaping is difficult to specify by hand, particularly when the task is learned from raw observations, such as images. In this paper, we study how we can automatically learn dynamical distances: a measure of the expected number of time steps to reach a given goal state from any other state. These dynamical distances can be used to provide well-shaped reward functions for reaching new goals, making it possible to learn complex tasks efficiently. We also show that dynamical distances can be used in a semi-supervised regime, where unsupervised interaction with the environment is used to learn the dynamical distances, while a small amount of preference supervision is used to determine the task goal, without any manually engineered reward function or goal examples. We evaluate our method both in simulation and on a real-world robot. We show that our method can learn locomotion skills in simulation without any supervision. We also show that it can learn to turn a valve with a real-world 9-DoF hand, using raw image observations and ten preference labels, without any other supervision. Videos of the learned skills can be found on the project website: https://sites.google.com/view/skills-via-distance-learning.
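The abstract defines a dynamical distance as the expected number of time steps between two states. A minimal sketch of how regression targets for such a distance could be built from one trajectory, and how a learned distance could serve as a shaped reward, is given below. The function names, the tuple-based state type, and the choice of negative distance as the reward are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]  # hypothetical state representation


def distance_training_pairs(
    trajectory: Sequence[State],
) -> List[Tuple[State, State, int]]:
    """Build (state, later_state, steps_between) regression targets.

    Every pair of states (s_i, s_j) with i <= j in one trajectory is
    labeled with the number of time steps j - i that separated them.
    """
    pairs = []
    for i in range(len(trajectory)):
        for j in range(i, len(trajectory)):
            pairs.append((trajectory[i], trajectory[j], j - i))
    return pairs


def shaped_reward(
    distance_fn: Callable[[State, State], float], state: State, goal: State
) -> float:
    """Assumed shaping: states closer to the goal receive higher reward."""
    return -distance_fn(state, goal)


trajectory = [(0.0,), (1.0,), (2.0,)]
pairs = distance_training_pairs(trajectory)
```

A regressor fitted to these targets would stand in for `distance_fn`; averaging over many trajectories is what turns the per-trajectory step counts into an expected distance.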

* 9+3 pages, 6+1 figures, last two authors (Tuomas Haarnoja, Sergey Levine) advised equally

Dynamics-Aware Unsupervised Discovery of Skills

Jul 02, 2019

Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, Karol Hausman

Abstract: Conventionally, model-based reinforcement learning (MBRL) aims to learn a global model for the dynamics of the environment. A good model can potentially enable planning algorithms to generate a large variety of behaviors and solve diverse tasks. However, learning an accurate model for complex dynamical systems is difficult, and even then, the model might not generalize well outside the distribution of states on which it was trained. In this work, we combine model-based learning with model-free learning of primitives that make model-based planning easy. To that end, we aim to answer the question: how can we discover skills whose outcomes are easy to predict? We propose an unsupervised learning algorithm, Dynamics-Aware Discovery of Skills (DADS), which simultaneously discovers predictable behaviors and learns their dynamics. Our method can leverage continuous skill spaces, theoretically allowing us to learn infinitely many behaviors even for high-dimensional state spaces. We demonstrate that zero-shot planning in the learned latent space significantly outperforms standard MBRL and model-free goal-conditioned RL, can handle sparse-reward tasks, and substantially improves over prior hierarchical RL methods for unsupervised skill discovery.

Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model

Jul 01, 2019

Alex X. Lee, Anusha Nagabandi, Pieter Abbeel, Sergey Levine

Abstract: Deep reinforcement learning (RL) algorithms can use high-capacity deep networks to learn directly from image observations. However, these kinds of observation spaces present a number of challenges in practice, since the policy must now solve two problems: a representation learning problem, and a task learning problem. In this paper, we aim to explicitly learn representations that can accelerate reinforcement learning from images. We propose the stochastic latent actor-critic (SLAC) algorithm: a sample-efficient and high-performing RL algorithm for learning policies for complex continuous control tasks directly from high-dimensional image inputs. SLAC learns a compact latent representation space using a stochastic sequential latent variable model, and then learns a critic model within this latent space. By learning a critic within a compact state space, SLAC can learn much more efficiently than standard RL methods. The proposed model improves performance substantially over alternative representations as well, such as variational autoencoders. In fact, our experimental evaluation demonstrates that the sample efficiency of our resulting method is comparable to that of model-based RL methods that directly use a similar type of model for control. Furthermore, our method outperforms both model-free and model-based alternatives in terms of final performance and sample efficiency, on a range of difficult image-based control tasks.

* Project website: https://alexlee-gk.github.io/slac/

Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives

Jun 25, 2019

Anirudh Goyal, Shagun Sodhani, Jonathan Binas, Xue Bin Peng, Sergey Levine, Yoshua Bengio

Abstract: Reinforcement learning agents that operate in diverse and complex environments can benefit from the structured decomposition of their behavior. Often, this is addressed in the context of hierarchical reinforcement learning, where the aim is to decompose a policy into lower-level primitives or options, and a higher-level meta-policy that triggers the appropriate behaviors for a given situation. However, the meta-policy must still produce appropriate decisions in all states. In this work, we propose a policy design that decomposes into primitives, similarly to hierarchical reinforcement learning, but without a high-level meta-policy. Instead, each primitive can decide for itself whether it wishes to act in the current state. We use an information-theoretic mechanism for enabling this decentralized decision: each primitive chooses how much information it needs about the current state to make a decision, and the primitive that requests the most information about the current state acts in the world. The primitives are regularized to use as little information as possible, which leads to natural competition and specialization. We experimentally demonstrate that this policy architecture improves over both flat and hierarchical policies in terms of generalization.
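The selection rule described above (the primitive that requests the most information acts) can be illustrated with a small sketch. It assumes, purely for illustration, that each primitive encodes the state as a diagonal Gaussian and that "information requested" is measured as the KL divergence from a standard normal prior; the abstract does not specify either choice, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (mean, log_std) of a primitive's diagonal Gaussian state encoding
Encoding = Tuple[Sequence[float], Sequence[float]]


def gaussian_kl_to_standard_normal(
    mean: Sequence[float], log_std: Sequence[float]
) -> float:
    """KL(N(mean, std^2) || N(0, 1)), summed over dimensions."""
    return sum(
        0.5 * (m * m + math.exp(2.0 * ls) - 1.0) - ls
        for m, ls in zip(mean, log_std)
    )


def select_primitive(encodings: List[Encoding]) -> int:
    """Return the index of the primitive with the largest information cost."""
    costs = [gaussian_kl_to_standard_normal(m, ls) for m, ls in encodings]
    return max(range(len(costs)), key=costs.__getitem__)


# Primitive 0 ignores the state (encoding equals the prior);
# primitive 1 shifts its encoding away from the prior.
encodings = [([0.0], [0.0]), ([2.0], [0.0])]
chosen = select_primitive(encodings)
```

Penalizing the same KL term during training would be one way to realize the regularization toward using as little information as possible.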

* Preprint, Under Review

Off-Policy Evaluation via Off-Policy Classification

Jun 20, 2019

Alex Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, Sergey Levine

Abstract: In this work, we consider the problem of model selection for deep reinforcement learning (RL) in real-world environments. Typically, the performance of deep RL algorithms is evaluated via on-policy interactions with the target environment. However, comparing models in a real-world environment for the purposes of early stopping or hyperparameter tuning is costly and often practically infeasible. This leads us to examine off-policy policy evaluation (OPE) in such settings. We focus on OPE for value-based methods, which are of particular interest in deep RL, with applications like robotics, where off-policy algorithms based on Q-function estimation can often attain better sample complexity than direct policy optimization. Existing OPE metrics rely either on a model of the environment or on importance sampling (IS) to correct for the data being off-policy. However, for high-dimensional observations, such as images, models of the environment can be difficult to fit and value-based methods can make IS hard to use or even ill-conditioned, especially when dealing with continuous action spaces. In this paper, we focus on the specific case of MDPs with continuous action spaces and sparse binary rewards, which is representative of many important real-world applications. We propose an alternative metric that relies on neither models nor IS, by framing OPE as a positive-unlabeled (PU) classification problem with the Q-function as the decision function. We experimentally show that this metric outperforms baselines on a number of tasks. Most importantly, it can reliably predict the relative performance of different policies in a number of generalization scenarios, including the transfer to the real world of policies trained in simulation for an image-based robotic manipulation task.

* Accepted to ICML 2019 RL4RealLife workshop

When to Trust Your Model: Model-Based Policy Optimization

Jun 19, 2019

Michael Janner, Justin Fu, Marvin Zhang, Sergey Levine

Abstract: Designing effective model-based reinforcement learning algorithms is difficult because the ease of data generation must be weighed against the bias of model-generated data. In this paper, we study the role of model usage in policy optimization both theoretically and empirically. We first formulate and analyze a model-based reinforcement learning algorithm with a guarantee of monotonic improvement at each step. In practice, this analysis is overly pessimistic and suggests that real off-policy data is always preferable to model-generated on-policy data, but we show that an empirical estimate of model generalization can be incorporated into such analysis to justify model usage. Motivated by this analysis, we then demonstrate that a simple procedure of using short model-generated rollouts branched from real data has the benefits of more complicated model-based algorithms without the usual pitfalls. In particular, this approach surpasses the sample efficiency of prior model-based methods, matches the asymptotic performance of the best model-free algorithms, and scales to horizons that cause other model-based methods to fail entirely.
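The "short model-generated rollouts branched from real data" procedure can be sketched as follows. This is an illustrative outline only: `model_step` and `policy` are hypothetical callables standing in for a learned dynamics model and the current policy, and the toy dynamics in the usage lines are made up for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Transition = Tuple[float, float, float, float]  # (state, action, reward, next_state)


def branched_rollouts(
    real_states: Sequence[float],
    model_step: Callable[[float, float], Tuple[float, float]],
    policy: Callable[[float], float],
    horizon: int,
    num_branches: int,
    rng: random.Random,
) -> List[Transition]:
    """Generate short model rollouts, each starting from a real state."""
    synthetic = []
    for _ in range(num_branches):
        # Branch point: a state actually observed in the environment.
        state = rng.choice(real_states)
        for _ in range(horizon):
            action = policy(state)
            next_state, reward = model_step(state, action)
            synthetic.append((state, action, reward, next_state))
            state = next_state
    return synthetic


# Toy stand-ins: the "model" adds the action to the state, reward is -|next state|.
toy_model = lambda s, a: (s + a, -abs(s + a))
toy_policy = lambda s: 1.0
real_states = [0.0, 10.0]
synthetic = branched_rollouts(
    real_states, toy_model, toy_policy, horizon=3, num_branches=2, rng=random.Random(0)
)
```

Keeping `horizon` small limits how far model error can compound, while starting every branch from a real state keeps the synthetic data close to the observed state distribution.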

* Project page: https://people.eecs.berkeley.edu/~janner/mbpo/

SQIL: Imitation Learning via Regularized Behavioral Cloning

Jun 14, 2019

Siddharth Reddy, Anca D. Dragan, Sergey Levine

Abstract: Learning to imitate expert behavior given action demonstrations containing high-dimensional, continuous observations and unknown dynamics is a difficult problem in robotic control. Simple approaches based on behavioral cloning (BC) suffer from state distribution shift, while more complex methods that generalize to out-of-distribution states can be difficult to use, since they typically involve adversarial optimization. We propose an alternative that combines the simplicity of BC with the robustness of adversarial imitation learning. The key insight is that under the maximum entropy model of expert behavior, BC corresponds to fitting a soft Q function that maximizes the likelihood of observed actions. This perspective suggests a way to regularize BC so that it generalizes to out-of-distribution states: combine the standard maximum-likelihood objective with a penalty on the soft Bellman error of the soft Q function. We show that this penalty term gives the agent an incentive to take actions that lead it back to demonstrated states when it encounters new states. Experiments show that our method outperforms BC and GAIL on a variety of image-based and low-dimensional environments in Box2D, Atari, and MuJoCo.
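For a discrete action space, the combination of a maximum-likelihood term with a soft Bellman error penalty might look like the sketch below. It assumes the maximum entropy policy π(a|s) = exp(Q(s,a) − V(s)) with V(s) = log Σ_a exp Q(s,a); the reward argument, discount, and penalty weight are placeholders that the abstract does not specify, and the function names are hypothetical.

```python
import math
from typing import Sequence


def soft_value(q_values: Sequence[float]) -> float:
    """V(s) = log sum_a exp(Q(s, a)), computed stably."""
    m = max(q_values)
    return m + math.log(sum(math.exp(q - m) for q in q_values))


def bc_negative_log_likelihood(q_values: Sequence[float], action: int) -> float:
    """-log pi(a|s) under pi(a|s) = exp(Q(s, a) - V(s))."""
    return soft_value(q_values) - q_values[action]


def soft_bellman_error(
    q_values: Sequence[float],
    action: int,
    reward: float,
    next_q_values: Sequence[float],
    gamma: float,
) -> float:
    """Squared gap between Q(s, a) and reward + gamma * V(s')."""
    target = reward + gamma * soft_value(next_q_values)
    return (q_values[action] - target) ** 2


def regularized_bc_loss(
    q_values: Sequence[float],
    action: int,
    reward: float,
    next_q_values: Sequence[float],
    gamma: float = 0.99,   # assumed discount
    penalty: float = 1.0,  # assumed penalty weight
) -> float:
    """Maximum-likelihood BC term plus the soft Bellman error penalty."""
    return bc_negative_log_likelihood(q_values, action) + penalty * soft_bellman_error(
        q_values, action, reward, next_q_values, gamma
    )
```

With `penalty=0.0` the loss reduces to plain behavioral cloning on the soft Q function, which makes the role of the regularizer easy to isolate.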

Deep Reinforcement Learning for Industrial Insertion Tasks with Visual Inputs and Natural Rewards

Jun 13, 2019

Gerrit Schoettler, Ashvin Nair, Jianlan Luo, Shikhar Bahl, Juan Aparicio Ojea, Eugen Solowjow, Sergey Levine

Abstract: Connector insertion and many other tasks commonly found in modern manufacturing settings involve complex contact dynamics and friction. Since it is difficult to capture related physical effects with first-order modeling, traditional control methods often result in brittle and inaccurate controllers, which have to be manually tuned. Reinforcement learning (RL) methods have been demonstrated to be capable of learning controllers in such environments from autonomous interaction with the environment, but running RL algorithms in the real world poses sample efficiency and safety challenges. Moreover, in practical real-world settings we cannot assume access to perfect state information or dense reward signals. In this paper, we consider a variety of difficult industrial insertion tasks with visual inputs and different natural reward specifications, namely sparse rewards and goal images. We show that methods that combine RL with prior information, such as classical controllers or demonstrations, can solve these tasks from a reasonable amount of real-world interaction.

Efficient Exploration via State Marginal Matching

Jun 12, 2019

Lisa Lee, Benjamin Eysenbach, Emilio Parisotto, Eric Xing, Sergey Levine, Ruslan Salakhutdinov

Abstract: To solve tasks with sparse rewards, reinforcement learning algorithms must be equipped with suitable exploration techniques. However, it is unclear what underlying objective is being optimized by existing exploration algorithms, or how they can be altered to incorporate prior knowledge about the task. Most importantly, it is difficult to use exploration experience from one task to acquire exploration strategies for another task. We address these shortcomings by learning a single exploration policy that can quickly solve a suite of downstream tasks in a multi-task setting, amortizing the cost of learning to explore. We recast exploration as a problem of State Marginal Matching (SMM): we learn a mixture of policies for which the state marginal distribution matches a given target state distribution, which can incorporate prior knowledge about the task. Without any prior knowledge, the SMM objective reduces to maximizing the marginal state entropy. We optimize the objective by reducing it to a two-player, zero-sum game, where we iteratively fit a state density model and then update the policy to visit states with low density under this model. While many previous algorithms for exploration employ a similar procedure, they omit a crucial historical averaging step, without which the iterative procedure does not converge to a Nash equilibrium. To parallelize exploration, we extend our algorithm to use mixtures of policies, wherein we discover connections between SMM and previously proposed skill learning methods based on mutual information. On complex navigation and manipulation tasks, we demonstrate that our algorithm explores faster and adapts more quickly to new tasks.
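The alternation between fitting a state density model and rewarding low-density states can be sketched for a small discrete state space. The count-based density, the smoothing constant, and the reward form log p*(s) − log q(s) are assumptions chosen for illustration; historical averaging is approximated here by fitting the density to states visited across all past iterations rather than only the latest one.

```python
import math
from collections import Counter
from typing import List, Sequence


def fit_density(
    visited_states: Sequence[int], num_states: int, smoothing: float = 1.0
) -> List[float]:
    """Smoothed empirical state distribution over all states visited so far."""
    counts = Counter(visited_states)
    total = len(visited_states) + smoothing * num_states
    return [(counts[s] + smoothing) / total for s in range(num_states)]


def smm_reward(state: int, target: Sequence[float], density: Sequence[float]) -> float:
    """High where the target wants mass but the fitted density has little."""
    return math.log(target[state]) - math.log(density[state])


# History pooled over past iterations: state 0 is over-visited, state 2 never seen.
history = [0, 0, 0, 1]
density = fit_density(history, num_states=3)
uniform_target = [1.0 / 3.0] * 3
rewards = [smm_reward(s, uniform_target, density) for s in range(3)]
```

With a uniform target the reward depends only on −log q(s), so the policy is pushed toward rarely visited states, consistent with the entropy-maximizing special case mentioned in the abstract.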

* Videos and code: https://sites.google.com/view/state-marginal-matching

Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

Jun 12, 2019

Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine

Abstract: The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning. Planning algorithms effectively reason over long horizons, but assume access to a local policy and distance metric over collision-free paths. Reinforcement learning excels at learning policies and the relative values of states, but fails to plan over long horizons. Despite the successes of each method in various domains, tasks that require reasoning over long horizons with limited feedback and high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms. Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human only needs to provide an example goal state) and avoid reward shaping, which can bias the agent towards finding a sub-optimal solution. We introduce a general control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. Our aim is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a subgoal. Planning algorithms can automatically find these waypoints, but only if provided with suitable abstractions of the environment -- namely, a graph consisting of nodes and edges. Our main insight is that this graph can be constructed via reinforcement learning, where a goal-conditioned value function provides edge weights, and nodes are taken to be previously seen observations in a replay buffer. Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments. Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse reward tasks over one hundred steps, and generalizes substantially better than standard RL algorithms.
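The graph construction and search described above can be sketched with Dijkstra's algorithm over buffered states. Here `distance_fn` is a hypothetical stand-in for the distance implied by a goal-conditioned value function, and the `max_edge` cutoff (dropping edges the distance estimate deems too long) is an assumption added for illustration.

```python
import heapq
from typing import Callable, List, Sequence


def shortest_waypoint_path(
    states: Sequence[float],
    distance_fn: Callable[[float, float], float],
    start: int,
    goal: int,
    max_edge: float,
) -> List[int]:
    """Dijkstra over buffer states; returns node indices from start to goal.

    Returns an empty list when the goal is unreachable.
    """
    best = {start: 0.0}
    parent = {}
    frontier = [(0.0, start)]
    while frontier:
        cost, u = heapq.heappop(frontier)
        if u == goal:
            break
        if cost > best.get(u, float("inf")):
            continue  # stale queue entry
        for v in range(len(states)):
            if v == u:
                continue
            weight = distance_fn(states[u], states[v])
            if weight > max_edge:
                continue  # treat distant pairs as unconnected
            new_cost = cost + weight
            if new_cost < best.get(v, float("inf")):
                best[v] = new_cost
                parent[v] = u
                heapq.heappush(frontier, (new_cost, v))
    if goal not in best:
        return []
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    path.reverse()
    return path


# Toy buffer of 1-D states with absolute difference as the assumed distance.
buffer_states = [0.0, 1.0, 2.0, 3.0]
path = shortest_waypoint_path(buffer_states, lambda a, b: abs(a - b), 0, 3, max_edge=1.5)
```

The intermediate indices in `path` play the role of subgoals: a goal-conditioned policy would be asked to reach each one in turn instead of the distant goal directly.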

* Run our algorithm in your browser: http://bit.ly/rl_search
