Volodymyr Mnih

Unsupervised Control Through Non-Parametric Discriminative Rewards

Nov 28, 2018

David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, Volodymyr Mnih

Abstract:Learning to control an environment without hand-crafted rewards or expert data remains challenging and is at the frontier of reinforcement learning research. We present an unsupervised learning algorithm to train agents to achieve perceptually-specified goals using only a stream of observations and actions. Our agent simultaneously learns a goal-conditioned policy and a goal achievement reward function that measures how similar a state is to the goal state. This dual optimization leads to a co-operative game, giving rise to a learned reward function that reflects similarity in controllable aspects of the environment instead of distance in the space of observations. We demonstrate the efficacy of our agent to learn, in an unsupervised manner, to reach a diverse set of goals on three domains -- Atari, the DeepMind Control Suite and DeepMind Lab.

* 10 pages + references & 5 page appendix

The Uncertainty Bellman Equation and Exploration

Oct 22, 2018

Brendan O'Donoghue, Ian Osband, Remi Munos, Volodymyr Mnih

Abstract:We consider the exploration/exploitation problem in reinforcement learning. For exploitation, it is well known that the Bellman equation connects the value at any time-step to the expected value at subsequent time-steps. In this paper we consider a similar uncertainty Bellman equation (UBE), which connects the uncertainty at any time-step to the expected uncertainties at subsequent time-steps, thereby extending the potential exploratory benefit of a policy beyond individual time-steps. We prove that the unique fixed point of the UBE yields an upper bound on the variance of the posterior distribution of the Q-values induced by any policy. This bound can be much tighter than traditional count-based bonuses that compound standard deviation rather than variance. Importantly, and unlike several existing approaches to optimism, this method scales naturally to large systems with complex generalization. Substituting our UBE-exploration strategy for ε-greedy improves DQN performance on 51 out of 57 games in the Atari suite.
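
The recursion described in this abstract can be illustrated with a small tabular sketch. It is not the paper's implementation: it assumes a fixed policy folded into a state-to-state transition table, made-up local uncertainty values `nu`, and a squared discount on the propagated uncertainty. The function name `solve_ube` is hypothetical.

```python
def solve_ube(local_uncertainty, transitions, gamma=0.9, iters=500):
    """Iterate u(s) = nu(s) + gamma^2 * sum_s2 P(s2|s) * u(s2) to a fixed point.

    local_uncertainty: per-state local uncertainties nu(s) (illustrative values).
    transitions: transitions[s][s2] is the probability of moving from s to s2
                 under a fixed policy (an assumption of this sketch).
    """
    n = len(local_uncertainty)
    u = [0.0] * n
    for _ in range(iters):
        # One synchronous sweep of the uncertainty recursion over all states.
        u = [
            local_uncertainty[s]
            + gamma ** 2 * sum(transitions[s][s2] * u[s2] for s2 in range(n))
            for s in range(n)
        ]
    return u


# Two-state chain: state 0 always moves to state 1, and state 1 is absorbing.
nu = [1.0, 0.5]
P = [[0.0, 1.0], [0.0, 1.0]]
u = solve_ube(nu, P, gamma=0.9)
```

In this toy chain the uncertainty at state 0 includes the discounted uncertainty accumulated at state 1, which is the sense in which the recursion extends uncertainty beyond a single time-step.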

Learning by Playing - Solving Sparse Reward Tasks from Scratch

Feb 28, 2018

Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, Jost Tobias Springenberg

Abstract:We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in the context of Reinforcement Learning (RL). SAC-X enables learning of complex behaviors - from scratch - in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks, that it attempts to learn simultaneously via off-policy RL. The key idea behind our method is that active (learned) scheduling and execution of auxiliary policies allows the agent to efficiently explore its environment - enabling it to excel at sparse reward RL. Our experiments in several challenging robotic manipulation settings demonstrate the power of our approach.

* A video of the rich set of learned behaviours can be found at https://youtu.be/mPKyvocNe_M

Sample Efficient Actor-Critic with Experience Replay

Jul 10, 2017

Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, Nando de Freitas

Abstract:This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.
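
Truncated importance sampling with bias correction can be illustrated on a single discrete action distribution. The sketch below is an assumed form of the idea, not the paper's code: importance weights are clipped at a threshold `c` for samples from the behaviour policy `mu`, and a correction term under the target policy `pi` restores what the clipping removed. All names and numbers are hypothetical, and expectations are computed exactly rather than sampled.

```python
def truncated_is_with_correction(pi, mu, f, c=1.0):
    """Exact expectations of the truncated term and its correction term.

    pi, mu: target and behaviour action probabilities (mu must be positive).
    f: per-action values whose expectation under pi is wanted.
    c: truncation threshold on the importance weight rho = pi / mu.
    """
    truncated = 0.0   # E_{a~mu}[ min(c, rho(a)) * f(a) ]
    correction = 0.0  # E_{a~pi}[ max(0, 1 - c / rho(a)) * f(a) ]
    for p, m, fa in zip(pi, mu, f):
        rho = p / m
        truncated += m * min(c, rho) * fa
        if rho > 0.0:
            correction += p * max(0.0, 1.0 - c / rho) * fa
    return truncated, correction


pi = [0.7, 0.2, 0.1]
mu = [0.1, 0.3, 0.6]
f = [1.0, -2.0, 3.0]
truncated, correction = truncated_is_with_correction(pi, mu, f, c=2.0)
on_policy = sum(p * fa for p, fa in zip(pi, f))
```

Here the truncated term alone is biased, while the sum of the two terms matches the on-policy expectation. The clipped weight is bounded by `c`, which is what limits the variance of the off-policy part.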

* 20 pages. Prepared for ICLR 2017

Combining policy gradient and Q-learning

Apr 07, 2017

Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, Volodymyr Mnih

Abstract:Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between the fixed points of the regularized policy gradient algorithm and the Q-values. This connection allows us to estimate the Q-values from the action preferences of the policy, to which we apply Q-learning updates. We refer to the new technique as 'PGQL', for policy gradient and Q-learning. We also establish an equivalency between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms. We conclude with some numerical examples that demonstrate improved data efficiency and stability of PGQL. In particular, we tested PGQL on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning.

Using Fast Weights to Attend to the Recent Past

Dec 05, 2016

Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu

Abstract:Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These "fast weights" can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.
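
To make the last claim of this abstract concrete, the sketch below assumes a decayed outer-product (Hebbian-style) update for the fast weight matrix, which the abstract itself does not spell out. Under that assumption, multiplying the matrix by a query vector gives the same result as attending over stored past activity vectors. The decay and rate values and all function names are illustrative.

```python
def update_fast_weights(A, h, decay=0.95, rate=0.5):
    """Assumed update: A <- decay * A + rate * outer(h, h)."""
    n = len(h)
    return [[decay * A[i][j] + rate * h[i] * h[j] for j in range(n)] for i in range(n)]


def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def attention_readout(history, query, decay=0.95, rate=0.5):
    """Equivalent readout from stored states: decayed dot-product weights."""
    out = [0.0] * len(query)
    last = len(history) - 1
    for t, h in enumerate(history):
        score = sum(a * b for a, b in zip(h, query))
        weight = rate * decay ** (last - t) * score
        out = [o + weight * hi for o, hi in zip(out, h)]
    return out


history = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [0.0, 3.0, 1.0]]
query = [1.0, 1.0, -1.0]

A = [[0.0] * 3 for _ in range(3)]
for h in history:
    A = update_fast_weights(A, h)

from_matrix = matvec(A, query)
from_history = attention_readout(history, query)
```

The two readouts agree, but the matrix version never keeps `history` around, which is the memory saving the abstract points to.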

* Added [Schmidhuber 1993] citation to the last paragraph of the introduction. Fixed typo appendix A.1 uniform initialization to 1/\sqrt{H}

Reinforcement Learning with Unsupervised Auxiliary Tasks

Nov 16, 2016

Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu

Abstract:Deep reinforcement learning agents have achieved state-of-the-art results by directly maximising cumulative reward. However, environments contain a much wider variety of possible training signals. In this paper, we introduce an agent that also maximises many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and a challenging suite of first-person, three-dimensional Labyrinth tasks leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth.

Learning values across many orders of magnitude

Aug 16, 2016

Hado van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, David Silver

Abstract:Most learning algorithms are not invariant to the scale of the function that is being approximated. We propose to adaptively normalize the targets used in learning. This is useful in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games, where the rewards were all clipped to a predetermined range. This clipping facilitates learning across many different games with a single learning algorithm, but a clipped reward function can result in qualitatively different behavior. Using the adaptive normalization we can remove this domain-specific heuristic without diminishing overall performance.
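
A minimal sketch of adaptively normalizing learning targets, assuming exponential moving averages of the first and second moments. It covers only the statistics tracking; how the function approximator is kept consistent when the statistics change is outside this sketch. The class name, the step size `beta`, and the example targets are all hypothetical.

```python
import math


class RunningTargetNormalizer:
    """Tracks a running mean and scale of targets and rescales them."""

    def __init__(self, beta=0.01):
        self.beta = beta            # step size of the moving averages (assumed)
        self.mean = 0.0             # running first moment
        self.second_moment = 1.0    # running second moment

    @property
    def std(self):
        # Guard against tiny or negative variance from numerical error.
        variance = self.second_moment - self.mean ** 2
        return math.sqrt(max(variance, 1e-8))

    def update(self, target):
        self.mean = (1 - self.beta) * self.mean + self.beta * target
        self.second_moment = (
            (1 - self.beta) * self.second_moment + self.beta * target ** 2
        )

    def normalize(self, target):
        return (target - self.mean) / self.std

    def denormalize(self, value):
        return value * self.std + self.mean


normalizer = RunningTargetNormalizer(beta=0.01)
for step in range(5000):
    # Unclipped targets on a large scale, alternating between two values.
    normalizer.update(900.0 if step % 2 == 0 else 1100.0)

scaled = normalizer.normalize(1100.0)
restored = normalizer.denormalize(scaled)
```

With statistics like these, a learner can be trained on `scaled` targets of roughly unit size regardless of the raw reward scale, instead of relying on reward clipping.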

* Paper accepted for publication at NIPS 2016. This version includes the appendix

Asynchronous Methods for Deep Reinforcement Learning

Jun 16, 2016

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu

Abstract:We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task of navigating random 3D mazes using a visual input.

* ICML 2016

Strategic Attentive Writer for Learning Macro-Actions

Jun 15, 2016

Alexander Vezhnevets, Volodymyr Mnih, John Agapiou, Simon Osindero, Alex Graves, Oriol Vinyals, Koray Kavukcuoglu

Abstract:We present a novel deep recurrent neural network architecture that learns to build implicit plans in an end-to-end manner by purely interacting with an environment in a reinforcement learning setting. The network builds an internal plan, which is continuously updated upon observation of the next input from the environment. It can also partition this internal representation into contiguous subsequences by learning for how long the plan can be committed to - i.e. followed without re-planning. Combining these properties, the proposed model, dubbed STRategic Attentive Writer (STRAW), can learn high-level, temporally abstracted macro-actions of varying lengths that are solely learnt from data without any prior information. These macro-actions enable both structured exploration and economic computation. We experimentally demonstrate that STRAW delivers strong improvements on several ATARI games by employing temporally extended planning strategies (e.g. Ms. Pacman and Frostbite). It is at the same time a general algorithm that can be applied on any sequence data. To that end, we also show that when trained on a text prediction task, STRAW naturally predicts frequent n-grams (instead of macro-actions), demonstrating the generality of the approach.
