Remi Munos

INRIA Lille

Actor-Critic Policy Optimization in Partially Observable Multiagent Environments

Oct 21, 2018

Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Perolat, Karl Tuyls, Remi Munos, Michael Bowling

Abstract: Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.

* NIPS 2018

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

Jun 28, 2018

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning (+2 more)

Abstract: In this work we aim to solve a large collection of tasks using a single reinforcement learning agent with a single set of parameters. A key challenge is to handle the increased amount of data and extended training time. We have developed a new distributed agent IMPALA (Importance Weighted Actor-Learner Architecture) that not only uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation. We achieve stable learning at high throughput by combining decoupled acting and learning with a novel off-policy correction method called V-trace. We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari-57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach.

The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning

Jun 19, 2018

Audrunas Gruslys, Will Dabney, Mohammad Gheshlaghi Azar, Bilal Piot, Marc Bellemare, Remi Munos

Abstract: In this work we present a new agent architecture, called Reactor, which combines multiple algorithmic and architectural contributions to produce an agent with higher sample-efficiency than Prioritized Dueling DQN (Wang et al., 2016) and Categorical DQN (Bellemare et al., 2017), while giving better run-time performance than A3C (Mnih et al., 2016). Our first contribution is a new policy evaluation algorithm called Distributional Retrace, which brings multi-step off-policy updates to the distributional reinforcement learning setting. The same approach can be used to convert several classes of multi-step policy evaluation algorithms designed for expected value evaluation into distributional ones. Next, we introduce the β-leave-one-out policy gradient algorithm, which improves the trade-off between variance and bias by using action values as a baseline. Our final algorithmic contribution is a new prioritized replay algorithm for sequences, which exploits the temporal locality of neighboring observations for more efficient replay prioritization. Using the Atari 2600 benchmarks, we show that each of these innovations contributes to both the sample efficiency and final agent performance. Finally, we demonstrate that Reactor reaches state-of-the-art performance after 200 million frames and less than a day of training.

Maximum a Posteriori Policy Optimisation

Jun 14, 2018

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, Martin Riedmiller

Abstract: We introduce a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO) based on coordinate ascent on a relative entropy objective. We show that several existing methods can directly be related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state-of-the-art in deep reinforcement learning. In particular, for continuous control, our method outperforms existing methods with respect to sample efficiency, premature convergence and robustness to hyperparameter settings while achieving similar or better final performance.

Low-pass Recurrent Neural Networks - A memory architecture for longer-term correlation discovery

May 13, 2018

Thomas Stepleton, Razvan Pascanu, Will Dabney, Siddhant M. Jayakumar, Hubert Soyer, Remi Munos

Abstract: Reinforcement learning (RL) agents performing complex tasks must be able to remember observations and actions across sizable time intervals. This is especially true during the initial learning stages, when exploratory behaviour can increase the delay between specific actions and their effects. Many new or popular approaches for learning these distant correlations employ backpropagation through time (BPTT), but this technique requires storing observation traces long enough to span the interval between cause and effect. Besides memory demands, learning dynamics like vanishing gradients and slow convergence due to infrequent weight updates can reduce BPTT's practicality; meanwhile, although online recurrent network learning is a developing topic, most approaches are not efficient enough to use as replacements. We propose a simple, effective memory strategy that can extend the window over which BPTT can learn without requiring longer traces. We explore this approach empirically on a few tasks and discuss its implications.

A Study on Overfitting in Deep Reinforcement Learning

Apr 20, 2018

Chiyuan Zhang, Oriol Vinyals, Remi Munos, Samy Bengio

Abstract: Recent years have witnessed significant progress in deep Reinforcement Learning (RL). Empowered with large scale neural networks, carefully designed architectures, novel training algorithms and massively parallel computing devices, researchers are able to attack many challenging RL problems. However, in machine learning, more training power comes with a potential risk of more overfitting. As deep RL techniques are being applied to critical problems such as healthcare and finance, it is important to understand the generalization behaviors of the trained agents. In this paper, we conduct a systematic study of standard RL agents and find that they could overfit in various ways. Moreover, overfitting could happen "robustly": commonly used techniques in RL that add stochasticity do not necessarily prevent or detect overfitting. In particular, the same agents and learning algorithms could have drastically different test performance, even when all of them achieve optimal rewards during training. The observations call for more principled and careful evaluation protocols in RL. We conclude with a general discussion on overfitting in RL and a study of the generalization behaviors from the perspective of inductive bias.

Noisy Networks for Exploration

Feb 15, 2018

Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin (+2 more)

Abstract: We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent's policy can be used to aid efficient exploration. The parameters of the noise are learned with gradient descent along with the remaining network weights. NoisyNet is straightforward to implement and adds little computational overhead. We find that replacing the conventional exploration heuristics for A3C, DQN and dueling agents (entropy reward and $\epsilon$-greedy respectively) with NoisyNet yields substantially higher scores for a wide range of Atari games, in some cases advancing the agent from sub- to super-human performance.

* ICLR 2018
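
As a rough illustration of "parametric noise added to its weights", the sketch below computes a linear layer whose weights and biases are perturbed as mean plus scale times Gaussian noise. The function name `noisy_linear`, the independent per-weight noise, and the plain-list representation are assumptions made for this example; the abstract does not specify the noise form, and no learning step is shown.

```python
import random


def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    """Return y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b).

    mu_* and sigma_* would be the learnable parameters; eps_* are fresh
    standard-normal samples drawn on every call (illustrative assumption).
    """
    out = []
    for i in range(len(mu_w)):
        # Perturb the bias of output unit i.
        acc = mu_b[i] + sigma_b[i] * rng.gauss(0.0, 1.0)
        for j in range(len(x)):
            # Perturb each weight independently before using it.
            w = mu_w[i][j] + sigma_w[i][j] * rng.gauss(0.0, 1.0)
            acc += w * x[j]
        out.append(acc)
    return out


rng = random.Random(0)
mu_w = [[1.0, 2.0], [3.0, 4.0]]
mu_b = [0.5, -0.5]
zeros_w = [[0.0, 0.0], [0.0, 0.0]]
zeros_b = [0.0, 0.0]

# With all noise scales at zero the layer reduces to an ordinary linear map.
deterministic = noisy_linear([1.0, 1.0], mu_w, zeros_w, mu_b, zeros_b, rng)

# With non-zero scales, repeated calls give different outputs.
small_w = [[0.1, 0.1], [0.1, 0.1]]
small_b = [0.1, 0.1]
sample_a = noisy_linear([1.0, 1.0], mu_w, small_w, mu_b, small_b, rng)
sample_b = noisy_linear([1.0, 1.0], mu_w, small_w, mu_b, small_b, rng)
```

Setting every scale to zero recovers the deterministic layer, which is one way to see that the noise scales control how much stochasticity the policy carries.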

Sample Efficient Actor-Critic with Experience Replay

Jul 10, 2017

Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, Nando de Freitas

Abstract: This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. To achieve this, the paper introduces several innovations, including truncated importance sampling with bias correction, stochastic dueling network architectures, and a new trust region policy optimization method.

* 20 pages. Prepared for ICLR 2017

Count-Based Exploration with Neural Density Models

Jun 14, 2017

Georg Ostrovski, Marc G. Bellemare, Aaron van den Oord, Remi Munos

Abstract: Bellemare et al. (2016) introduced the notion of a pseudo-count, derived from a density model, to generalize count-based exploration to non-tabular reinforcement learning. This pseudo-count was used to generate an exploration bonus for a DQN agent and combined with a mixed Monte Carlo update was sufficient to achieve state of the art on the Atari 2600 game Montezuma's Revenge. We consider two questions left open by their work: First, how important is the quality of the density model for exploration? Second, what role does the Monte Carlo update play in exploration? We answer the first question by demonstrating the use of PixelCNN, an advanced neural density model for images, to supply a pseudo-count. In particular, we examine the intrinsic difficulties in adapting Bellemare et al.'s approach when assumptions about the model are violated. The result is a more practical and general algorithm requiring no special apparatus. We combine PixelCNN pseudo-counts with different agent architectures to dramatically improve the state of the art on several hard Atari games. One surprising finding is that the mixed Monte Carlo update is a powerful facilitator of exploration in the sparsest of settings, including Montezuma's Revenge.
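
To make the idea of a pseudo-count derived from a density model concrete, here is a minimal sketch. It assumes the commonly cited form N = rho * (1 - rho') / (rho' - rho), where rho is the model's probability of an observation before training on it and rho' is the probability just after. The function name and the choice to return infinity when rho' does not exceed rho are illustrative assumptions, not something the abstract states.

```python
import math


def pseudo_count(rho, rho_prime):
    """Pseudo-count implied by a density model's prediction gain.

    rho:       probability the model assigns to x before observing it
    rho_prime: probability the model assigns to x after one more update on x
    """
    if rho_prime <= rho:
        # The model did not become more confident in x, so the formula's
        # assumption is violated; infinity is an illustrative fallback.
        return math.inf
    return rho * (1.0 - rho_prime) / (rho_prime - rho)


# Sanity check with an empirical-frequency model: x seen 3 times in 10 draws.
seen, total = 3, 10
rho = seen / total                    # 3/10 before the new observation
rho_prime = (seen + 1) / (total + 1)  # 4/11 after it
recovered = pseudo_count(rho, rho_prime)
```

For an empirical-frequency model the formula recovers the true visit count, which is why it is a reasonable generalization of counting to models such as PixelCNN that never store counts explicitly.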

Automated Curriculum Learning for Neural Networks

Apr 10, 2017

Alex Graves, Marc G. Bellemare, Jacob Menick, Remi Munos, Koray Kavukcuoglu

Abstract: We introduce a method for automatically selecting the path, or syllabus, that a neural network follows through a curriculum so as to maximise learning efficiency. A measure of the amount that the network learns from each data sample is provided as a reward signal to a nonstationary multi-armed bandit algorithm, which then determines a stochastic syllabus. We consider a range of signals derived from two distinct indicators of learning progress: rate of increase in prediction accuracy, and rate of increase in network complexity. Experimental results for LSTM networks on three curricula demonstrate that our approach can significantly accelerate learning, in some cases halving the time required to attain a satisfactory performance level.
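
The bandit-driven syllabus can be sketched with a basic Exp3 bandit, in which each arm stands for one task in the curriculum and the reward is a learning-progress signal scaled to [0, 1]. Exp3 is used here only as a simple stand-in for an adversarial bandit; the abstract does not name the algorithm, and the class name, `gamma` value, and reward scaling are assumptions of this example.

```python
import math
import random


class Exp3:
    """Basic Exp3 bandit over curriculum tasks (illustrative stand-in)."""

    def __init__(self, n_arms, gamma=0.1):
        self.n_arms = n_arms
        self.gamma = gamma            # exploration rate in (0, 1]
        self.weights = [1.0] * n_arms

    def probabilities(self):
        """Mix the weight-proportional distribution with uniform exploration."""
        total = sum(self.weights)
        return [
            (1.0 - self.gamma) * w / total + self.gamma / self.n_arms
            for w in self.weights
        ]

    def select(self, rng):
        """Sample the next task according to the stochastic syllabus."""
        return rng.choices(range(self.n_arms), weights=self.probabilities())[0]

    def update(self, arm, reward):
        """Importance-weighted update; reward is assumed to lie in [0, 1]."""
        p = self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * (reward / p) / self.n_arms)


rng = random.Random(0)
bandit = Exp3(n_arms=3)

# Pretend task 0 always yields maximal learning progress.
for _ in range(50):
    bandit.update(0, 1.0)

probs = bandit.probabilities()
next_task = bandit.select(rng)
```

After repeated high rewards for task 0, the syllabus samples it most often while the uniform term keeps every task reachable, which matters when the most useful task changes as training proceeds.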
