Nando de Freitas

University of British Columbia

Acme: A Research Framework for Distributed Reinforcement Learning

Jun 01, 2020

Matt Hoffman, Bobak Shahriari, John Aslanides, Gabriel Barth-Maron, Feryal Behbahani, Tamara Norman, Abbas Abdolmaleki, Albin Cassirer, Fan Yang, Kate Baumli(+10 more)

Abstract: Deep reinforcement learning has led to many recent, and groundbreaking, advancements. However, these advances have often come at the cost of both the scale and complexity of the underlying RL algorithms. Increases in complexity have in turn made it more difficult for researchers to reproduce published RL algorithms or rapidly prototype ideas. To address this, we introduce Acme, a tool to simplify the development of novel RL algorithms that is specifically designed to enable simple agent implementations that can be run at various scales of execution. Our aim is also to make the results of various RL algorithms developed in academia and industrial labs easier to reproduce and extend. To this end we are releasing baseline implementations of various algorithms, created using our framework. In this work we introduce the major design decisions behind Acme and show how these are used to construct these baselines. We also experiment with these agents at different scales of both complexity and computation, including distributed versions. Ultimately, we show that the design decisions behind Acme lead to agents that can be scaled both up and down and that, for the most part, greater levels of parallelization result in agents with equivalent performance, just faster.

Task-Relevant Adversarial Imitation Learning

Oct 02, 2019

Konrad Zolna, Scott Reed, Alexander Novikov, Sergio Gómez Colmenarejo, David Budden, Serkan Cabi, Misha Denil, Nando de Freitas, Ziyu Wang

Abstract: We show that a critical problem in adversarial imitation from high-dimensional sensory data is the tendency of discriminator networks to distinguish agent and expert behaviour using task-irrelevant features beyond the control of the agent. We analyze this problem in detail and propose a solution as well as several baselines that outperform standard Generative Adversarial Imitation Learning (GAIL). Our proposed solution, Task-Relevant Adversarial Imitation Learning (TRAIL), uses a constrained optimization objective to overcome task-irrelevant features. Comprehensive experiments show that TRAIL can solve challenging manipulation tasks from pixels by imitating human operators, where other agents such as behaviour cloning (BC), standard GAIL, improved GAIL variants including our newly proposed baselines, and Deterministic Policy Gradients from Demonstrations (DPGfD) fail to find solutions, even when the other agents have access to task reward.

A Framework for Data-Driven Robotics

Sep 26, 2019

Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Żołna, Yusuf Aytar, David Budden, Mel Vecerik(+6 more)

Abstract: We present a framework for data-driven robotics that makes use of a large dataset of recorded robot experience and scales to several tasks using learned reward functions. We show how to apply this framework to accomplish three different object manipulation tasks on a real robot platform. Given demonstrations of a task together with task-agnostic recorded experience, we use a special form of human annotation as supervision to learn a reward function, which enables us to deal with real-world tasks where the reward signal cannot be acquired directly. Learned rewards are used in combination with a large dataset of experience from different tasks to learn a robot policy offline using batch RL. We show that using our approach it is possible to train agents to perform a variety of challenging manipulation tasks including stacking rigid objects and handling cloth.

Modular Meta-Learning with Shrinkage

Sep 12, 2019

Yutian Chen, Abram L. Friesen, Feryal Behbahani, David Budden, Matthew W. Hoffman, Arnaud Doucet, Nando de Freitas

Abstract: Most gradient-based approaches to meta-learning do not explicitly account for the fact that different parts of the underlying model adapt by different amounts when applied to a new task. For example, the input layers of an image classification convnet typically adapt very little, while the output layers can change significantly. This can cause parts of the model to begin to overfit while others underfit. To address this, we introduce a hierarchical Bayesian model with per-module shrinkage parameters, which we propose to learn by maximizing an approximation of the predictive likelihood using implicit differentiation. Our algorithm subsumes Reptile and outperforms variants of MAML on two synthetic few-shot meta-learning problems.

* 14 pages (4 main, 8 supplement), under review

Making Efficient Use of Demonstrations to Solve Hard Exploration Problems

Sep 03, 2019

Tom Le Paine, Caglar Gulcehre, Bobak Shahriari, Misha Denil, Matt Hoffman, Hubert Soyer, Richard Tanburn, Steven Kapturowski, Neil Rabinowitz, Duncan Williams(+4 more)

Abstract: This paper introduces R2D3, an agent that makes efficient use of demonstrations to solve hard exploration problems in partially observable environments with highly variable initial conditions. We also introduce a suite of eight tasks that combine these three properties, and show that R2D3 can solve several of the tasks where other state of the art methods (both with and without demonstrations) fail to see even a single successful trajectory after tens of billions of steps of exploration.

Learning Compositional Neural Programs with Recursive Tree Search and Planning

May 30, 2019

Thomas Pierrot, Guillaume Ligner, Scott Reed, Olivier Sigaud, Nicolas Perrin, Alexandre Laterre, David Kas, Karim Beguir, Nando de Freitas

Abstract: We propose a novel reinforcement learning algorithm, AlphaNPI, that incorporates the strengths of Neural Programmer-Interpreters (NPI) and AlphaZero. NPI contributes structural biases in the form of modularity, hierarchy and recursion, which are helpful to reduce sample complexity, improve generalization and increase interpretability. AlphaZero contributes powerful neural network guided search algorithms, which we augment with recursion. AlphaNPI only assumes a hierarchical program specification with sparse rewards: 1 when the program execution satisfies the specification, and 0 otherwise. Using this specification, AlphaNPI is able to train NPI models effectively with RL for the first time, completely eliminating the need for strong supervision in the form of execution traces. The experiments show that AlphaNPI can sort as well as previous strongly supervised NPI variants. The AlphaNPI agent is also trained on a Tower of Hanoi puzzle with two disks and is shown to generalize to puzzles with an arbitrary number of disk

Meta-learning of Sequential Strategies

May 08, 2019

Pedro A. Ortega, Jane X. Wang, Mark Rowland, Tim Genewein, Zeb Kurth-Nelson, Razvan Pascanu, Nicolas Heess, Joel Veness, Alex Pritzel, Pablo Sprechmann(+14 more)

Abstract: In this report we review memory-based meta-learning as a tool for building sample-efficient strategies that learn from past experience to adapt to any task within a target class. Our goal is to equip the reader with the conceptual foundations of this tool for building new, scalable agents that operate on broad domains. To do so, we present basic algorithmic templates for building near-optimal predictors and reinforcement learners which behave as if they had a probabilistic model that allowed them to efficiently exploit task structure. Furthermore, we recast memory-based meta-learning within a Bayesian framework, showing that the meta-learned strategies are near-optimal because they amortize Bayes-filtered data, where the adaptation is implemented in the memory dynamics as a state-machine of sufficient statistics. Essentially, memory-based meta-learning translates the hard problem of probabilistic sequential inference into a regression problem.

* DeepMind Technical Report (15 pages, 6 figures)

Bayesian Optimization in AlphaGo

Dec 17, 2018

Yutian Chen, Aja Huang, Ziyu Wang, Ioannis Antonoglou, Julian Schrittwieser, David Silver, Nando de Freitas

Abstract: During the development of AlphaGo, its many hyper-parameters were tuned with Bayesian optimization multiple times. This automatic tuning process resulted in substantial improvements in playing strength. For example, prior to the match with Lee Sedol, we tuned the latest AlphaGo agent and this improved its win-rate from 50% to 66.5% in self-play games. This tuned version was deployed in the final match. Of course, since we tuned AlphaGo many times during its development cycle, the compounded contribution was even higher than this percentage. It is our hope that this brief case study will be of interest to Go fans, and also provide Bayesian optimization practitioners with some insights and inspiration.

Intrinsic Social Motivation via Causal Influence in Multi-Agent RL

Oct 19, 2018

Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro A. Ortega, DJ Strouse, Joel Z. Leibo, Nando de Freitas

Abstract: We derive a new intrinsic social motivation for multi-agent reinforcement learning (MARL), in which agents are rewarded for having causal influence over another agent's actions. Causal influence is assessed using counterfactual reasoning. The reward does not depend on observing another agent's reward function, and is thus a more realistic approach to MARL than taken in previous work. We show that the causal influence reward is related to maximizing the mutual information between agents' actions. We test the approach in challenging social dilemma environments, where it consistently leads to enhanced cooperation between agents and higher collective reward. Moreover, we find that rewarding influence can lead agents to develop emergent communication protocols. We therefore employ influence to train agents to use an explicit communication channel, and find that it leads to more effective communication and higher collective reward. Finally, we show that influence can be computed by equipping each agent with an internal model that predicts the actions of other agents. This allows the social influence reward to be computed without the use of a centralised controller, and as such represents a significantly more general and scalable inductive bias for MARL with independent agents.
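
The abstract above does not state the formula for the influence reward, so the following is only an illustrative sketch under stated assumptions: discrete actions, two agents, and KL divergence as the measure of how much one agent's chosen action shifts the other agent's action distribution relative to a counterfactual marginal. The function name `causal_influence` and its arguments are hypothetical, not taken from the paper.

```python
import numpy as np


def causal_influence(p_other_given_mine, p_mine, my_action):
    """Hypothetical influence score for one agent at one state.

    p_other_given_mine: array of shape (n_mine, n_other); row a holds the other
        agent's action distribution if this agent takes action a.
    p_mine: array of shape (n_mine,); this agent's own action distribution.
    my_action: index of the action this agent actually took.

    Assumes p_mine[my_action] > 0, so the marginal is positive wherever the
    conditional is positive.
    """
    # Other agent's distribution given the action actually taken.
    conditional = p_other_given_mine[my_action]
    # Counterfactual marginal: average over the actions this agent could have taken.
    marginal = p_mine @ p_other_given_mine
    # KL(conditional || marginal), skipping zero-probability entries.
    mask = conditional > 0
    return float(np.sum(conditional[mask] * np.log(conditional[mask] / marginal[mask])))


# The other agent copies this agent's action exactly: influence is log(2).
copy_policy = np.eye(2)
uniform = np.array([0.5, 0.5])
influence_copy = causal_influence(copy_policy, uniform, my_action=0)

# The other agent ignores this agent entirely: influence is zero.
ignore_policy = np.array([[0.3, 0.7], [0.3, 0.7]])
influence_ignore = causal_influence(ignore_policy, uniform, my_action=1)
```

Under these assumptions, averaging the score over this agent's own action distribution gives the mutual information between the two agents' actions, which is one way to read the link to mutual information mentioned in the abstract.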

One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL

Oct 11, 2018

Tom Le Paine, Sergio Gómez Colmenarejo, Ziyu Wang, Scott Reed, Yusuf Aytar, Tobias Pfaff, Matt W. Hoffman, Gabriel Barth-Maron, Serkan Cabi, David Budden(+1 more)

Abstract: Humans are experts at high-fidelity imitation -- closely mimicking a demonstration, often in one attempt. Humans use this ability to quickly solve a task instance, and to bootstrap learning of new tasks. Achieving these abilities in autonomous agents is an open problem. In this paper, we introduce an off-policy RL algorithm (MetaMimic) to narrow this gap. MetaMimic can learn both (i) policies for high-fidelity one-shot imitation of diverse novel skills, and (ii) policies that enable the agent to solve tasks more efficiently than the demonstrators. MetaMimic relies on the principle of storing all experiences in a memory and replaying these to learn massive deep neural network policies by off-policy RL. This paper introduces, to the best of our knowledge, the largest existing neural networks for deep RL and shows that larger networks with normalization are needed to achieve one-shot high-fidelity imitation on a challenging manipulation task. The results also show that both types of policy can be learned from vision, in spite of the task rewards being sparse, and without access to demonstrator actions.
