Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Marc Lanctot

Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games using Baselines

Sep 09, 2018

Martin Schmid, Neil Burch, Marc Lanctot, Matej Moravcik, Rudolf Kadlec, Michael Bowling

Figure 1 for Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games using Baselines

Figure 2 for Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games using Baselines

Figure 3 for Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games using Baselines

Figure 4 for Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games using Baselines

Abstract:Learning strategies for imperfect information games from samples of interaction is a challenging problem. A common method for this setting, Monte Carlo Counterfactual Regret Minimization (MCCFR), can have slow long-term convergence rates due to high variance. In this paper, we introduce a variance reduction technique (VR-MCCFR) that applies to any sampling variant of MCCFR. Using this technique, per-iteration estimated values and updates are reformulated as a function of sampled values and state-action baselines, similar to their use in policy gradient reinforcement learning. The new formulation allows estimates to be bootstrapped from other estimates within the same episode, propagating the benefits of baselines along the sampled trajectory; the estimates remain unbiased even when bootstrapping from other estimates. Finally, we show that given a perfect baseline, the variance of the value estimates can be reduced to zero. Experimental evaluation shows that VR-MCCFR brings an order of magnitude speedup, while the empirical variance decreases by three orders of magnitude. The decreased variance allows for the first time CFR+ to be used with sampling, increasing the speedup to two orders of magnitude.

Via

Access Paper or Ask Questions

Emergent Communication through Negotiation

Apr 11, 2018

Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z Leibo, Karl Tuyls, Stephen Clark

Figure 1 for Emergent Communication through Negotiation

Figure 2 for Emergent Communication through Negotiation

Figure 3 for Emergent Communication through Negotiation

Figure 4 for Emergent Communication through Negotiation

Abstract:Multi-agent reinforcement learning offers a way to study how communication could emerge in communities of agents needing to solve specific problems. In this paper, we study the emergence of communication in the negotiation environment, a semi-cooperative model of agent interaction. We introduce two communication protocols -- one grounded in the semantics of the game, and one which is \textit{a priori} ungrounded and is a form of cheap talk. We show that self-interested agents can use the pre-grounded communication channel to negotiate fairly, but are unable to effectively use the ungrounded channel. However, prosocial agents do learn to use cheap talk to find an optimal negotiating strategy, suggesting that cooperation is necessary for language to emerge. We also study communication behaviour in a setting where one agent interacts with agents in a community with different levels of prosociality and show how agent identifiability can aid negotiation.

* Published as a conference paper at ICLR 2018

Via

Access Paper or Ask Questions

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Dec 05, 2017

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel(+3 more)

Figure 1 for Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Figure 2 for Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Figure 3 for Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Figure 4 for Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

Abstract:The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.

Via

Access Paper or Ask Questions

Deep Q-learning from Demonstrations

Nov 22, 2017

Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold(+4 more)

Figure 1 for Deep Q-learning from Demonstrations

Figure 2 for Deep Q-learning from Demonstrations

Figure 3 for Deep Q-learning from Demonstrations

Abstract:Deep reinforcement learning (RL) has achieved several high profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN) as it starts with better scores on the first million steps on 41 of 42 games and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.

* Published at AAAI 2018. Previously on arxiv as "Learning from Demonstrations for Real World Reinforcement Learning"

Via

Access Paper or Ask Questions

A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning

Nov 07, 2017

Marc Lanctot, Vinicius Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Perolat, David Silver, Thore Graepel

Figure 1 for A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning

Figure 2 for A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning

Figure 3 for A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning

Figure 4 for A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning

Abstract:To achieve general intelligence, agents must learn how to interact with others in a shared environment: this is the challenge of multiagent reinforcement learning (MARL). The simplest form is independent reinforcement learning (InRL), where each agent treats its experience as part of its (non-stationary) environment. In this paper, we first observe that policies learned using InRL can overfit to the other agents' policies during training, failing to sufficiently generalize during execution. We introduce a new metric, joint-policy correlation, to quantify this effect. We describe an algorithm for general MARL, based on approximate best responses to mixtures of policies generated using deep reinforcement learning, and empirical game-theoretic analysis to compute meta-strategies for policy selection. The algorithm generalizes previous ones such as InRL, iterated best response, double oracle, and fictitious play. Then, we present a scalable implementation which reduces the memory requirement using decoupled meta-solvers. Finally, we demonstrate the generality of the resulting policies in two partially observable settings: gridworld coordination games and poker.

* Camera-ready copy of NIPS 2017 paper, including appendix

Via

Access Paper or Ask Questions

Value-Decomposition Networks For Cooperative Multi-Agent Learning

Jun 16, 2017

Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls(+1 more)

Figure 1 for Value-Decomposition Networks For Cooperative Multi-Agent Learning

Figure 2 for Value-Decomposition Networks For Cooperative Multi-Agent Learning

Abstract:We study the problem of cooperative multi-agent reinforcement learning with a single joint reward signal. This class of learning problems is difficult because of the often large combined action and observation spaces. In the fully centralized and decentralized approaches, we find the problem of spurious rewards and a phenomenon we call the "lazy agent" problem, which arises due to partial observability. We address these problems by training individual agents with a novel value decomposition network architecture, which learns to decompose the team value function into agent-wise value functions. We perform an experimental evaluation across a range of partially-observable multi-agent domains and show that learning such value-decompositions leads to superior results, in particular when combined with weight sharing, role information and information channels.

Via

Access Paper or Ask Questions

Memory-Efficient Backpropagation Through Time

Jun 10, 2016

Audrūnas Gruslys, Remi Munos, Ivo Danihelka, Marc Lanctot, Alex Graves

Figure 1 for Memory-Efficient Backpropagation Through Time

Figure 2 for Memory-Efficient Backpropagation Through Time

Figure 3 for Memory-Efficient Backpropagation Through Time

Figure 4 for Memory-Efficient Backpropagation Through Time

Abstract:We propose a novel approach to reduce memory consumption of the backpropagation through time (BPTT) algorithm when training recurrent neural networks (RNNs). Our approach uses dynamic programming to balance a trade-off between caching of intermediate results and recomputation. The algorithm is capable of tightly fitting within almost any user-set memory budget while finding an optimal execution policy minimizing the computational cost. Computational devices have limited memory capacity and maximizing a computational performance given a fixed memory budget is a practical use-case. We provide asymptotic computational upper bounds for various regimes. The algorithm is particularly effective for long sequences. For sequences of length 1000, our algorithm saves 95\% of memory usage while using only one third more time per iteration than the standard BPTT.

Via

Access Paper or Ask Questions

Convolution by Evolution: Differentiable Pattern Producing Networks

Jun 08, 2016

Chrisantha Fernando, Dylan Banarse, Malcolm Reynolds, Frederic Besse, David Pfau, Max Jaderberg, Marc Lanctot, Daan Wierstra

Figure 1 for Convolution by Evolution: Differentiable Pattern Producing Networks

Figure 2 for Convolution by Evolution: Differentiable Pattern Producing Networks

Figure 3 for Convolution by Evolution: Differentiable Pattern Producing Networks

Figure 4 for Convolution by Evolution: Differentiable Pattern Producing Networks

Abstract:In this work we introduce a differentiable version of the Compositional Pattern Producing Network, called the DPPN. Unlike a standard CPPN, the topology of a DPPN is evolved but the weights are learned. A Lamarckian algorithm, that combines evolution and learning, produces DPPNs to reconstruct an image. Our main result is that DPPNs can be evolved/trained to compress the weights of a denoising autoencoder from 157684 to roughly 200 parameters, while achieving a reconstruction accuracy comparable to a fully connected network with more than two orders of magnitude more parameters. The regularization ability of the DPPN allows it to rediscover (approximate) convolutional network architectures embedded within a fully connected architecture. Such convolutional architectures are the current state of the art for many computer vision applications, so it is satisfying that DPPNs are capable of discovering this structure rather than having to build it in by design. DPPNs exhibit better generalization when tested on the Omniglot dataset after being trained on MNIST, than directly encoded fully connected autoencoders. DPPNs are therefore a new framework for integrating learning and evolution.

Via

Access Paper or Ask Questions

Dueling Network Architectures for Deep Reinforcement Learning

Apr 05, 2016

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas

Figure 1 for Dueling Network Architectures for Deep Reinforcement Learning

Figure 2 for Dueling Network Architectures for Deep Reinforcement Learning

Figure 3 for Dueling Network Architectures for Deep Reinforcement Learning

Figure 4 for Dueling Network Architectures for Deep Reinforcement Learning

Abstract:In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.

* 15 pages, 5 figures, and 5 tables

Via

Access Paper or Ask Questions

Monte Carlo Tree Search with Heuristic Evaluations using Implicit Minimax Backups

Jun 19, 2014

Marc Lanctot, Mark H. M. Winands, Tom Pepels, Nathan R. Sturtevant

Figure 1 for Monte Carlo Tree Search with Heuristic Evaluations using Implicit Minimax Backups

Figure 2 for Monte Carlo Tree Search with Heuristic Evaluations using Implicit Minimax Backups

Figure 3 for Monte Carlo Tree Search with Heuristic Evaluations using Implicit Minimax Backups

Figure 4 for Monte Carlo Tree Search with Heuristic Evaluations using Implicit Minimax Backups

Abstract:Monte Carlo Tree Search (MCTS) has improved the performance of game engines in domains such as Go, Hex, and general game playing. MCTS has been shown to outperform classic alpha-beta search in games where good heuristic evaluations are difficult to obtain. In recent years, combining ideas from traditional minimax search in MCTS has been shown to be advantageous in some domains, such as Lines of Action, Amazons, and Breakthrough. In this paper, we propose a new way to use heuristic evaluations to guide the MCTS search by storing the two sources of information, estimated win rates and heuristic evaluations, separately. Rather than using the heuristic evaluations to replace the playouts, our technique backs them up implicitly during the MCTS simulations. These minimax values are then used to guide future simulations. We show that using implicit minimax backups leads to stronger play performance in Kalah, Breakthrough, and Lines of Action.

* 24 pages, 7 figures, 9 tables, expanded version of paper presented at IEEE Conference on Computational Intelligence and Games (CIG) 2014 conference

Via

Access Paper or Ask Questions