Shimon Whiteson

Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients

May 06, 2021
Bozhidar Vasilev, Tarun Gupta, Bei Peng, Shimon Whiteson

Policy gradient methods are an attractive approach to multi-agent reinforcement learning problems due to their convergence properties and robustness in partially observable scenarios. However, there is a significant performance gap between state-of-the-art policy gradient and value-based methods on the popular StarCraft Multi-Agent Challenge (SMAC) benchmark. In this paper, we introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods. We enhance two state-of-the-art policy gradient algorithms with SOP training, demonstrating significant performance improvements. Furthermore, we show that our methods perform as well as or better than state-of-the-art value-based methods on a variety of SMAC tasks.

* AAMAS Adaptive and Learning Agents Workshop. 20th International Conference on Autonomous Agents and Multiagent Systems

Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning

Mar 22, 2021
Ling Pan, Tabish Rashid, Bei Peng, Longbo Huang, Shimon Whiteson

Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning, but has received comparatively little attention in the multi-agent setting. In this work, we empirically demonstrate that QMIX, a popular $Q$-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from a particularly severe overestimation problem which is not mitigated by existing approaches. We rectify this by designing a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline and demonstrate its effectiveness in stabilizing learning. We additionally propose to employ a softmax operator, which we efficiently approximate in the multi-agent setting, to further reduce the potential overestimation bias. We demonstrate that our Softmax with Regularization (SR) method, when applied to QMIX, accomplishes its goal of avoiding severe overestimation and significantly improves performance in a variety of cooperative multi-agent tasks. To demonstrate the versatility of our method, we apply it to other $Q$-learning based MARL algorithms and achieve similar performance gains. Finally, we show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
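
To illustrate why a softmax operator can reduce overestimation relative to a hard max, the sketch below implements the standard Boltzmann softmax value over a list of action-values. It is only a single-agent illustration: the function name `softmax_operator` and the inverse temperature `beta` are assumptions made here, and the abstract's efficient multi-agent approximation and regularization term are not reproduced.

```python
import math


def softmax_operator(q_values, beta):
    """Boltzmann softmax value: sum_a softmax(beta * q)_a * q_a.

    beta = 0 gives the mean of q_values; a large beta approaches max(q_values).
    The result never exceeds max(q_values), since it is a weighted average.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, q_values)) / total


q = [1.0, 2.0, 3.0]
mean_like = softmax_operator(q, beta=0.0)   # uniform weights
max_like = softmax_operator(q, beta=50.0)   # nearly all weight on the best action
```

Because the softmax value is a weighted average of the action-values, it is bounded above by the max, which is the intuition behind using it as a bootstrap target when the max is suspected of being biased upward.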

Snowflake: Scaling GNNs to High-Dimensional Continuous Control via Parameter Freezing

Mar 01, 2021
Charlie Blake, Vitaly Kurin, Maximilian Igl, Shimon Whiteson

Recent research has shown that Graph Neural Networks (GNNs) can learn policies for locomotion control that are as effective as a typical multi-layer perceptron (MLP), with superior transfer and multi-task performance (Wang et al., 2018; Huang et al., 2020). Results have so far been limited to training on small agents, with the performance of GNNs deteriorating rapidly as the number of sensors and actuators grows. A key motivation for the use of GNNs in the supervised learning setting is their applicability to large graphs, but this benefit has not yet been realised for locomotion control. We identify the weakness in a common GNN architecture that causes this poor scaling: overfitting in the MLPs within the network that encode, decode, and propagate messages. To combat this, we introduce Snowflake, a GNN training method for high-dimensional continuous control that freezes parameters in parts of the network that suffer from overfitting. Snowflake significantly boosts the performance of GNNs for locomotion control on large agents, so that they now match the performance of MLPs while retaining superior transfer properties.

* 13 pages, 12 figures, submitted to ICML 2021
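
The idea of freezing parameters in selected parts of a network, as described in the Snowflake abstract above, can be sketched as a gradient step that skips named parameter groups. This is a minimal sketch under stated assumptions: the group names (`encoder`, `message`, `decoder`, `policy_head`), the plain SGD rule, and the function `sgd_step_with_freezing` are all hypothetical and are not taken from the paper.

```python
def sgd_step_with_freezing(params, grads, lr, frozen):
    """Apply one SGD step, leaving every group named in `frozen` unchanged.

    params, grads: dicts mapping a group name to a list of floats.
    frozen: set of group names whose parameters must not be updated.
    """
    updated = {}
    for name, values in params.items():
        if name in frozen:
            # Frozen group: copy the parameters through untouched.
            updated[name] = list(values)
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


# Hypothetical parameter groups for a message-passing policy network.
params = {"encoder": [1.0], "message": [1.0], "decoder": [1.0], "policy_head": [1.0]}
grads = {name: [1.0] for name in params}
new_params = sgd_step_with_freezing(
    params, grads, lr=0.5, frozen={"encoder", "message", "decoder"}
)
```

In this toy run only `policy_head` moves; which groups are actually frozen, and when, is a design decision that the abstract does not specify.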

Breaking the Deadly Triad with a Target Network

Feb 09, 2021
Shangtong Zhang, Hengshuai Yao, Shimon Whiteson

The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously. In this paper, we investigate the target network as a tool for breaking the deadly triad, providing theoretical support for the conventional wisdom that a target network stabilizes training. We first propose and analyze a novel target network update rule which augments the commonly used Polyak-averaging style update with two projections. We then apply the target network and ridge regularization in several divergent algorithms and show their convergence to regularized TD fixed points. Those algorithms are off-policy with linear function approximation and bootstrapping, spanning both policy evaluation and control, as well as both discounted and average-reward settings. In particular, we provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
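
For readers unfamiliar with the Polyak-averaging style update mentioned in the abstract above, a minimal sketch follows. The mixing rate `tau` and the function names are assumptions made here. The optional projection onto an L2 ball is only an illustrative stand-in for a projection step; the abstract does not specify the two projections used in the paper, so this should not be read as the authors' update rule.

```python
import math


def project_to_ball(vec, radius):
    """Project vec onto the L2 ball of the given radius (illustrative only)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= radius:
        return list(vec)
    scale = radius / norm
    return [v * scale for v in vec]


def polyak_update(target, online, tau, radius=None):
    """Move target weights a fraction tau of the way toward the online weights.

    target <- (1 - tau) * target + tau * online, optionally followed by a
    projection onto an L2 ball (a hypothetical choice, see the lead-in).
    """
    mixed = [(1.0 - tau) * t + tau * w for t, w in zip(target, online)]
    return mixed if radius is None else project_to_ball(mixed, radius)


slow_copy = polyak_update([0.0, 0.0], [1.0, 2.0], tau=0.5)
bounded_copy = polyak_update([0.0, 0.0], [3.0, 4.0], tau=1.0, radius=1.0)
```

A small `tau` makes the target weights change slowly, so the bootstrap target stays nearly fixed between updates; that is the conventional stabilizing effect the paper sets out to justify theoretically.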

Deep Interactive Bayesian Reinforcement Learning via Meta-Learning

Jan 11, 2021
Luisa Zintgraf, Sam Devlin, Kamil Ciosek, Shimon Whiteson, Katja Hofmann

Agents that interact with other agents often do not know a priori what the other agents' strategies are, but have to maximise their own online return while interacting with and learning about others. The optimal adaptive behaviour under uncertainty over the other agents' strategies w.r.t. some prior can in principle be computed using the Interactive Bayesian Reinforcement Learning framework. Unfortunately, doing so is intractable in most settings, and existing approximation methods are restricted to small tasks. To overcome this, we propose to meta-learn approximate belief inference and Bayes-optimal behaviour for a given prior. To model beliefs over other agents, we combine sequential and hierarchical Variational Auto-Encoders, and meta-train this inference model alongside the policy. We show empirically that our approach outperforms existing methods that use a model-free approach, sample from the approximate posterior, maintain memory-free models of others, or do not fully utilise the known structure of the environment.

Average-Reward Off-Policy Policy Evaluation with Function Approximation

Jan 08, 2021
Shangtong Zhang, Yi Wan, Richard S. Sutton, Shimon Whiteson

We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.
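
The two quantities named in the abstract above, the reward rate and the differential value function, are tied together by the differential TD error from the average-reward setting. The sketch below computes that error with a linear value function. It shows only the textbook definition, not the paper's algorithms, and the names `differential_td_error`, `weights`, `phi_s`, and `phi_next` are assumptions made here.

```python
def linear_value(weights, features):
    """Linear value estimate: the dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def differential_td_error(reward, reward_rate, weights, phi_s, phi_next):
    """Differential TD error: r - r_bar + v(s') - v(s).

    There is no discount factor; the reward-rate estimate r_bar is
    subtracted from each reward instead.
    """
    v_s = linear_value(weights, phi_s)
    v_next = linear_value(weights, phi_next)
    return reward - reward_rate + v_next - v_s


delta = differential_td_error(
    reward=1.0,
    reward_rate=0.5,
    weights=[1.0, 2.0],
    phi_s=[1.0, 0.0],
    phi_next=[0.0, 1.0],
)
# delta = 1.0 - 0.5 + 2.0 - 1.0 = 1.5
```

Because the target `v(s')` is itself an estimate, this error bootstraps, which is why the off-policy, function-approximation version of the problem falls under the deadly triad.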

Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?

Nov 18, 2020
Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip H. S. Torr, Mingfei Sun, Shimon Whiteson

Most recently developed approaches to cooperative multi-agent reinforcement learning in the centralized training with decentralized execution setting involve estimating a centralized, joint value function. In this paper, we demonstrate that, despite its various theoretical shortcomings, Independent PPO (IPPO), a form of independent learning in which each agent simply estimates its local value function, can perform just as well as or better than state-of-the-art joint learning approaches on the popular multi-agent benchmark suite SMAC with little hyperparameter tuning. We also compare IPPO to several variants; the results suggest that IPPO's strong performance may be due to its robustness to some forms of environment non-stationarity.
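
As a rough illustration of independent learning with PPO, the sketch below evaluates the standard PPO clipped surrogate separately for each agent, using only that agent's own probability ratios and locally estimated advantages. The data layout, the function names, and the default `clip_eps=0.2` are assumptions made here; the abstract does not give IPPO's exact configuration.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample.

    ratio is pi_new(a|o) / pi_old(a|o); advantage is the agent's own estimate.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def independent_objectives(per_agent_samples, clip_eps=0.2):
    """Average clipped surrogate per agent, with no joint value function.

    per_agent_samples maps an agent id to a list of (ratio, advantage) pairs
    computed from that agent's local observations only.
    """
    return {
        agent: sum(clipped_surrogate(r, a, clip_eps) for r, a in samples) / len(samples)
        for agent, samples in per_agent_samples.items()
    }


objectives = independent_objectives({
    "agent_0": [(1.5, 1.0), (0.5, -1.0)],
    "agent_1": [(1.0, 2.0)],
})
```

Each agent would maximize its own entry in `objectives`; no term couples the agents, which is what distinguishes this setup from approaches that estimate a centralized, joint value function.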
