Deheng Ye

Tencent Inc

Honor of Kings Arena: an Environment for Generalization in Competitive Reinforcement Learning

Oct 09, 2022

Hua Wei, Jingxiao Chen, Xiyang Ji, Hongyang Qin, Minwen Deng, Siqin Li, Liang Wang, Weinan Zhang, Yong Yu, Lin Liu(+4 more)

Abstract: This paper introduces Honor of Kings Arena, a reinforcement learning (RL) environment based on Honor of Kings, one of the world's most popular games at present. Compared to other environments studied in most previous work, ours presents new generalization challenges for competitive reinforcement learning. It is a multi-agent problem with one agent competing against its opponent, and it requires generalization ability because it has diverse targets to control and diverse opponents to compete with. We describe the observation, action, and reward specifications for the Honor of Kings domain and provide an open-source Python-based interface for communicating with the game engine. We provide twenty target heroes with a variety of tasks in Honor of Kings Arena and present initial baseline results for RL-based methods with feasible computing resources. Finally, we showcase the generalization challenges imposed by Honor of Kings Arena and possible remedies to the challenges. All of the software, including the environment class, is publicly available at https://github.com/tencent-ailab/hok_env . The documentation is available at https://aiarena.tencent.com/hok/doc/ .

* Accepted by NeurIPS 2022

More Centralized Training, Still Decentralized Execution: Multi-Agent Conditional Policy Factorization

Sep 26, 2022

Jiangxing Wang, Deheng Ye, Zongqing Lu

Abstract: In cooperative multi-agent reinforcement learning (MARL), combining value decomposition with actor-critic enables agents to learn stochastic policies, which are more suitable for partially observable environments. Given the goal of learning local policies that enable decentralized execution, agents are commonly assumed to be independent of each other, even in centralized training. However, such an assumption may prevent agents from learning the optimal joint policy. To address this problem, we explicitly take the dependency among agents into account in centralized training. Although this leads to the optimal joint policy, that policy may not be factorizable for decentralized execution. Nevertheless, we theoretically show that from such a joint policy, we can always derive another joint policy that achieves the same optimality but can be factorized for decentralized execution. To this end, we propose multi-agent conditional policy factorization (MACPF), which takes more centralized training but still enables decentralized execution. We empirically verify MACPF in various cooperative MARL tasks and demonstrate that MACPF achieves better performance or faster convergence than baselines.

* 18 pages

Revisiting Discrete Soft Actor-Critic

Sep 22, 2022

Haibin Zhou, Zichuan Lin, Junyou Li, Deheng Ye, Qiang Fu, Wei Yang

Abstract: We study the adaptation of soft actor-critic (SAC) from continuous action space to discrete action space. We revisit vanilla SAC and provide an in-depth understanding of its Q value underestimation and performance instability issues when applied to discrete settings. We thereby propose entropy-penalty and double average Q-learning with Q-clip to address these issues. Extensive experiments on typical benchmarks with discrete action space, including Atari games and a large-scale MOBA game, show the efficacy of our proposed method. Our code is at: https://github.com/coldsummerday/Revisiting-Discrete-SAC.
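
For readers unfamiliar with the discrete setting, the sketch below shows the soft state value that vanilla discrete SAC computes as an exact expectation over actions, V(s) = sum_a pi(a|s) * (Q(s,a) - alpha * log pi(a|s)). It is only an illustration of the baseline the abstract revisits; it does not implement the entropy-penalty, double average Q-learning, or Q-clip proposed in the paper, and the function names and numbers are assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_state_value(q_values, probs, alpha):
    """Soft value for a discrete policy: expectation over actions of
    Q(s, a) minus alpha times log pi(a|s). Hypothetical helper name."""
    value = 0.0
    for q, p in zip(q_values, probs):
        if p > 0.0:  # treat 0 * log(0) as 0
            value += p * (q - alpha * math.log(p))
    return value


# Illustrative numbers only: two actions, uniform policy.
probs = softmax([1.0, 1.0])
v = soft_state_value([2.0, 4.0], probs, alpha=0.5)
print(v)
```

With a uniform policy over two actions, the entropy term adds alpha * log(2) to the mean Q value, which makes the role of the temperature alpha easy to see.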

Quantized Adaptive Subgradient Algorithms and Their Applications

Aug 11, 2022

Ke Xu, Jianqiao Wangni, Yifan Zhang, Deheng Ye, Jiaxiang Wu, Peilin Zhao

Abstract: Data explosion and growing model sizes drive the remarkable advances in large-scale machine learning, but they also make model training time-consuming and model storage difficult. Distributed model training offers high computation efficiency and fewer device limitations, yet two main difficulties remain. On one hand, the communication cost of exchanging information, e.g., stochastic gradients among different workers, is a key bottleneck for distributed training efficiency. On the other hand, a model with fewer parameters is easier to store and communicate, but risks degrading model performance. To balance the communication costs, model capacity and model performance simultaneously, we propose quantized composite mirror descent adaptive subgradient (QCMD adagrad) and quantized regularized dual average adaptive subgradient (QRDA adagrad) for distributed training. To be specific, we explore the combination of gradient quantization and sparse model to reduce the communication cost per iteration in distributed training. A quantized gradient-based adaptive learning rate matrix is constructed to achieve a balance between communication costs, accuracy, and model sparsity. Moreover, we theoretically find that a large quantization error brings in extra noise, which influences the convergence and sparsity of the model. Therefore, a threshold quantization strategy with a relatively small error is adopted in QCMD adagrad and QRDA adagrad to improve the signal-to-noise ratio and preserve the sparsity of the model. Both theoretical analyses and empirical results demonstrate the efficacy and efficiency of the proposed algorithms.

GPN: A Joint Structural Learning Framework for Graph Neural Networks

May 12, 2022

Qianggang Ding, Deheng Ye, Tingyang Xu, Peilin Zhao

Abstract: Graph neural networks (GNNs) have been applied to a variety of graph tasks. Most existing work on GNNs assumes that the given graph data is optimal, yet graph data used for training inevitably contains missing or incomplete edges, leading to degraded performance. In this paper, we propose Generative Predictive Network (GPN), a GNN-based joint learning framework that simultaneously learns the graph structure and the downstream task. Specifically, we develop a bilevel optimization framework for this joint learning task, in which the upper optimization (generator) and the lower optimization (predictor) are both instantiated with GNNs. To the best of our knowledge, our method is the first GNN-based bilevel optimization framework for resolving this task. In extensive experiments on benchmark datasets, our method outperforms a wide range of baselines.

* 8 pages

MineRL Diamond 2021 Competition: Overview, Results, and Lessons Learned

Feb 17, 2022

Anssi Kanervisto, Stephanie Milani, Karolis Ramanauskas, Nicholay Topin, Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang(+12 more)

Abstract: Reinforcement learning competitions advance the field by providing appropriate scope and support to develop solutions toward a specific problem. To promote the development of more broadly applicable methods, organizers need to enforce the use of general techniques, the use of sample-efficient methods, and the reproducibility of the results. While beneficial for the research community, these restrictions come at a cost: increased difficulty. If the barrier to entry is too high, many potential participants are demoralized. With this in mind, we hosted the third edition of the MineRL ObtainDiamond competition, MineRL Diamond 2021, with a separate track in which we permitted any solution to promote the participation of newcomers. With this track and more extensive tutorials and support, we saw an increased number of submissions. The participants of this easier track were able to obtain a diamond, and the participants of the harder track advanced generalizable solutions to the same task.

* Under review for PMLR volume on NeurIPS 2021 competitions

JueWu-MC: Playing Minecraft with Sample-efficient Hierarchical Reinforcement Learning

Dec 07, 2021

Zichuan Lin, Junyou Li, Jianing Shi, Deheng Ye, Qiang Fu, Wei Yang

Abstract: Learning rational behaviors in open-world games like Minecraft remains challenging for Reinforcement Learning (RL) research due to the compound challenge of partial observability, high-dimensional visual perception and delayed reward. To address this, we propose JueWu-MC, a sample-efficient hierarchical RL approach equipped with representation learning and imitation learning to deal with perception and exploration. Specifically, our approach includes two levels of hierarchy, where the high-level controller learns a policy over options and the low-level workers learn to solve each sub-task. To boost the learning of sub-tasks, we propose a combination of techniques including 1) action-aware representation learning which captures underlying relations between action and representation, 2) discriminator-based self-imitation learning for efficient exploration, and 3) ensemble behavior cloning with consistency filtering for policy robustness. Extensive experiments show that JueWu-MC significantly improves sample efficiency and outperforms a set of baselines by a large margin. Notably, we won the championship of the NeurIPS MineRL 2021 research competition and achieved the highest performance score ever.

* The champion solution of NeurIPS 2021 MineRL research competition ( https://www.aicrowd.com/challenges/neurips-2021-minerl-diamond-competition/leaderboards )

Coordinated Proximal Policy Optimization

Nov 07, 2021

Zifan Wu, Chao Yu, Deheng Ye, Junge Zhang, Haiyin Piao, Hankz Hankui Zhuo

Abstract: We present Coordinated Proximal Policy Optimization (CoPPO), an algorithm that extends the original Proximal Policy Optimization (PPO) to the multi-agent setting. The key idea lies in the coordinated adaptation of step size during the policy update process among multiple agents. We prove the monotonicity of policy improvement when optimizing a theoretically-grounded joint objective, and derive a simplified optimization objective based on a set of approximations. We then show that such an objective in CoPPO can achieve dynamic credit assignment among agents, thereby alleviating the high variance issue during the concurrent update of agent policies. Finally, we demonstrate that CoPPO outperforms several strong baselines and is competitive with the latest multi-agent PPO method (i.e., MAPPO) under typical multi-agent settings, including cooperative matrix games and the StarCraft II micromanagement tasks.
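
As background for the abstract above, the sketch below computes the clipped surrogate objective of the original single-agent PPO, which is the starting point that CoPPO extends. It is not CoPPO's joint objective or its coordinated step-size adaptation; the function name, the clip range of 0.2, and the sample numbers are assumptions chosen for illustration.

```python
def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over samples,
    where r is the probability ratio pi_new(a|s) / pi_old(a|s)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, r))
        total += min(r * a, clipped * a)
    return total / len(ratios)


# Illustrative numbers only: one sample with a large ratio and positive
# advantage, one with a small ratio and negative advantage.
objective = ppo_clipped_objective([1.5, 0.5], [1.0, -1.0])
print(objective)
```

In the first sample the clip caps the gain at 1.2 instead of 1.5; in the second the clipped term (-0.8) is the smaller one and is kept, so the update gets no extra credit for pushing the ratio further below 1 - eps. This per-agent step-size control is what a multi-agent extension has to coordinate.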

Learning Diverse Policies in MOBA Games via Macro-Goals

Oct 27, 2021

Yiming Gao, Bei Shi, Xueying Du, Liang Wang, Guangwei Chen, Zhenjie Lian, Fuhao Qiu, Guoan Han, Weixuan Wang, Deheng Ye(+3 more)

Abstract: Recently, many researchers have made successful progress in building AI systems for MOBA-game-playing with deep reinforcement learning, such as on Dota 2 and Honor of Kings. Even though these AI systems have achieved or even exceeded human-level performance, they still suffer from a lack of policy diversity. In this paper, we propose a novel Macro-Goals Guided framework, called MGG, to learn diverse policies in MOBA games. MGG abstracts strategies as macro-goals from human demonstrations and trains a Meta-Controller to predict these macro-goals. To enhance policy diversity, MGG samples macro-goals from the Meta-Controller prediction and guides the training process towards these goals. Experimental results on the typical MOBA game Honor of Kings demonstrate that MGG can execute diverse policies in different matches and lineups, and also outperforms the state-of-the-art methods over 102 heroes.

* Accepted at NeurIPS 2021

TiKick: Towards Playing Multi-agent Football Full Games from Single-agent Demonstrations

Oct 19, 2021

Shiyu Huang, Wenze Chen, Longfei Zhang, Ziyang Li, Fengming Zhu, Deheng Ye, Ting Chen, Jun Zhu

Abstract: Deep reinforcement learning (DRL) has achieved super-human performance on complex video games (e.g., StarCraft II and Dota 2). However, current DRL systems still suffer from challenges of multi-agent coordination, sparse rewards, stochastic environments, etc. In seeking to address these challenges, we employ a football video game, Google Research Football (GRF), as our testbed and develop an end-to-end learning-based AI system (denoted as TiKick) to complete this challenging task. In this work, we first generated a large replay dataset from the self-play of single-agent experts, which are obtained from league training. We then developed a distributed learning system and new offline algorithms to learn a powerful multi-agent AI from the fixed single-agent dataset. To the best of our knowledge, TiKick is the first learning-based AI system that can take over the multi-agent Google Research Football full game, while previous work could either control a single agent or experiment on toy academic scenarios. Extensive experiments further show that our pre-trained model can accelerate the training process of modern multi-agent algorithms and that our method achieves state-of-the-art performance on various academic scenarios.
