Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wulong Liu

Cooperative Multi-Agent Transfer Learning with Level-Adaptive Credit Assignment

Jun 03, 2021

Tianze Zhou, Fubiao Zhang, Kun Shao, Kai Li, Wenhan Huang, Jun Luo, Weixun Wang, Yaodong Yang, Hangyu Mao, Bin Wang(+3 more)

Figure 1 for Cooperative Multi-Agent Transfer Learning with Level-Adaptive Credit Assignment

Figure 2 for Cooperative Multi-Agent Transfer Learning with Level-Adaptive Credit Assignment

Figure 3 for Cooperative Multi-Agent Transfer Learning with Level-Adaptive Credit Assignment

Figure 4 for Cooperative Multi-Agent Transfer Learning with Level-Adaptive Credit Assignment

Abstract:Extending transfer learning to cooperative multi-agent reinforcement learning (MARL) has recently received much attention. In contrast to the single-agent setting, the coordination indispensable in cooperative MARL constrains each agent's policy. However, existing transfer methods focus exclusively on agent policy and ignores coordination knowledge. We propose a new architecture that realizes robust coordination knowledge transfer through appropriate decomposition of the overall coordination into several coordination patterns. We use a novel mixing network named level-adaptive QTransformer (LA-QTransformer) to realize agent coordination that considers credit assignment, with appropriate coordination patterns for different agents realized by a novel level-adaptive Transformer (LA-Transformer) dedicated to the transfer of coordination knowledge. In addition, we use a novel agent network named Population Invariant agent with Transformer (PIT) to realize the coordination transfer in more varieties of scenarios. Extensive experiments in StarCraft II micro-management show that LA-QTransformer together with PIT achieves superior performance compared with state-of-the-art baselines.

* 12 pages, 9 figures

Via

Access Paper or Ask Questions

Learning Symbolic Rules for Interpretable Deep Reinforcement Learning

Mar 16, 2021

Zhihao Ma, Yuzheng Zhuang, Paul Weng, Hankz Hankui Zhuo, Dong Li, Wulong Liu, Jianye Hao

Figure 1 for Learning Symbolic Rules for Interpretable Deep Reinforcement Learning

Figure 2 for Learning Symbolic Rules for Interpretable Deep Reinforcement Learning

Figure 3 for Learning Symbolic Rules for Interpretable Deep Reinforcement Learning

Figure 4 for Learning Symbolic Rules for Interpretable Deep Reinforcement Learning

Abstract:Recent progress in deep reinforcement learning (DRL) can be largely attributed to the use of neural networks. However, this black-box approach fails to explain the learned policy in a human understandable way. To address this challenge and improve the transparency, we propose a Neural Symbolic Reinforcement Learning framework by introducing symbolic logic into DRL. This framework features a fertilization of reasoning and learning modules, enabling end-to-end learning with prior symbolic knowledge. Moreover, interpretability is achieved by extracting the logical rules learned by the reasoning module in a symbolic rule space. The experimental results show that our framework has better interpretability, along with competing performance in comparison to state-of-the-art approaches.

Via

Access Paper or Ask Questions

Addressing Action Oscillations through Learning Policy Inertia

Mar 03, 2021

Chen Chen, Hongyao Tang, Jianye Hao, Wulong Liu, Zhaopeng Meng

Figure 1 for Addressing Action Oscillations through Learning Policy Inertia

Figure 2 for Addressing Action Oscillations through Learning Policy Inertia

Figure 3 for Addressing Action Oscillations through Learning Policy Inertia

Figure 4 for Addressing Action Oscillations through Learning Policy Inertia

Abstract:Deep reinforcement learning (DRL) algorithms have been demonstrated to be effective in a wide range of challenging decision making and control tasks. However, these methods typically suffer from severe action oscillations in particular in discrete action setting, which means that agents select different actions within consecutive steps even though states only slightly differ. This issue is often neglected since the policy is usually evaluated by its cumulative rewards only. Action oscillation strongly affects the user experience and can even cause serious potential security menace especially in real-world domains with the main concern of safety, such as autonomous driving. To this end, we introduce Policy Inertia Controller (PIC) which serves as a generic plug-in framework to off-the-shelf DRL algorithms, to enables adaptive trade-off between the optimality and smoothness of the learned policy in a formal way. We propose Nested Policy Iteration as a general training algorithm for PIC-augmented policy which ensures monotonically non-decreasing updates under some mild conditions. Further, we derive a practical DRL algorithm, namely Nested Soft Actor-Critic. Experiments on a collection of autonomous driving tasks and several Atari games suggest that our approach demonstrates substantial oscillation reduction in comparison to a range of commonly adopted baselines with almost no performance degradation.

* Accepted paper on AAAI 2021

Via

Access Paper or Ask Questions

Foresee then Evaluate: Decomposing Value Estimation with Latent Future Prediction

Mar 03, 2021

Hongyao Tang, Jianye Hao, Guangyong Chen, Pengfei Chen, Chen Chen, Yaodong Yang, Luo Zhang, Wulong Liu, Zhaopeng Meng

Figure 1 for Foresee then Evaluate: Decomposing Value Estimation with Latent Future Prediction

Figure 2 for Foresee then Evaluate: Decomposing Value Estimation with Latent Future Prediction

Figure 3 for Foresee then Evaluate: Decomposing Value Estimation with Latent Future Prediction

Figure 4 for Foresee then Evaluate: Decomposing Value Estimation with Latent Future Prediction

Abstract:Value function is the central notion of Reinforcement Learning (RL). Value estimation, especially with function approximation, can be challenging since it involves the stochasticity of environmental dynamics and reward signals that can be sparse and delayed in some cases. A typical model-free RL algorithm usually estimates the values of a policy by Temporal Difference (TD) or Monte Carlo (MC) algorithms directly from rewards, without explicitly taking dynamics into consideration. In this paper, we propose Value Decomposition with Future Prediction (VDFP), providing an explicit two-step understanding of the value estimation process: 1) first foresee the latent future, 2) and then evaluate it. We analytically decompose the value function into a latent future dynamics part and a policy-independent trajectory return part, inducing a way to model latent dynamics and returns separately in value estimation. Further, we derive a practical deep RL algorithm, consisting of a convolutional model to learn compact trajectory representation from past experiences, a conditional variational auto-encoder to predict the latent future dynamics and a convex return model that evaluates trajectory representation. In experiments, we empirically demonstrate the effectiveness of our approach for both off-policy and on-policy RL in several OpenAI Gym continuous control tasks as well as a few challenging variants with delayed reward.

* Accepted paper on AAAI 2021. arXiv admin note: text overlap with arXiv:1905.11100

Via

Access Paper or Ask Questions

SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving

Nov 01, 2020

Ming Zhou, Jun Luo, Julian Villella, Yaodong Yang, David Rusu, Jiayu Miao, Weinan Zhang, Montgomery Alban, Iman Fadakar, Zheng Chen(+27 more)

Figure 1 for SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving

Figure 2 for SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving

Figure 3 for SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving

Figure 4 for SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving

Abstract:Multi-agent interaction is a fundamental aspect of autonomous driving in the real world. Despite more than a decade of research and development, the problem of how to competently interact with diverse road users in diverse scenarios remains largely unsolved. Learning methods have much to offer towards solving this problem. But they require a realistic multi-agent simulator that generates diverse and competent driving interactions. To meet this need, we develop a dedicated simulation platform called SMARTS (Scalable Multi-Agent RL Training School). SMARTS supports the training, accumulation, and use of diverse behavior models of road users. These are in turn used to create increasingly more realistic and diverse interactions that enable deeper and broader research on multi-agent interaction. In this paper, we describe the design goals of SMARTS, explain its basic architecture and its key features, and illustrate its use through concrete multi-agent experiments on interactive scenarios. We open-source the SMARTS platform and the associated benchmark tasks and evaluation metrics to encourage and empower research on multi-agent learning for autonomous driving. Our code is available at https://github.com/huawei-noah/SMARTS.

* 20 pages, 11 figures. Paper accepted to CoRL 2020

Via

Access Paper or Ask Questions

What About Taking Policy as Input of Value Function: Policy-extended Value Function Approximator

Oct 19, 2020

Hongyao Tang, Zhaopeng Meng, Jianye HAO, Chen Chen, Daniel Graves, Dong Li, Wulong Liu, Yaodong Yang

Figure 1 for What About Taking Policy as Input of Value Function: Policy-extended Value Function Approximator

Figure 2 for What About Taking Policy as Input of Value Function: Policy-extended Value Function Approximator

Figure 3 for What About Taking Policy as Input of Value Function: Policy-extended Value Function Approximator

Figure 4 for What About Taking Policy as Input of Value Function: Policy-extended Value Function Approximator

Abstract:The value function lies in the heart of Reinforcement Learning (RL), which defines the long-term evaluation of a policy in a given state. In this paper, we propose Policy-extended Value Function Approximator (PeVFA) which extends the conventional value to be not only a function of state but also an explicit policy representation. Such an extension enables PeVFA to preserve values of multiple policies in contrast to a conventional one with limited capacity for only one policy, inducing the new characteristic of \emph{value generalization among policies}. From both the theoretical and empirical lens, we study value generalization along the policy improvement path (called local generalization), from which we derive a new form of Generalized Policy Iteration with PeVFA to improve the conventional learning process. Besides, we propose a framework to learn the representation of an RL policy, studying several different approaches to learn an effective policy representation from policy network parameters and state-action pairs through contrastive learning and action prediction. In our experiments, Proximal Policy Optimization (PPO) with PeVFA significantly outperforms its vanilla counterpart in MuJoCo continuous control tasks, demonstrating the effectiveness of value generalization offered by PeVFA and policy representation learning.

* Preprint version

Via

Access Paper or Ask Questions

Towards Effective Context for Meta-Reinforcement Learning: an Approach based on Contrastive Learning

Oct 07, 2020

Haotian Fu, Hongyao Tang, Jianye Hao, Chen Chen, Xidong Feng, Dong Li, Wulong Liu

Figure 1 for Towards Effective Context for Meta-Reinforcement Learning: an Approach based on Contrastive Learning

Figure 2 for Towards Effective Context for Meta-Reinforcement Learning: an Approach based on Contrastive Learning

Figure 3 for Towards Effective Context for Meta-Reinforcement Learning: an Approach based on Contrastive Learning

Figure 4 for Towards Effective Context for Meta-Reinforcement Learning: an Approach based on Contrastive Learning

Abstract:Context, the embedding of previous collected trajectories, is a powerful construct for Meta-Reinforcement Learning (Meta-RL) algorithms. By conditioning on an effective context, Meta-RL policies can easily generalize to new tasks within a few adaptation steps. We argue that improving the quality of context involves answering two questions: 1. How to train a compact and sufficient encoder that can embed the task-specific information contained in prior trajectories? 2. How to collect informative trajectories of which the corresponding context reflects the specification of tasks? To this end, we propose a novel Meta-RL framework called CCM (Contrastive learning augmented Context-based Meta-RL). We first focus on the contrastive nature behind different tasks and leverage it to train a compact and sufficient context encoder. Further, we train a separate exploration policy and theoretically derive a new information-gain-based objective which aims to collect informative trajectories in a few steps. Empirically, we evaluate our approaches on common benchmarks as well as several complex sparse-reward environments. The experimental results show that CCM outperforms state-of-the-art algorithms by addressing previously mentioned problems respectively.

Via

Access Paper or Ask Questions

Triple-GAIL: A Multi-Modal Imitation Learning Framework with Generative Adversarial Nets

May 22, 2020

Cong Fei, Bin Wang, Yuzheng Zhuang, Zongzhang Zhang, Jianye Hao, Hongbo Zhang, Xuewu Ji, Wulong Liu

Figure 1 for Triple-GAIL: A Multi-Modal Imitation Learning Framework with Generative Adversarial Nets

Figure 2 for Triple-GAIL: A Multi-Modal Imitation Learning Framework with Generative Adversarial Nets

Figure 3 for Triple-GAIL: A Multi-Modal Imitation Learning Framework with Generative Adversarial Nets

Figure 4 for Triple-GAIL: A Multi-Modal Imitation Learning Framework with Generative Adversarial Nets

Abstract:Generative adversarial imitation learning (GAIL) has shown promising results by taking advantage of generative adversarial nets, especially in the field of robot learning. However, the requirement of isolated single modal demonstrations limits the scalability of the approach to real world scenarios such as autonomous vehicles' demand for a proper understanding of human drivers' behavior. In this paper, we propose a novel multi-modal GAIL framework, named Triple-GAIL, that is able to learn skill selection and imitation jointly from both expert demonstrations and continuously generated experiences with data augmentation purpose by introducing an auxiliary skill selector. We provide theoretical guarantees on the convergence to optima for both of the generator and the selector respectively. Experiments on real driver trajectories and real-time strategy game datasets demonstrate that Triple-GAIL can better fit multi-modal behaviors close to the demonstrators and outperforms state-of-the-art methods.

* 7 papges, 3 figures

Via

Access Paper or Ask Questions

Multi-Agent Interactions Modeling with Correlated Policies

Jan 20, 2020

Minghuan Liu, Ming Zhou, Weinan Zhang, Yuzheng Zhuang, Jun Wang, Wulong Liu, Yong Yu

Figure 1 for Multi-Agent Interactions Modeling with Correlated Policies

Figure 2 for Multi-Agent Interactions Modeling with Correlated Policies

Figure 3 for Multi-Agent Interactions Modeling with Correlated Policies

Figure 4 for Multi-Agent Interactions Modeling with Correlated Policies

Abstract:In multi-agent systems, complex interacting behaviors arise due to the high correlations among agents. However, previous work on modeling multi-agent interactions from demonstrations is primarily constrained by assuming the independence among policies and their reward structures. In this paper, we cast the multi-agent interactions modeling problem into a multi-agent imitation learning framework with explicit modeling of correlated policies by approximating opponents' policies, which can recover agents' policies that can regenerate similar interactions. Consequently, we develop a Decentralized Adversarial Imitation Learning algorithm with Correlated policies (CoDAIL), which allows for decentralized training and execution. Various experiments demonstrate that CoDAIL can better regenerate complex interactions close to the demonstrators and outperforms state-of-the-art multi-agent imitation learning methods. Our code is available at \url{https://github.com/apexrl/CoDAIL}.

* 20 pages (10 pages of supplementary), 5 figures, Accepted by The Eighth International Conference on Learning Representations (ICLR 2020)

Via

Access Paper or Ask Questions

Neighborhood Cognition Consistent Multi-Agent Reinforcement Learning

Dec 03, 2019

Hangyu Mao, Wulong Liu, Jianye Hao, Jun Luo, Dong Li, Zhengchao Zhang, Jun Wang, Zhen Xiao

Figure 1 for Neighborhood Cognition Consistent Multi-Agent Reinforcement Learning

Figure 2 for Neighborhood Cognition Consistent Multi-Agent Reinforcement Learning

Figure 3 for Neighborhood Cognition Consistent Multi-Agent Reinforcement Learning

Figure 4 for Neighborhood Cognition Consistent Multi-Agent Reinforcement Learning

Abstract:Social psychology and real experiences show that cognitive consistency plays an important role to keep human society in order: if people have a more consistent cognition about their environments, they are more likely to achieve better cooperation. Meanwhile, only cognitive consistency within a neighborhood matters because humans only interact directly with their neighbors. Inspired by these observations, we take the first step to introduce \emph{neighborhood cognitive consistency} (NCC) into multi-agent reinforcement learning (MARL). Our NCC design is quite general and can be easily combined with existing MARL methods. As examples, we propose neighborhood cognition consistent deep Q-learning and Actor-Critic to facilitate large-scale multi-agent cooperations. Extensive experiments on several challenging tasks (i.e., packet routing, wifi configuration, and Google football player control) justify the superior performance of our methods compared with state-of-the-art MARL approaches.

* accepted as a regular paper with oral presentation @ AAAI20

Via

Access Paper or Ask Questions