Ying Wen

Aligning Individual and Collective Objectives in Multi-Agent Cooperation

Feb 19, 2024

Yang Li, Wenhao Zhang, Jianhong Wang, Shao Zhang, Yali Du, Ying Wen, Wei Pan

Abstract: In the field of multi-agent learning, the challenge of mixed-motive cooperation is pronounced, given the inherent contradictions between individual and collective goals. Current research in this domain primarily focuses on incorporating domain knowledge into rewards or introducing additional mechanisms to foster cooperation. However, many of these methods suffer from manual design costs and the lack of a theoretically grounded procedure for converging to the solution. To address this gap, we approach the mixed-motive game by modeling it as a differentiable game to study learning dynamics. We introduce a novel optimization method named Altruistic Gradient Adjustment (AgA) that employs gradient adjustments to align individual and collective objectives. Furthermore, we provide a theoretical proof that the selection of an appropriate alignment weight in AgA can accelerate convergence towards the desired solutions while effectively avoiding the undesired ones. The visualization of learning dynamics effectively demonstrates that AgA successfully achieves alignment between individual and collective objectives. Additionally, through evaluations conducted on established mixed-motive benchmarks such as the public good game, Cleanup, Harvest, and our modified mixed-motive SMAC environment, we validate AgA's capability to facilitate altruistic and fair collaboration.

* 15 pages

Natural Language Reinforcement Learning

Feb 14, 2024

Xidong Feng, Ziyu Wan, Mengyue Yang, Ziyan Wang, Girish A. Koushik, Yali Du, Ying Wen, Jun Wang

Abstract: Reinforcement Learning (RL) has shown remarkable abilities in learning policies for decision-making tasks. However, RL is often hindered by issues such as low sample efficiency, lack of interpretability, and sparse supervision signals. To tackle these limitations, we take inspiration from the human learning process and introduce Natural Language Reinforcement Learning (NLRL), which innovatively combines RL principles with natural language representation. Specifically, NLRL redefines RL concepts like task objectives, policy, value function, Bellman equation, and policy iteration in natural language space. We present how NLRL can be practically implemented with the latest advancements in large language models (LLMs) like GPT-4. Initial experiments over tabular MDPs demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework.

* Work in Progress

Entropy-Regularized Token-Level Policy Optimization for Large Language Models

Feb 09, 2024

Muning Wen, Cheng Deng, Jun Wang, Weinan Zhang, Ying Wen

Abstract: Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. Traditional approaches often depend on meticulously designed prompts, high-quality examples, or additional reward models for in-context learning, supervised fine-tuning, or RLHF. Reinforcement learning (RL) presents a dynamic alternative for LLMs to overcome these dependencies by engaging directly with task-specific environments. Nonetheless, it faces significant hurdles: 1) instability stemming from the exponentially vast action space requiring exploration; 2) challenges in assigning token-level credit based on action-level reward signals, resulting in discord between maximizing rewards and accurately modeling corpus data. In response to these challenges, we introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. At the heart of ETPO is our novel per-token soft Bellman update, designed to harmonize the RL process with the principles of language modeling. This methodology decomposes the Q-function update from a coarse action-level view to a more granular token-level perspective, backed by theoretical proof of optimization consistency. Crucially, this decomposition yields linear time complexity in action exploration. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks; results show that ETPO achieves effective performance improvement on the CodeLlama-7B model and surpasses a variant PPO baseline inherited from RLHF. This underlines ETPO's potential as a robust method for refining the interactive decision-making capabilities of LLMs.
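
For readers unfamiliar with the term, the snippet below sketches the standard entropy-regularized ("soft") Bellman backup from the soft Q-learning literature, which the phrase "soft Bellman update" usually refers to. It is not ETPO's per-token update, whose exact form the abstract does not give. The function names, the temperature `alpha`, and the discount `gamma` are all illustrative assumptions.

```python
import math

def soft_value(q_values, alpha):
    """Soft state value: alpha * log(sum(exp(Q / alpha))), computed stably."""
    m = max(q_values)  # subtract the max to avoid overflow in exp
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_values))

def soft_policy(q_values, alpha):
    """Boltzmann policy implied by the soft value: pi(a) = exp((Q(a) - V) / alpha)."""
    v = soft_value(q_values, alpha)
    return [math.exp((q - v) / alpha) for q in q_values]

def soft_bellman_target(reward, next_q_values, alpha, gamma=0.99, done=False):
    """One-step soft backup target: r + gamma * V_soft(s')."""
    if done:
        return reward
    return reward + gamma * soft_value(next_q_values, alpha)

# Hypothetical Q-values for three candidate tokens at the next step.
next_q = [0.5, 1.5, -2.0]
print(soft_policy(next_q, alpha=0.7))
print(soft_bellman_target(1.0, next_q, alpha=0.7))
```

As `alpha` shrinks, the soft value approaches the ordinary maximum over Q-values, so the backup reduces to the usual Q-learning target.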

Adaptive Control Strategy for Quadruped Robots in Actuator Degradation Scenarios

Dec 29, 2023

Xinyuan Wu, Wentao Dong, Hang Lai, Yong Yu, Ying Wen

Abstract: Quadruped robots have strong adaptability to extreme environments but may also experience faults. Once these faults occur, robots must be repaired before returning to the task, reducing their practical feasibility. One prevalent concern among these faults is actuator degradation, stemming from factors like device aging or unexpected operational events. Traditionally, addressing this problem has relied heavily on intricate fault-tolerant design, which demands deep domain expertise from developers and lacks generalizability. Learning-based approaches offer effective ways to mitigate these limitations, but a research gap exists in effectively deploying such methods on real-world quadruped robots. This paper introduces a pioneering teacher-student framework rooted in reinforcement learning, named Actuator Degradation Adaptation Transformer (ADAPT), aimed at addressing this research gap. This framework produces a unified control strategy, enabling the robot to sustain its locomotion and perform tasks despite sudden joint actuator faults, relying exclusively on its internal sensors. Empirical evaluations on the Unitree A1 platform validate the deployability and effectiveness of ADAPT on real-world quadruped robots, and affirm the robustness and practicality of our approach.

* 13 pages, 14 figures, in Proceedings of DAI'23

Critic-Guided Decision Transformer for Offline Reinforcement Learning

Dec 21, 2023

Yuanfu Wang, Chao Yang, Ying Wen, Yu Liu, Yu Qiao

Abstract: Recent advancements in offline reinforcement learning (RL) have underscored the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm that learns the action distribution based on target returns for each state in a supervised manner. However, prevailing RCSL methods largely focus on deterministic trajectory modeling, disregarding stochastic state transitions and the diversity of future trajectory distributions. A fundamental challenge arises from the inconsistency between the sampled returns within individual trajectories and the expected returns across multiple trajectories. Fortunately, value-based methods offer a solution by leveraging a value function to approximate the expected returns, thereby addressing the inconsistency effectively. Building upon these insights, we propose a novel approach, termed the Critic-Guided Decision Transformer (CGDT), which combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer. By incorporating a learned value function, known as the critic, CGDT ensures a direct alignment between the specified target returns and the expected returns of actions. This integration bridges the gap between the deterministic nature of RCSL and the probabilistic characteristics of value-based methods. Empirical evaluations on stochastic environments and D4RL benchmark datasets demonstrate the superiority of CGDT over traditional RCSL methods. These results highlight the potential of CGDT to advance the state of the art in offline RL and extend the applicability of RCSL to a wide range of RL tasks.
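
The inconsistency between sampled and expected returns mentioned in the abstract can be seen in a toy example. The sketch below computes the return-to-go labels that return-conditioned methods typically train on and compares them with the average return for the same action. The one-step task and its rewards are hypothetical and chosen only for illustration. This is not CGDT itself.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at each step: R_t = r_t + gamma * R_{t+1}."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        rtg.append(running)
    rtg.reverse()
    return rtg

# Hypothetical stochastic task: the same state and action yield a reward
# of 10 or 0 with equal probability, so two logged trajectories differ.
sampled_trajectories = [[10.0], [0.0]]

# Each trajectory labels the same action with its own sampled return.
sampled_returns = [returns_to_go(t)[0] for t in sampled_trajectories]

# A value function would instead estimate the expectation over outcomes.
expected_return = sum(sampled_returns) / len(sampled_returns)

print(sampled_returns)   # [10.0, 0.0]
print(expected_return)   # 5.0
```

Conditioning on a target return of 10 here would select an action whose expected return is only 5, which is the gap a learned critic is meant to close.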

* Accepted at AAAI 2024

Controlling Large Language Model-based Agents for Large-Scale Decision-Making: An Actor-Critic Approach

Nov 23, 2023

Bin Zhang, Hangyu Mao, Jingqing Ruan, Ying Wen, Yang Li, Shao Zhang, Zhiwei Xu, Dapeng Li, Ziyue Li, Rui Zhao(+2 more)

Abstract: The significant advancements in large language models (LLMs) have presented novel opportunities for tackling planning and decision-making within multi-agent systems. However, as the number of agents increases, the issues of hallucination in LLMs and coordination in multi-agent systems (MAS) have become increasingly pronounced. Additionally, the efficient utilization of tokens becomes a critical consideration when employing LLMs to facilitate the interactions of large numbers of agents. In this paper, we present a novel framework aimed at enhancing coordination and decision-making capabilities of LLMs within large-scale multi-agent environments. Our approach draws inspiration from the actor-critic framework employed in multi-agent reinforcement learning, and we develop a modular and token-efficient solution that effectively addresses challenges presented by LLMs and MAS. Through evaluations conducted in experiments involving system resource allocation and robot grid transportation, we demonstrate the considerable advantages afforded by our proposed approach.

* 11 pages, 8 figures

Quantifying Zero-shot Coordination Capability with Behavior Preferring Partners

Oct 08, 2023

Xihuai Wang, Shao Zhang, Wenhao Zhang, Wentao Dong, Jingxiao Chen, Ying Wen, Weinan Zhang

Abstract: Zero-shot coordination (ZSC) is a new challenge focusing on generalizing learned coordination skills to unseen partners. Existing methods train the ego agent with partners from pre-trained or evolving populations. The agent's ZSC capability is typically evaluated with a few evaluation partners, including humans and agents, and reported by mean returns. Current evaluation methods for ZSC capability still need to improve in constructing diverse evaluation partners and comprehensively measuring the ZSC capability. We aim to create a reliable, comprehensive, and efficient evaluation method for ZSC capability. We formally define the ideal 'diversity-complete' evaluation partners and propose the best response (BR) diversity, which is the population diversity of the BRs to the partners, to approximate the ideal evaluation partners. We propose an evaluation workflow including the construction of 'diversity-complete' evaluation partners and a multi-dimensional metric, the Best Response Proximity (BR-Prox) metric. BR-Prox quantifies the ZSC capability as the performance similarity to each evaluation partner's approximate best response, demonstrating generalization capability and improvement potential. We re-evaluate strong ZSC methods in the Overcooked environment using the proposed evaluation workflow. Surprisingly, the results in some of the most used layouts fail to distinguish the performance of different ZSC methods. Moreover, the evaluated ZSC methods must produce more diverse and high-performing training partners. Our proposed evaluation workflow calls for a change in how we efficiently evaluate ZSC methods as a supplement to human evaluation.

GEAR: A GPU-Centric Experience Replay System for Large Reinforcement Learning Models

Oct 08, 2023

Hanjing Wang, Man-Kit Sit, Congjie He, Ying Wen, Weinan Zhang, Jun Wang, Yaodong Yang, Luo Mai

Abstract: This paper introduces a distributed, GPU-centric experience replay system, GEAR, designed to perform scalable reinforcement learning (RL) with large sequence models (such as transformers). With such models, existing systems such as Reverb face considerable bottlenecks in memory, computation, and communication. GEAR, however, optimizes memory efficiency by enabling the memory resources on GPU servers (including host memory and device memory) to manage trajectory data. Furthermore, it facilitates decentralized GPU devices to expedite various trajectory selection strategies, circumventing computational bottlenecks. GEAR is equipped with GPU kernels capable of collecting trajectories using zero-copy access to host memory, along with remote direct memory access (RDMA) over InfiniBand, improving communication efficiency. Cluster experiments have shown that GEAR can achieve performance levels up to 6x greater than Reverb when training state-of-the-art large RL models. GEAR is open-sourced at https://github.com/bigrl-team/gear.

* ICML 2023

Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training

Sep 29, 2023

Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, Jun Wang

Abstract: Large language models (LLMs) typically employ sampling or beam search, accompanied by prompts such as Chain-of-Thought (CoT), to boost reasoning and decoding ability. Recent work like Tree-of-Thought (ToT) and Reasoning via Planning (RAP) aims to augment the reasoning capabilities of LLMs by utilizing tree-search algorithms to guide multi-step reasoning. These methods mainly focus on LLMs' reasoning ability during inference and heavily rely on human-designed prompts to activate the LLM as a value function, which lacks general applicability and scalability. To address these limitations, we present an AlphaZero-like tree-search framework for LLMs (termed TS-LLM), systematically illustrating how tree-search with a learned value function can guide LLMs' decoding ability. TS-LLM distinguishes itself in two key ways: (1) Leveraging a learned value function, our approach can be generally applied to different tasks beyond reasoning (such as RLHF alignment), and LLMs of any size, without prompting advanced, large-scale models. (2) It can guide the LLM's decoding during both inference and training. Empirical evaluations across reasoning, planning, and RLHF alignment tasks validate the effectiveness of TS-LLM, even on trees with a depth of 64.

Cross-Utterance Conditioned VAE for Speech Generation

Sep 08, 2023

Yang Li, Cheng Yu, Guangzhi Sun, Weiqin Zu, Zheng Tian, Ying Wen, Wei Pan, Chao Zhang, Jun Wang, Yang Yang(+1 more)

Abstract: Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.

* 13 pages
