Title: Entropy-Regularized Token-Level Policy Optimization for Large Language Models

Feb 09, 2024

Muning Wen, Cheng Deng, Jun Wang, Weinan Zhang, Ying Wen

Abstract: Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. Traditional approaches often depend on meticulously designed prompts, high-quality examples, or additional reward models for in-context learning, supervised fine-tuning, or RLHF. Reinforcement learning (RL) presents a dynamic alternative for LLMs to overcome these dependencies by engaging directly with task-specific environments. Nonetheless, it faces significant hurdles: 1) instability stemming from the exponentially vast action space requiring exploration; 2) challenges in assigning token-level credit based on action-level reward signals, resulting in discord between maximizing rewards and accurately modeling corpus data. In response to these challenges, we introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. At the heart of ETPO is our novel per-token soft Bellman update, designed to harmonize the RL process with the principles of language modeling. This methodology decomposes the Q-function update from a coarse action-level view to a more granular token-level perspective, backed by theoretical proof of optimization consistency. Crucially, this decomposition yields linear time complexity in action exploration. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks; results show that ETPO achieves effective performance improvement on the CodeLlama-7B model and surpasses a variant PPO baseline inherited from RLHF. This underlines ETPO's potential as a robust method for refining the interactive decision-making capabilities of LLMs.
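To give a rough sense of the "exponentially vast action space" versus the "linear time complexity" mentioned in the abstract, here is a minimal counting sketch. It is not the authors' code and does not implement ETPO. It only assumes that an action is a sequence of `action_len` tokens drawn from a vocabulary of `vocab_size` tokens. Both function names and the numbers used are hypothetical and chosen for illustration.

```python
def action_level_candidates(vocab_size: int, action_len: int) -> int:
    """Number of distinct whole actions when an action is treated as one unit.

    Every position can hold any token, so the count grows exponentially
    with the action length.
    """
    return vocab_size ** action_len


def token_level_candidates(vocab_size: int, action_len: int) -> int:
    """Number of per-token choices when the action is handled one token at a time.

    Each of the action_len steps considers vocab_size tokens, so the count
    grows linearly with the action length.
    """
    return vocab_size * action_len


if __name__ == "__main__":
    # Hypothetical sizes, not taken from the paper.
    vocab_size, action_len = 10, 3
    print(action_level_candidates(vocab_size, action_len))
    print(token_level_candidates(vocab_size, action_len))
```

With realistic vocabularies of tens of thousands of tokens, the action-level count becomes intractable after only a few tokens, while the token-level count grows only in proportion to the action length.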
