No exact matches were found for your query. Here are some results similar to "Vipo (variance-reduced Policy Optimization)":


LLMs Can Learn to Reason Via Off-Policy RL (Feb 22, 2026)

A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization (Jan 30, 2026)

Adaptive Replay Buffer for Offline-to-Online Reinforcement Learning (Dec 11, 2025)

Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts (Aug 13, 2025)

Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model (Jul 09, 2025)

VIPO: Value Function Inconsistency Penalized Offline Reinforcement Learning (Apr 16, 2025)

VPO: Leveraging the Number of Votes in Preference Optimization (Oct 30, 2024)

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models (Oct 23, 2024)

Mixture of Attentions For Speculative Decoding (Oct 04, 2024)

Revisiting Experience Replayable Conditions (Feb 15, 2024)