Tengyang Xie

Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits
May 26, 2025

Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning
May 21, 2025

Correcting the Mythos of KL-Regularization: Direct Alignment without Overparameterization via Chi-squared Preference Optimization
Jul 18, 2024

Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
Jun 18, 2024

Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models
Jun 06, 2024

Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
May 31, 2024

Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data
Apr 23, 2024

Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Apr 04, 2024

Towards Principled Representation Learning from Videos for Reinforcement Learning
Mar 20, 2024

CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Feb 20, 2024