Zhaoran Wang

Hindsight Planner: A Closed-Loop Few-Shot Planner for Embodied Instruction Following

Dec 27, 2024

An Instrumental Value for Data Production and its Application to Data Pricing

Dec 24, 2024

DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs

Nov 20, 2024

Language-Model-Assisted Bi-Level Programming for Reward Learning from Internet Videos

Oct 11, 2024

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Oct 10, 2024

Just say what you want: only-prompting self-rewarding online preference optimization

Sep 26, 2024

Safe MPC Alignment with Human Directional Feedback

Jul 05, 2024

Toward Optimal LLM Alignments Using Two-Player Games

Jun 16, 2024

Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

May 29, 2024

Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

May 26, 2024