Title: RRHF: Rank Responses to Align Language Models with Human Feedback without tears

Apr 11, 2023

Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang

Abstract: Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and these models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). PPO, however, is sensitive to hyperparameters and requires a minimum of four models in its standard implementation, which makes it hard to train. In contrast, we propose a novel learning paradigm called RRHF, which scores responses generated by different sampling policies and learns to align them with human preferences through a ranking loss. RRHF aligns language model output probabilities with human preferences efficiently and as robustly as fine-tuning, and it needs only 1 to 2 models during tuning. In addition, RRHF can be considered an extension of SFT and reward models while being simpler than PPO in terms of coding, model counts, and hyperparameters. The entire alignment process can be accomplished within a single RRHF training session. We evaluate RRHF using LLaMA and Alpaca on Helpful and Harmless data, demonstrating performance comparable to PPO.

* Code available at https://github.com/GanjinZero/RRHF
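
The abstract describes scoring several candidate responses and training with a ranking loss, but it does not spell out the formula. The following is a minimal sketch of one plausible reading, not the authors' implementation: each response is scored by its length-normalized log-probability under the model, and a hinge penalty is added whenever a lower-reward response outscores a higher-reward one. The function names, the hinge form, and the length normalization are assumptions made for illustration.

```python
from typing import Sequence


def length_normalized_score(token_logprobs: Sequence[float]) -> float:
    """Average per-token log-probability of one response (assumed scoring rule)."""
    if not token_logprobs:
        raise ValueError("a response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def pairwise_ranking_loss(scores: Sequence[float], rewards: Sequence[float]) -> float:
    """Sum of max(0, score_i - score_j) over all pairs with reward_i < reward_j.

    The loss is zero when the model's scores order the responses the same
    way the rewards do, and grows with each mis-ordered pair.
    """
    if len(scores) != len(rewards):
        raise ValueError("scores and rewards must have the same length")
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[i] < rewards[j]:
                # Penalize only when the worse response is scored higher.
                loss += max(0.0, scores[i] - scores[j])
    return loss


# Hypothetical example: three candidate responses to one prompt.
scores = [-0.5, -1.0, -2.0]   # model scores (higher = more likely)
rewards = [0.1, 0.9, 0.5]     # human or reward-model preference scores
print(pairwise_ranking_loss(scores, rewards))
```

In this example the first response has the lowest reward but the highest model score, so it contributes a penalty against both of the other responses. In a full training setup a term like this would typically be combined with an ordinary fine-tuning loss on the best-ranked response, which matches the abstract's framing of RRHF as an extension of SFT.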
