Title: The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Oct 09, 2024

Yanjun Chen, Dawei Zhu, Yirong Sun, Xinghao Chen, Wei Zhang, Xiaoyu Shen

Abstract: Reinforcement Learning from Human Feedback (RLHF) significantly enhances Natural Language Processing by aligning language models with human expectations. A critical factor in this alignment is the strength of the reward models used during training. This study examines whether stronger reward models invariably lead to better language models. Through experiments on relevance, factuality, and completeness tasks using the QA-FEEDBACK dataset and Longformer-based reward models, we uncover a surprising paradox: language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This finding challenges the widely held belief that stronger reward models always lead to better language models, and it opens new avenues for research into the key factors driving model performance and into how to choose the most suitable reward models. Code and additional details are available at [https://github.com/EIT-NLP/AccuracyParadox-RLHF](https://github.com/EIT-NLP/AccuracyParadox-RLHF).

* 10 pages, 27 figures (including 18 in the appendix), submitted to EMNLP 2024
