Alert button

Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

Dec 21, 2023
Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, Peter Shaw, Jonathan Berant

Share this with someone who'll enjoy it:

View paper onarxiv icon

Share this with someone who'll enjoy it: