Brian Christian

Reward Model Interpretability via Optimal and Pessimal Tokens

Jun 08, 2025
Via arXiv