Tsvetomira Dumbalska

Reward Models Inherit Value Biases from Pretraining

Jan 28, 2026
Via arXiv

Reward Model Interpretability via Optimal and Pessimal Tokens

Jun 08, 2025
Via arXiv