Tsvetomira Dumbalska

Reward Model Interpretability via Optimal and Pessimal Tokens

Jun 08, 2025
Via arXiv