Picture for Brian Christian

Brian Christian

Reward Models Inherit Value Biases from Pretraining

Add code
Jan 28, 2026
Viaarxiv icon

Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models

Add code
Jan 21, 2026
Viaarxiv icon

Reward Model Interpretability via Optimal and Pessimal Tokens

Add code
Jun 08, 2025
Figure 1 for Reward Model Interpretability via Optimal and Pessimal Tokens
Figure 2 for Reward Model Interpretability via Optimal and Pessimal Tokens
Figure 3 for Reward Model Interpretability via Optimal and Pessimal Tokens
Figure 4 for Reward Model Interpretability via Optimal and Pessimal Tokens
Viaarxiv icon