Picture for Brian Christian

Brian Christian

AI Assistance Reduces Persistence and Hurts Independent Performance

Add code
Apr 07, 2026
Viaarxiv icon

Reward Models Inherit Value Biases from Pretraining

Add code
Jan 28, 2026
Viaarxiv icon

Self-Blinding and Counterfactual Self-Simulation Mitigate Biases and Sycophancy in Large Language Models

Add code
Jan 21, 2026
Viaarxiv icon

Reward Model Interpretability via Optimal and Pessimal Tokens

Add code
Jun 08, 2025
Figure 1 for Reward Model Interpretability via Optimal and Pessimal Tokens
Figure 2 for Reward Model Interpretability via Optimal and Pessimal Tokens
Figure 3 for Reward Model Interpretability via Optimal and Pessimal Tokens
Figure 4 for Reward Model Interpretability via Optimal and Pessimal Tokens
Viaarxiv icon