Robin Young

Knowledge Divergence and the Value of Debate for Scalable Oversight

Mar 05, 2026

Why Is RLHF Alignment Shallow? A Gradient Analysis

Mar 05, 2026

Why Does RLAIF Work At All?

Mar 03, 2026

Calibrated Probabilistic Interpolation for GEDI Biomass

Jan 23, 2026

What is Harm? Baby Don't Hurt Me! On the Impossibility of Complete Harm Specification in AI Alignment

Jan 27, 2025

Token Democracy: The Architectural Limits of Alignment in Transformer-Based Language Models

Jan 26, 2025

Who's Driving? Game Theoretic Path Risk of AGI Development

Jan 25, 2025