Luna Mendez

Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders

Oct 12, 2023
Luke Marks, Amir Abdullah, Luna Mendez, Rauno Arike, Philip Torr, Fazl Barez

Via arXiv