Luke Marks

Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders

Oct 12, 2023
Luke Marks, Amir Abdullah, Luna Mendez, Rauno Arike, Philip Torr, Fazl Barez

Via arXiv