Fazl Barez

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Feb 23, 2024
Clement Neo, Shay B. Cohen, Fazl Barez

Via arXiv

Increasing Trust in Language Models through the Reuse of Verified Circuits

Feb 06, 2024
Philip Quirke, Clement Neo, Fazl Barez

Via arXiv

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez

Via arXiv

Large Language Models Relearn Removed Concepts

Jan 03, 2024
Michelle Lo, Shay B. Cohen, Fazl Barez

Via arXiv

Measuring Value Alignment

Dec 23, 2023
Fazl Barez, Philip Torr

Via arXiv

Locating Cross-Task Sequence Continuation Circuits in Transformers

Nov 07, 2023
Michael Lan, Fazl Barez

Via arXiv

Understanding Addition in Transformers

Oct 23, 2023
Philip Quirke, Fazl Barez

Via arXiv

Interpreting Reward Models in RLHF-Tuned Language Models Using Sparse Autoencoders

Oct 12, 2023
Luke Marks, Amir Abdullah, Luna Mendez, Rauno Arike, Philip Torr, Fazl Barez

Via arXiv