Fabien Roger

Do Unlearning Methods Remove Information from Language Model Weights?

Oct 11, 2024
Via arXiv

Stress-Testing Capability Elicitation With Password-Locked Models

May 29, 2024
Via arXiv

AI Control: Improving Safety Despite Intentional Subversion

Dec 14, 2023
Via arXiv

Preventing Language Models From Hiding Their Reasoning

Oct 31, 2023
Via arXiv

Benchmarks for Detecting Measurement Tampering

Sep 07, 2023
Via arXiv

Large Language Models Sometimes Generate Purely Negatively-Reinforced Text

Jun 16, 2023
Via arXiv

Language models are better than humans at next-token prediction

Dec 21, 2022
Via arXiv