Fabien Roger

Stress-Testing Capability Elicitation With Password-Locked Models

May 29, 2024
Via arXiv

AI Control: Improving Safety Despite Intentional Subversion

Dec 14, 2023
Via arXiv

Preventing Language Models From Hiding Their Reasoning

Oct 31, 2023
Via arXiv

Benchmarks for Detecting Measurement Tampering

Sep 07, 2023
Via arXiv

Large Language Models Sometimes Generate Purely Negatively-Reinforced Text

Jun 16, 2023
Via arXiv

Language models are better than humans at next-token prediction

Dec 21, 2022
Via arXiv