Picture for Max Nadeau

Max Nadeau

Circuit Breaking: Removing Model Behaviors with Targeted Ablation

Add code
Sep 12, 2023
Viaarxiv icon

Benchmarks for Detecting Measurement Tampering

Add code
Sep 07, 2023
Figure 1 for Benchmarks for Detecting Measurement Tampering
Figure 2 for Benchmarks for Detecting Measurement Tampering
Figure 3 for Benchmarks for Detecting Measurement Tampering
Figure 4 for Benchmarks for Detecting Measurement Tampering
Viaarxiv icon

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Add code
Jul 27, 2023
Figure 1 for Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Figure 2 for Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Figure 3 for Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Figure 4 for Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Viaarxiv icon

Discovering Variable Binding Circuitry with Desiderata

Add code
Jul 07, 2023
Figure 1 for Discovering Variable Binding Circuitry with Desiderata
Figure 2 for Discovering Variable Binding Circuitry with Desiderata
Figure 3 for Discovering Variable Binding Circuitry with Desiderata
Figure 4 for Discovering Variable Binding Circuitry with Desiderata
Viaarxiv icon

One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features

Add code
Oct 11, 2021
Figure 1 for One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features
Figure 2 for One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features
Figure 3 for One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features
Figure 4 for One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features
Viaarxiv icon