Max Nadeau

Circuit Breaking: Removing Model Behaviors with Targeted Ablation

Sep 12, 2023
Maximilian Li, Xander Davies, Max Nadeau

Benchmarks for Detecting Measurement Tampering

Sep 07, 2023
Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas

Measurement Tampering Detection Benchmark

Aug 29, 2023
Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell

Discovering Variable Binding Circuitry with Desiderata

Jul 07, 2023
Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, David Bau

One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features

Oct 11, 2021
Stephen Casper, Max Nadeau, Gabriel Kreiman
