
Stephen Casper

Explore, Establish, Exploit: Red Teaming Language Models from Scratch

Jun 21, 2023
Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell

Via arXiv

Benchmarking Interpretability Tools for Deep Neural Networks

Feb 08, 2023
Stephen Casper, Yuxiao Li, Jiawei Li, Tong Bu, Kevin Zhang, Dylan Hadfield-Menell

Via arXiv

Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks

Nov 22, 2022
Stephen Casper, Kaivalya Hariharan, Dylan Hadfield-Menell

Via arXiv

White-Box Adversarial Policies in Deep Reinforcement Learning

Sep 05, 2022
Stephen Casper, Dylan Hadfield-Menell, Gabriel Kreiman

Via arXiv

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks

Jul 28, 2022
Tilman Räuker, Anson Ho, Stephen Casper, Dylan Hadfield-Menell

Via arXiv

Detecting Modularity in Deep Neural Networks

Oct 13, 2021
Shlomi Hod, Stephen Casper, Daniel Filan, Cody Wild, Andrew Critch, Stuart Russell

Via arXiv

One Thing to Fool them All: Generating Interpretable, Universal, and Physically-Realizable Adversarial Features

Oct 11, 2021
Stephen Casper, Max Nadeau, Gabriel Kreiman

Via arXiv

Clusterability in Neural Networks

Mar 04, 2021
Daniel Filan, Stephen Casper, Shlomi Hod, Cody Wild, Andrew Critch, Stuart Russell

Via arXiv