Matthew Rahtz

A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI

Apr 23, 2024

Evaluating Frontier Models for Dangerous Capabilities

Mar 20, 2024

Gemini: A Family of Highly Capable Multimodal Models

Dec 19, 2023

The Hydra Effect: Emergent Self-repair in Language Model Computations

Jul 28, 2023

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Jul 24, 2023

Tracr: Compiled Transformers as a Laboratory for Interpretability

Jan 12, 2023

Safe Deep RL in 3D Environments using Human Feedback

Jan 21, 2022

An Extensible Interactive Interface for Agent Design

Jun 10, 2019