Evan Hubinger

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024
Via arXiv

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Apr 25, 2024
Via arXiv

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024
Via arXiv

Steering Llama 2 via Contrastive Activation Addition

Dec 09, 2023
Via arXiv

Studying Large Language Model Generalization with Influence Functions

Aug 07, 2023
Via arXiv

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Jul 25, 2023
Via arXiv

Measuring Faithfulness in Chain-of-Thought Reasoning

Jul 17, 2023
Via arXiv

Conditioning Predictive Models: Risks and Strategies

Feb 06, 2023
Via arXiv

Discovering Language Model Behaviors with Model-Written Evaluations

Dec 19, 2022
Via arXiv

Engineering Monosemanticity in Toy Models

Nov 16, 2022
Via arXiv