Anca Dragan

Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback

Nov 04, 2024

Learning to Assist Humans without Inferring Rewards

Nov 04, 2024

Trajectory Improvement and Reward Learning from Comparative Language Feedback

Oct 08, 2024

Imagen 3

Aug 13, 2024

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Aug 09, 2024

Gemma 2: Improving Open Language Models at a Practical Size

Aug 02, 2024

Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making

Jun 24, 2024

Adversaries Can Misuse Combinations of Safe Models

Jun 20, 2024

Coprocessor Actor Critic: A Model-Based Reinforcement Learning Approach For Adaptive Brain Stimulation

Jun 10, 2024

AI Alignment with Changing and Influenceable Reward Functions

May 28, 2024