Anca Dragan

CoS: Enhancing Personalization and Mitigating Bias with Context Steering

May 02, 2024
Via arXiv

Evaluating Frontier Models for Dangerous Capabilities

Mar 20, 2024
Via arXiv

A Generalized Acquisition Function for Preference-based Reward Learning

Mar 09, 2024
Via arXiv

Preventing Reward Hacking with Occupancy Measure Regularization

Mar 05, 2024
Via arXiv

When Your AIs Deceive You: Challenges with Partial Observability of Human Evaluators in Reward Learning

Mar 03, 2024
Via arXiv

The Effective Horizon Explains Deep RL Performance in Stochastic Environments

Dec 13, 2023
Via arXiv

Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations

Nov 09, 2023
Via arXiv

Offline RL with Observation Histories: Analyzing and Improving Sample Complexity

Oct 31, 2023
Via arXiv

Managing AI Risks in an Era of Rapid Progress

Oct 26, 2023
Via arXiv

Learning Optimal Advantage from Preferences and Mistaking it for Reward

Oct 03, 2023
Via arXiv