Anca Dragan

Evaluating Frontier Models for Dangerous Capabilities

Mar 20, 2024
Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, Toby Shevlane

Via arXiv

A Generalized Acquisition Function for Preference-based Reward Learning

Mar 09, 2024
Evan Ellis, Gaurav R. Ghosal, Stuart J. Russell, Anca Dragan, Erdem Bıyık

Via arXiv

Preventing Reward Hacking with Occupancy Measure Regularization

Mar 05, 2024
Cassidy Laidlaw, Shivam Singhal, Anca Dragan

Via arXiv

When Your AIs Deceive You: Challenges with Partial Observability of Human Evaluators in Reward Learning

Mar 03, 2024
Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons

Via arXiv

When Your AI Deceives You: Challenges with Partial Observability of Human Evaluators in Reward Learning

Feb 27, 2024
Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons

Via arXiv

The Effective Horizon Explains Deep RL Performance in Stochastic Environments

Dec 13, 2023
Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan

Via arXiv

Zero-Shot Goal-Directed Dialogue via RL on Imagined Conversations

Nov 09, 2023
Joey Hong, Sergey Levine, Anca Dragan

Via arXiv

Offline RL with Observation Histories: Analyzing and Improving Sample Complexity

Oct 31, 2023
Joey Hong, Anca Dragan, Sergey Levine

Via arXiv

Managing AI Risks in an Era of Rapid Progress

Oct 26, 2023
Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann

Via arXiv

Learning Optimal Advantage from Preferences and Mistaking it for Reward

Oct 03, 2023
W. Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson, Serena Booth, Anca Dragan, Peter Stone, Scott Niekum

Via arXiv