Stuart Russell

When Your AIs Deceive You: Challenges with Partial Observability of Human Evaluators in Reward Learning

Mar 03, 2024
Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons

When Your AI Deceives You: Challenges with Partial Observability of Human Evaluators in Reward Learning

Feb 27, 2024
Leon Lang, Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, Scott Emmons

Avoiding Catastrophe in Continuous Spaces by Asking for Help

Feb 12, 2024
Benjamin Plaut, Hanlin Zhu, Stuart Russell

ALMANACS: A Simulatability Benchmark for Language Model Explainability

Dec 20, 2023
Edmund Mills, Shiye Su, Stuart Russell, Scott Emmons

The Effective Horizon Explains Deep RL Performance in Stochastic Environments

Dec 13, 2023
Cassidy Laidlaw, Banghua Zhu, Stuart Russell, Anca Dragan

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

Nov 02, 2023
Sam Toyer, Olivia Watkins, Ethan Adrian Mendes, Justin Svegliato, Luke Bailey, Tiffany Wang, Isaac Ong, Karim Elmaaroufi, Pieter Abbeel, Trevor Darrell, Alan Ritter, Stuart Russell

Managing AI Risks in an Era of Rapid Progress

Oct 26, 2023
Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann

Active teacher selection for reinforcement learning from human feedback

Oct 23, 2023
Rachel Freedman, Justin Svegliato, Kyle Wray, Stuart Russell

On Representation Complexity of Model-based and Model-free Reinforcement Learning

Oct 03, 2023
Hanlin Zhu, Baihe Huang, Stuart Russell

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Sep 18, 2023
Luke Bailey, Euan Ong, Stuart Russell, Scott Emmons