
Stuart Russell

Berkeley

Social Choice for AI Alignment: Dealing with Diverse Human Feedback

Apr 16, 2024

When Your AIs Deceive You: Challenges with Partial Observability of Human Evaluators in Reward Learning

Mar 03, 2024

Avoiding Catastrophe in Continuous Spaces by Asking for Help

Feb 12, 2024

ALMANACS: A Simulatability Benchmark for Language Model Explainability

Dec 20, 2023

The Effective Horizon Explains Deep RL Performance in Stochastic Environments

Dec 13, 2023

Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game

Nov 02, 2023

Managing AI Risks in an Era of Rapid Progress

Oct 26, 2023

Active teacher selection for reinforcement learning from human feedback

Oct 23, 2023

On Representation Complexity of Model-based and Model-free Reinforcement Learning

Oct 03, 2023

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Sep 18, 2023