
Stuart Russell

Berkeley

Monitoring Latent World States in Language Models with Propositional Probes

Jun 27, 2024

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

Jun 02, 2024

Diffusion On Syntax Trees For Program Synthesis

May 30, 2024

AI Alignment with Changing and Influenceable Reward Functions

May 28, 2024

Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

May 10, 2024

Social Choice for AI Alignment: Dealing with Diverse Human Feedback

Apr 16, 2024

When Your AIs Deceive You: Challenges with Partial Observability of Human Evaluators in Reward Learning

Mar 03, 2024

Avoiding Catastrophe in Continuous Spaces by Asking for Help

Feb 12, 2024

ALMANACS: A Simulatability Benchmark for Language Model Explainability

Dec 20, 2023

The Effective Horizon Explains Deep RL Performance in Stochastic Environments

Dec 13, 2023