Adam Gleave

Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban

Jun 11, 2025
Via arXiv

Preference Learning with Lie Detectors can Induce Honesty or Evasion

May 20, 2025
Via arXiv

Multi-Agent Risks from Advanced AI

Feb 19, 2025
Via arXiv

Scaling Laws for Data Poisoning in LLMs

Aug 06, 2024
Via arXiv

Exploring Scaling Trends in LLM Robustness

Jul 26, 2024
Via arXiv

Planning behavior in a recurrent neural network that plays Sokoban

Jul 22, 2024
Via arXiv

Can Go AIs be adversarially robust?

Jun 18, 2024
Via arXiv

Uncovering Latent Human Wellbeing in Language Model Embeddings

Feb 19, 2024
Via arXiv

Exploiting Novel GPT-4 APIs

Dec 21, 2023
Via arXiv

STARC: A General Framework For Quantifying Differences Between Reward Functions

Sep 26, 2023
Via arXiv