Asa Cooper Stickland

Async Control: Stress-testing Asynchronous Control Measures for LLM Agents

Dec 15, 2025
Via arXiv

RepliBench: Evaluating the autonomous replication capabilities of language model agents

Apr 21, 2025
Via arXiv

Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods

Nov 20, 2024
Via arXiv

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Jul 22, 2024
Via arXiv

Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs

Jul 04, 2024
Via arXiv

Steering Without Side Effects: Improving Post-Deployment Control of Language Models

Jun 21, 2024
Via arXiv

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Nov 20, 2023
Via arXiv

The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

Sep 22, 2023
Via arXiv

Taken out of context: On measuring situational awareness in LLMs

Sep 01, 2023
Via arXiv

Robustification of Multilingual Language Models to Real-world Noise with Robust Contrastive Pretraining

Oct 10, 2022
Via arXiv