Abhay Sheshadri

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Feb 26, 2026
Via arXiv

Obfuscated Activations Bypass LLM Latent-Space Defenses

Dec 12, 2024
Via arXiv

Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization

Oct 16, 2024
Via arXiv

Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Jul 22, 2024
Via arXiv

A Mechanistic Analysis of a Transformer Trained on a Symbolic Multi-Step Reasoning Task

Feb 28, 2024
Via arXiv