Alex McKenzie

Moral Preferences of LLMs Under Directed Contextual Influence

Feb 26, 2026
Via arXiv

Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

Feb 10, 2026
Via arXiv

Endogenous Resistance to Activation Steering in Language Models

Feb 06, 2026
Via arXiv

Detecting High-Stakes Interactions with Activation Probes

Jun 12, 2025
Via arXiv