Adam Karvonen

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Dec 17, 2025
Via arXiv

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Jul 22, 2025
Via arXiv

Robustly Improving LLM Fairness in Realistic Settings via Interpretability

Jun 12, 2025
Via arXiv

Revisiting End To End Sparse Autoencoder Training -- A Short Finetune is All You Need

Mar 21, 2025
Via arXiv

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Mar 13, 2025
Via arXiv

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

Nov 28, 2024
Via arXiv

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

Jul 31, 2024
Via arXiv

Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models

Mar 21, 2024
Via arXiv