Adrià Garriga-Alonso

Shammie

Hypothesis Testing the Circuit Hypothesis in LLMs

Oct 16, 2024
Viaarxiv icon

Planning behavior in a recurrent neural network that plays Sokoban

Jul 22, 2024

Adversarial Circuit Evaluation

Jul 21, 2024

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Jul 19, 2024

Investigating the Indirect Object Identification circuit in Mamba

Jul 19, 2024

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Jul 19, 2024

Towards Automated Circuit Discovery for Mechanistic Interpretability

Apr 28, 2023

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Jun 10, 2022

Data augmentation in Bayesian neural networks and the cold posterior effect

Jun 10, 2021

BNNpriors: A library for Bayesian neural network inference with different prior distributions

May 14, 2021