Jeremias Ferrao

The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

Sep 16, 2025

Jeremias Ferrao, Matthijs van der Lende, Ilija Lichkovski, Clement Neo

Abstract: Aligning large language models is critical for their usability and safety. However, the prevailing approach of Reinforcement Learning from Human Feedback (RLHF) induces diffuse, opaque parameter changes, making it difficult to discern what the model has internalized. Hence, we introduce Feature Steering with Reinforcement Learning (FSRL), a transparent alignment framework that trains a lightweight adapter to steer behavior by modulating interpretable features from a Sparse Autoencoder (SAE). First, we demonstrate that FSRL is an effective method for preference optimization and is comparable with current RLHF methods. We then perform mechanistic analysis on the trained adapter, and find that its policy systematically promotes style features over explicit alignment concepts, suggesting that the preference optimization process rewards stylistic presentation as a proxy for quality. Ultimately, we hope that FSRL provides a tool for both interpretable model control and diagnosing the internal mechanisms of alignment.

* Work in Progress

Self-Ablating Transformers: More Interpretability, Less Sparsity

May 01, 2025

Jeremias Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero Antolin

Abstract: A growing intuition in machine learning suggests a link between sparsity and interpretability. We introduce a novel self-ablation mechanism to investigate this connection ante-hoc in the context of language transformers. Our approach dynamically enforces a k-winner-takes-all constraint, forcing the model to demonstrate selective activation across neuron and attention units. Unlike post-hoc methods that analyze already-trained models, our approach integrates interpretability directly into model training, promoting feature localization from inception. Training small models on the TinyStories dataset and employing interpretability tests, we find that self-ablation leads to more localized circuits, concentrated feature representations, and increased neuron specialization without compromising language modelling performance. Surprisingly, our method also decreased overall sparsity, indicating that self-ablation promotes specialization rather than widespread inactivity. This reveals a complex interplay between sparsity and interpretability, where decreased global sparsity can coexist with increased local specialization, leading to enhanced interpretability. To facilitate reproducibility, we make our code available at https://github.com/keenanpepper/self-ablating-transformers.

* Poster Presentation at Building Trust Workshop at ICLR 2025
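
The k-winner-takes-all constraint named in the abstract above can be illustrated with a small, generic sketch. This is not the authors' self-ablation mechanism (their code is at the repository linked in the abstract); the function name `k_winners_take_all` and the list-based representation are assumptions made purely for illustration. The idea shown is only the basic operation: keep the k largest activations and zero out the rest.

```python
from typing import List


def k_winners_take_all(activations: List[float], k: int) -> List[float]:
    """Keep the k largest activations and set all others to zero.

    Generic illustration only; hypothetical helper, not the paper's code.
    """
    if k <= 0:
        return [0.0] * len(activations)
    if k >= len(activations):
        return list(activations)
    # Rank unit indices by activation, largest first. Python's sort is stable,
    # so ties are resolved in favour of the earlier index.
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i],
                    reverse=True)
    winners = set(ranked[:k])
    # Winners pass through unchanged; every other unit is ablated to zero.
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


if __name__ == "__main__":
    print(k_winners_take_all([0.1, 0.9, -0.3, 0.5], k=2))
```

In a trained model the same masking would typically be applied per token to a layer's neuron or attention-head outputs, with the choice of k controlling how selective the activation pattern is.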

World Model Agents with Change-Based Intrinsic Motivation

Mar 26, 2025

Jeremias Ferrao, Rafael Cunha

Abstract: Sparse reward environments pose a significant challenge for reinforcement learning due to the scarcity of feedback. Intrinsic motivation and transfer learning have emerged as promising strategies to address this issue. Change Based Exploration Transfer (CBET), a technique that combines these two approaches for model-free algorithms, has shown potential in addressing sparse feedback, but its effectiveness with modern algorithms remains understudied. This paper provides an adaptation of CBET for world model algorithms like DreamerV3 and compares the performance of DreamerV3 and IMPALA agents, both with and without CBET, in the sparse reward environments of Crafter and Minigrid. Our tabula rasa results highlight the possibility of CBET improving DreamerV3's returns in Crafter, but the algorithm attains a suboptimal policy in Minigrid, with CBET further reducing returns. In the same vein, our transfer learning experiments show that pre-training DreamerV3 with intrinsic rewards does not immediately lead to a policy that maximizes extrinsic rewards in Minigrid. Overall, our results suggest that CBET provides a positive impact on DreamerV3 in more complex environments like Crafter but may be detrimental in environments like Minigrid. In the latter case, the behaviours promoted by CBET in DreamerV3 may not align with the task objectives of the environment, leading to reduced returns and suboptimal policies.

* Submitted to Northern Lights Deep Learning Conference 2025
