Jeremias Ferrao

The Anatomy of Alignment: Decomposing Preference Optimization by Steering Sparse Features

Sep 16, 2025
Via arXiv

Self-Ablating Transformers: More Interpretability, Less Sparsity

May 01, 2025
Via arXiv

World Model Agents with Change-Based Intrinsic Motivation

Mar 26, 2025
Via arXiv