Daniel Balcells

Linear Personality Probing and Steering in LLMs: A Big Five Study

Dec 19, 2025

Michel Frising, Daniel Balcells

Abstract: Large language models (LLMs) exhibit distinct and consistent personalities that greatly impact trust and engagement. While this means that personality frameworks would be highly valuable tools to characterize and control LLMs' behavior, current approaches remain either costly (post-training) or brittle (prompt engineering). Probing and steering via linear directions has recently emerged as a cheap and efficient alternative. In this paper, we investigate whether linear directions aligned with the Big Five personality traits can be used for probing and steering model behavior. Using Llama 3.3 70B, we generate descriptions of 406 fictional characters and their Big Five trait scores. We then prompt the model with these descriptions and questions from the Alpaca questionnaire, allowing us to sample hidden activations that vary along personality traits in known, quantifiable ways. Using linear regression, we learn a set of per-layer directions in activation space, and test their effectiveness for probing and steering model behavior. Our results suggest that linear directions aligned with trait scores are effective probes for personality detection, while their steering capabilities depend strongly on context: they produce reliable effects in forced-choice tasks but have limited influence in open-ended generation or when additional context is present in the prompt.

* 29 pages, 6 figures
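
As a rough illustration of the probing-and-steering idea in the abstract above, the following is a minimal sketch, not the authors' implementation. It fits a linear regression from activations to a single trait score for one layer, then uses the fitted weights as a probe and the normalized weights as a steering direction. The activations here are synthetic stand-ins; the function names, the dimensionality `d`, the noise level, and the steering strength `alpha` are all assumptions made for illustration. In practice the same fit would presumably be repeated per layer and per trait.

```python
import numpy as np

def fit_trait_direction(activations, scores):
    """Least-squares fit of scores ~ activations @ w + b.

    Returns the unit-norm direction, the raw weights w, and the bias b.
    """
    # Append a column of ones so the intercept is fitted jointly.
    X = np.hstack([activations, np.ones((activations.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    w, b = coef[:-1], coef[-1]
    return w / np.linalg.norm(w), w, b

def probe(activations, w, b):
    """Predict trait scores from activations using the fitted weights."""
    return activations @ w + b

def steer(activation, direction, alpha):
    """Shift an activation along the trait direction by strength alpha."""
    return activation + alpha * direction

# Synthetic stand-in data (assumption): 406 samples, one per fictional
# character, with activations that vary linearly with the trait score.
rng = np.random.default_rng(0)
n, d = 406, 16
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
scores = rng.uniform(1.0, 5.0, size=n)
acts = 0.1 * rng.normal(size=(n, d)) + np.outer(scores, true_dir)

direction, w, b = fit_trait_direction(acts, scores)
pred = probe(acts, w, b)

# Steering one activation should raise its probed trait score.
before = probe(acts[0], w, b)
after = probe(steer(acts[0], direction, alpha=2.0), w, b)
```

On this toy data the probe recovers the planted direction almost exactly, which says nothing about real models: the abstract reports that steering effects depend heavily on context.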

Evolution of SAE Features Across Layers in LLMs

Oct 11, 2024

Daniel Balcells, Benjamin Lerner, Michael Oesterle, Ediz Ucar, Stefan Heimersheim

Abstract: Sparse Autoencoders for transformer-based language models are typically defined independently per layer. In this work we analyze statistical relationships between features in adjacent layers to understand how features evolve through a forward pass. We provide a graph visualization interface for features and their most similar next-layer neighbors, and build communities of related features across layers. We find that a considerable number of features are passed through from a previous layer, some features can be expressed as quasi-boolean combinations of previous features, and some features become more specialized in later layers.
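
One way to picture the "most similar next-layer neighbors" mentioned in the abstract above is sketched below. The abstract does not say which statistic is used, so Pearson correlation between feature activations over a shared set of tokens is an assumption here, as are the function names and the synthetic data. The toy setup plants pass-through features by copying each layer-A feature into a shuffled column of layer B with a little noise.

```python
import numpy as np

def cross_layer_correlation(acts_a, acts_b, eps=1e-8):
    """Pearson correlation between every feature pair across two layers.

    acts_a: (n_tokens, n_features_a), acts_b: (n_tokens, n_features_b).
    Returns a matrix of shape (n_features_a, n_features_b).
    """
    a = acts_a - acts_a.mean(axis=0)
    b = acts_b - acts_b.mean(axis=0)
    a /= np.linalg.norm(a, axis=0) + eps
    b /= np.linalg.norm(b, axis=0) + eps
    return a.T @ b

def nearest_next_layer(corr):
    """For each earlier-layer feature, index of its best-matching later feature."""
    return corr.argmax(axis=1)

# Synthetic stand-in data (assumption): sparse, non-negative activations.
rng = np.random.default_rng(0)
n_tokens = 1000
layer_a = np.maximum(rng.normal(size=(n_tokens, 4)), 0.0)

# Feature i in layer A reappears as column perm[i] in layer B, plus noise.
perm = np.array([2, 0, 3, 1])
layer_b = np.empty_like(layer_a)
layer_b[:, perm] = layer_a
layer_b += 0.05 * rng.normal(size=layer_b.shape)

corr = cross_layer_correlation(layer_a, layer_b)
neighbors = nearest_next_layer(corr)
```

Here `neighbors` recovers the planted permutation, which is the pass-through case. Features formed as combinations of earlier features would show up as one later-layer feature correlating moderately with several earlier ones, and a single argmax would not capture that.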
