Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Paul C. Bogdan

Slot Machines: How LLMs Keep Track of Multiple Entities

Apr 22, 2026

Paul C. Bogdan, Jack Lindsey

Abstract:Language models must bind entities to the attributes they possess and maintain several such binding relationships within a context. We study how multiple entities are represented across token positions and whether single tokens can carry bindings for more than one entity. We introduce a multi-slot probing approach that disentangles a single token's residual stream activation to recover information about both the currently described entity and the immediately preceding one. These two kinds of information are encoded in separate and largely orthogonal "current-entity" and "prior-entity" slots. We analyze the functional roles of these slots and find that they serve different purposes. In tandem with the current-entity slot, the prior-entity slot supports relational inferences, such as entity-level induction ("who came after Alice in the story?") and conflict detection between adjacent entities. However, only the current-entity slot is used for explicit factual retrieval questions ("Is anyone in the story tall?" "What is the tall entity's name?") despite these answers being linearly decodable from the prior-entity slot too. Consistent with this limitation, open-weight models perform near chance accuracy at processing syntax that forces two subject-verb-object bindings on a single token (e.g., "Alice prepares and Bob consumes food.") Interestingly, recent frontier models can parse this properly, suggesting they may have developed more sophisticated binding strategies. Overall, our results expose a gap between information that is available in activations and information the model actually uses, and suggest that the current/prior-entity slot structure is a natural substrate for behaviors that require holding two perspectives at once, such as sycophancy and deception.

Via

Access Paper or Ask Questions

Thought Branches: Interpreting LLM Reasoning Requires Resampling

Oct 31, 2025

Uzay Macar, Paul C. Bogdan, Senthooran Rajamanoharan, Neel Nanda

Abstract:Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, it can be understood by sampling. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In "agentic misalignment" scenarios, we resample specific sentences to measure their downstream effects. Self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to CoT sufficient for steering reasoning? These are common in literature, yet take the model off-policy. Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find off-policy interventions yield small and unstable effects compared to resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may repeat it post-edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal but have large effects when eliminated. Fourth, since CoT is sometimes "unfaithful", can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints that have a causal effect on the output without being explicitly mentioned exert a subtle and cumulative influence on the CoT that persists even if the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.

* Uzay Macar and Paul C. Bogdan contributed equally to this work, and their listed order was determined by coinflip

Via

Access Paper or Ask Questions

Thought Anchors: Which LLM Reasoning Steps Matter?

Jun 23, 2025

Paul C. Bogdan, Uzay Macar, Neel Nanda, Arthur Conmy

Abstract:Reasoning large language models have recently achieved state-of-the-art performance in many fields. However, their long-form chain-of-thought reasoning creates interpretability challenges as each generated token depends on all previous ones, making the computation harder to decompose. We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method measuring each sentence's counterfactual importance by comparing final answers across 100 rollouts conditioned on the model generating that sentence or one with a different meaning; (2) a white-box method of aggregating attention patterns between pairs of sentences, which identified ``broadcasting'' sentences that receive disproportionate attention from all future sentences via ``receiver'' attention heads; (3) a causal attribution method measuring logical connections between sentences by suppressing attention toward one sentence and measuring the effect on each future sentence's tokens. Each method provides evidence for the existence of thought anchors, reasoning steps that have outsized importance and that disproportionately influence the subsequent reasoning process. These thought anchors are typically planning or backtracking sentences. We provide an open-source tool (www.thought-anchors.com) for visualizing the outputs of our methods, and present a case study showing converging patterns across methods that map how a model performs multi-step reasoning. The consistency across methods demonstrates the potential of sentence-level analysis for a deeper understanding of reasoning models.

* Paul C. Bogdan and Uzay Macar contributed equally to this work, and their listed order was determined by coinflip. Neel Nanda and Arthur Conmy contributed equally to this work as senior authors, and their listed order was determined by coinflip

Via

Access Paper or Ask Questions

Emergent effects of scaling on the functional hierarchies within large language models

Jan 13, 2025

Paul C. Bogdan

Abstract:Large language model (LLM) architectures are often described as functionally hierarchical: Early layers process syntax, middle layers begin to parse semantics, and late layers integrate information. The present work revisits these ideas. This research submits simple texts to an LLM (e.g., "A church and organ") and extracts the resulting activations. Then, for each layer, support vector machines and ridge regressions are fit to predict a text's label and thus examine whether a given layer encodes some information. Analyses using a small model (Llama-3.2-3b; 28 layers) partly bolster the common hierarchical perspective: Item-level semantics are most strongly represented early (layers 2-7), then two-item relations (layers 8-12), and then four-item analogies (layers 10-15). Afterward, the representation of items and simple relations gradually decreases in deeper layers that focus on more global information. However, several findings run counter to a steady hierarchy view: First, although deep layers can represent document-wide abstractions, deep layers also compress information from early portions of the context window without meaningful abstraction. Second, when examining a larger model (Llama-3.3-70b-Instruct), stark fluctuations in abstraction level appear: As depth increases, two-item relations and four-item analogies initially increase in their representation, then markedly decrease, and afterward increase again momentarily. This peculiar pattern consistently emerges across several experiments. Third, another emergent effect of scaling is coordination between the attention mechanisms of adjacent layers. Across multiple experiments using the larger model, adjacent layers fluctuate between what information they each specialize in representing. In sum, an abstraction hierarchy often manifests across layers, but large models also deviate from this structure in curious ways.

Via

Access Paper or Ask Questions