Andy Yang

Probability Distributions Computed by Hard-Attention Transformers

Oct 31, 2025

Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski, Ryan Cotterell, David Chiang

Abstract: Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use case as language models.

* 18 pages

Simulating Hard Attention Using Soft Attention

Dec 13, 2024

Andy Yang, Lena Strobl, David Chiang, Dana Angluin

Abstract: We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several variants of linear temporal logic, whose formulas have previously been shown to be computable using hard attention transformers. We demonstrate how soft attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate a large subclass of average-hard attention transformers, those that have what we call the uniform-tieless property.
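
The effect of temperature scaling described in this abstract can be illustrated numerically. The sketch below is not the paper's construction: the scores, the fixed temperatures, and the helper names `softmax` and `average_hard_attention` are all hypothetical, and the abstract does not say how the temperature should be chosen. It only shows that dividing scores by a smaller temperature pushes softmax weights toward the weights that average-hard attention would assign.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax over scores divided by a temperature (hypothetical helper)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def average_hard_attention(scores):
    """Uniform weight over all maximal-score positions, zero elsewhere."""
    m = max(scores)
    winners = [1.0 if s == m else 0.0 for s in scores]
    count = sum(winners)
    return [w / count for w in winners]

def max_gap(scores, temperature):
    """Largest per-position difference between soft and hard weights."""
    soft = softmax(scores, temperature)
    hard = average_hard_attention(scores)
    return max(abs(a - b) for a, b in zip(soft, hard))

# Illustrative scores only; the temperatures are assumptions, not from the paper.
scores = [1.0, 3.0, 2.0, 0.5]
for t in (1.0, 0.1, 0.01):
    print(f"temperature={t}: gap to hard attention = {max_gap(scores, t):.6f}")
```

In this toy setting the gap shrinks as the temperature decreases, because the unique maximum score dominates the normalized exponentials.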

A Formal Framework for Understanding Length Generalization in Transformers

Oct 03, 2024

Xinting Huang, Andy Yang, Satwik Bhattamishra, Yash Sarrof, Andreas Krebs, Hattie Zhou, Preetum Nakkiran, Michael Hahn

Abstract: A major challenge for transformers is generalizing to sequences longer than those observed during training. While previous works have empirically shown that transformers can either succeed or fail at length generalization depending on the task, theoretical understanding of this phenomenon remains limited. In this work, we introduce a rigorous theoretical framework to analyze length generalization in causal transformers with learnable absolute positional encodings. In particular, we characterize those functions that are identifiable in the limit from sufficiently long inputs with absolute positional encodings under an idealized inference scheme using a norm-based regularizer. This enables us to prove the possibility of length generalization for a rich family of problems. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks. Our theory not only explains a broad set of empirical observations but also opens the way to provably predicting length generalization capabilities in transformers.

Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers

Apr 05, 2024

Andy Yang, David Chiang

Abstract: Deriving formal bounds on the expressivity of transformers, as well as studying transformers that are constructed to implement known algorithms, are both effective methods for better understanding the computational power of transformers. Towards both ends, we introduce the temporal counting logic $\textbf{K}_\text{t}$[#] alongside the RASP variant $\textbf{C-RASP}$. We show they are equivalent to each other, and that together they are the best-known lower bound on the formal expressivity of future-masked soft attention transformers with unbounded input size. We prove this by showing all $\textbf{K}_\text{t}$[#] formulas can be compiled into these transformers. As a case study, we demonstrate on paper how to use $\textbf{C-RASP}$ to construct simple transformer language models that, using greedy decoding, can only generate sentences that have given properties formally specified in $\textbf{K}_\text{t}$[#].

Masked Hard-Attention Transformers and Boolean RASP Recognize Exactly the Star-Free Languages

Oct 21, 2023

Dana Angluin, David Chiang, Andy Yang

Abstract: We consider transformer encoders with hard attention (in which all attention is focused on exactly one position) and strict future masking (in which each position only attends to positions strictly to its left), and prove that the class of languages recognized by these networks is exactly the star-free languages. Adding position embeddings increases the class of recognized languages to other well-studied classes. A key technique in these proofs is Boolean RASP, a variant of RASP that is restricted to Boolean values. Via the star-free languages, we relate transformers to first-order logic, temporal logic, and algebraic automata theory.
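
The two restrictions named in this abstract, hard attention and strict future masking, can be sketched as a single attention step. This is a minimal illustration under stated assumptions rather than the paper's formal definition: the function name `strict_masked_hard_attention`, the explicit score matrix, the `tie` parameter for breaking ties, and the use of `None` when no position lies strictly to the left are all hypothetical choices made for the example.

```python
def strict_masked_hard_attention(scores, values, tie="left"):
    """One hard-attention step with strict future masking (illustrative sketch).

    scores[i][j] is the attention score of query position i for key position j.
    Position i may only attend to positions j < i, and all of its attention
    goes to exactly one of them: a position with the maximal score.
    """
    outputs = []
    for i in range(len(values)):
        if i == 0:
            # Assumption: with nothing strictly to the left, return a
            # placeholder; a real model would need some default here.
            outputs.append(None)
            continue
        candidates = range(i)  # positions strictly to the left of i
        best = max(scores[i][j] for j in candidates)
        winners = [j for j in candidates if scores[i][j] == best]
        # Assumption: ties are broken by taking the leftmost or rightmost winner.
        chosen = winners[0] if tie == "left" else winners[-1]
        outputs.append(values[chosen])
    return outputs

# Hypothetical scores and values for a length-3 input.
scores = [
    [0, 0, 0],
    [2, 9, 9],  # only j = 0 is visible to position 1
    [1, 1, 5],  # j = 0 and j = 1 tie; j = 2 is masked out
]
values = ["a", "b", "c"]
print(strict_masked_hard_attention(scores, values, tie="left"))
print(strict_masked_hard_attention(scores, values, tie="right"))
```

Note that the large scores at or to the right of the query position never influence the result, which is what strict masking means in this sketch.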
