Gabriel Mongaras

2Mamba2Furious: Linear in Complexity, Competitive in Accuracy

Feb 19, 2026

Gabriel Mongaras, Eric C. Larson

Abstract: Linear attention transformers have become a strong alternative to softmax attention due to their efficiency. However, linear attention tends to be less expressive and results in reduced accuracy compared to softmax attention. To bridge the accuracy gap between softmax attention and linear attention, we manipulate Mamba-2, a very strong linear attention variant. We first simplify Mamba-2 down to its most fundamental and important components, evaluating which specific choices make it most accurate. From this simplified Mamba variant (Mamba-2S), we improve the A-mask and increase the order of the hidden state, resulting in a method, which we call 2Mamba, that is nearly as accurate as softmax attention, yet much more memory efficient for long context lengths. We also investigate elements of Mamba-2 that help surpass softmax attention accuracy. Code is provided for all our experiments.

On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective

Jul 31, 2025

Gabriel Mongaras, Eric C. Larson

Abstract: Since its introduction, softmax attention has become the backbone of modern transformer architectures due to its expressiveness and scalability across a wide range of tasks. However, the main drawback of softmax attention is its quadratic memory requirement and computational complexity with respect to the sequence length. By replacing the softmax nonlinearity, linear attention and similar methods have been introduced to avoid the quadratic bottleneck of softmax attention. Despite these linear forms of attention being derived from the original softmax formulation, they typically lag in terms of downstream accuracy. While strong intuition about applying the softmax nonlinearity to the query and key inner product suggests that it has desirable properties compared to other nonlinearities, the question of why this discrepancy exists remains unanswered. This work demonstrates that linear attention is an approximation of softmax attention by deriving the recurrent form of softmax attention. Using this form, each part of softmax attention can be described in the language of recurrent neural networks (RNNs). Describing softmax attention as an RNN allows for the ablation of the components of softmax attention to understand the importance of each part and how they interact. In this way, our work helps explain why softmax attention is more expressive than its counterparts.
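
The abstract states that linear attention approximates softmax attention but does not spell out the derivation. One way to see such a relationship, assumed here rather than taken from the abstract, is to truncate the Taylor series of the exponential applied to the query-key score. The sketch below is illustrative only: the function names (`taylor_exp`, `causal_attention`, `max_abs_diff`) are hypothetical, the usual 1/sqrt(d) scaling is omitted, and the inputs are kept small so the truncated weights stay positive.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def taylor_exp(x, order):
    # Truncated Taylor series of exp(x): sum over n <= order of x^n / n!
    return sum(x ** n / math.factorial(n) for n in range(order + 1))

def causal_attention(Q, K, V, kernel):
    # Normalized causal attention with weight kernel(q_t . k_s) for s <= t.
    # kernel=math.exp gives (unscaled) softmax attention.
    out = []
    for t, q in enumerate(Q):
        w = [kernel(dot(q, K[s])) for s in range(t + 1)]
        z = sum(w)
        out.append([
            sum(w[s] * V[s][j] for s in range(t + 1)) / z
            for j in range(len(V[0]))
        ])
    return out

def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))

# Small illustrative inputs (assumed values, not from any paper).
Q = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]
K = [[0.2, 0.1], [-0.3, 0.5], [0.4, 0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]

softmax_out = causal_attention(Q, K, V, math.exp)
for order in (1, 2, 4, 8):
    approx = causal_attention(Q, K, V, lambda x: taylor_exp(x, order))
    print(order, max_abs_diff(softmax_out, approx))
```

At order 1 the kernel is 1 + q·k, which factorizes as an inner product of the feature maps [1, q] and [1, k], so it is a linear attention kernel that admits a constant-size recurrent state. Higher orders bring the result closer to softmax attention on these small scores, at the cost of larger feature maps.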

Cottention: Linear Transformers With Cosine Attention

Sep 27, 2024

Gabriel Mongaras, Trevor Dohm, Eric C. Larson

Abstract: Attention mechanisms, particularly softmax attention, have been instrumental in the success of transformer-based models such as GPT. However, the quadratic memory complexity of softmax attention with respect to sequence length poses significant challenges for processing longer sequences. We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity. By leveraging the properties of cosine similarity and rearranging the attention equation, Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention. We demonstrate that Cottention can be reformulated as a recurrent neural network (RNN) with a finite hidden state, allowing for constant memory usage during inference. We evaluate Cottention on both the bidirectional BERT and causal GPT tasks, demonstrating comparable performance to softmax attention while significantly reducing memory requirements. To ensure efficient computation, we develop a custom CUDA kernel for Cottention. Our results show that Cottention is a promising alternative to softmax attention, enabling the processing of longer sequences without sacrificing performance, due to its native linear memory complexity and ability to maintain a constant memory footprint during inference.

* 12 pages, 5 figures
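
The rearrangement described in the Cottention abstract can be illustrated with a minimal sketch. It assumes plain cosine similarity between L2-normalized queries and keys, with no further scaling or normalization terms, which the published method may include. The function names are hypothetical and this is not the authors' CUDA kernel. The first function builds every pairwise score explicitly; the second keeps only a d_k x d_v state matrix, as in an RNN with a finite hidden state.

```python
import math

def normalize(v, eps=1e-12):
    # L2-normalize a vector; eps guards against division by zero.
    n = math.sqrt(sum(x * x for x in v))
    return [x / (n + eps) for x in v]

def cosine_attention_quadratic(Q, K, V):
    # Causal cosine attention via explicit scores: O(T^2) score evaluations.
    Qn = [normalize(q) for q in Q]
    Kn = [normalize(k) for k in K]
    d_v = len(V[0])
    out = []
    for t, q in enumerate(Qn):
        o = [0.0] * d_v
        for s in range(t + 1):
            score = sum(a * b for a, b in zip(q, Kn[s]))
            for j in range(d_v):
                o[j] += score * V[s][j]
        out.append(o)
    return out

def cosine_attention_recurrent(Q, K, V):
    # Same outputs from a running state S = sum over s <= t of k_s v_s^T.
    # The state size is d_k x d_v, independent of sequence length.
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        qn, kn = normalize(q), normalize(k)
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += kn[i] * v[j]
        out.append([sum(qn[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out
```

Both functions compute the same values because (Q K^T) V = Q (K^T V) once no softmax sits between the two products; only the order of multiplication, and therefore the memory footprint, differs.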
