Mikhail Burtsev

GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

Mar 14, 2026

Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev

Abstract:Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key-value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.
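
The write operation described in this abstract (gradient descent on a small memory while the model weights stay frozen) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the GradMem implementation: the "model" is a fixed linear decoder `W`, the memory is a 2-dimensional vector instead of prefix tokens, and the reconstruction loss is squared error instead of a language modeling loss. The names `decode`, `reconstruction_loss`, and `write_memory` are hypothetical.

```python
# Toy analog of a loss-driven memory write. All names and the linear
# "model" are illustrative assumptions, not the paper's architecture.

# Frozen "model": a fixed linear decoder mapping a 2-d memory to a 4-d context.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [1.0, -1.0]]


def decode(memory):
    """Reconstruct the context from memory using the frozen weights W."""
    return [sum(w * m for w, m in zip(row, memory)) for row in W]


def reconstruction_loss(memory, context):
    """Half the squared error between the reconstruction and the context."""
    return 0.5 * sum((p - c) ** 2 for p, c in zip(decode(memory), context))


def write_memory(context, steps, lr=0.1):
    """Run `steps` gradient descent updates on the memory only; W is never changed."""
    memory = [0.0, 0.0]
    losses = [reconstruction_loss(memory, context)]
    for _ in range(steps):
        residual = [p - c for p, c in zip(decode(memory), context)]
        # Gradient of the loss with respect to memory: W^T (W m - x).
        grad = [sum(W[i][j] * residual[i] for i in range(len(W)))
                for j in range(len(memory))]
        memory = [m - lr * g for m, g in zip(memory, grad)]
        losses.append(reconstruction_loss(memory, context))
    return memory, losses


memory, losses = write_memory([2.0, -1.0, 1.0, 3.0], steps=20)
```

In this toy setting each extra gradient step shrinks the reconstruction loss, which mirrors the iterative error correction the abstract contrasts with forward-only writes.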

Q-RAG: Long Context Multi-step Retrieval via Value-based Embedder Training

Nov 10, 2025

Artyom Sorokin, Nazar Buzun, Alexander Anokhin, Oleg Inozemcev, Egor Vedernikov, Petr Anokhin, Mikhail Burtsev, Trushkov Alexey, Yin Wenshuai, Evgeny Burnaev

Abstract:Retrieval-Augmented Generation (RAG) methods enhance LLM performance by efficiently filtering relevant context for LLMs, reducing hallucinations and inference cost. However, most existing RAG methods focus on single-step retrieval, which is often insufficient for answering complex questions that require multi-step search. Recently, multi-step retrieval approaches have emerged, typically involving the fine-tuning of small LLMs to perform multi-step retrieval. This type of fine-tuning is highly resource-intensive and does not enable the use of larger LLMs. In this work, we propose Q-RAG, a novel approach that fine-tunes the Embedder model for multi-step retrieval using reinforcement learning (RL). Q-RAG offers a competitive, resource-efficient alternative to existing multi-step retrieval methods for open-domain question answering and achieves state-of-the-art results on the popular long-context benchmarks BABILong and RULER for contexts up to 10M tokens.

* 16 pages, 3 figures, 2 tables

Limitations of Normalization in Attention Mechanism

Aug 25, 2025

Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State

Abstract:This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low-temperature settings. These findings advance current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
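
One simple way to see a dilution effect of the kind this abstract describes is to place a single token whose logit exceeds all others by a fixed gap and watch its softmax weight shrink as more tokens compete. The sketch below is an illustration under that assumption only; it does not reproduce the paper's bounds or experiments, and the helper name `top_weight` is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_weight(n_tokens, gap):
    """Softmax weight of one token whose logit exceeds the other
    n_tokens - 1 tokens (all at logit 0) by `gap`."""
    logits = [gap] + [0.0] * (n_tokens - 1)
    return softmax(logits)[0]


# With a fixed logit gap, the weight equals exp(gap) / (exp(gap) + n - 1),
# so it decays toward zero roughly like 1/n as the context grows.
for n in (1, 10, 100, 1000, 10000):
    print(n, top_weight(n, gap=2.0))
```

Because the gap is bounded, the informative token's weight stays within a constant factor of the uniform value 1/n, which is the sense in which the distribution flattens in this toy case.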

* 10 pages, 4 figures

Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity

Feb 18, 2025

Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev

Abstract:A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or key-value cache. These approaches make it possible to reduce the amount of compute in existing language models. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than 10x. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to 1500x exist, which highlights a gap of two orders of magnitude between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.

Learning Elementary Cellular Automata with Transformers

Dec 02, 2024

Mikhail Burtsev

Abstract:Large Language Models demonstrate remarkable mathematical capabilities but at the same time struggle with abstract reasoning and planning. In this study, we explore whether Transformers can learn to abstract and generalize the rules governing Elementary Cellular Automata. By training Transformers on state sequences generated with random initial conditions and local rules, we show that they can generalize across different Boolean functions of fixed arity, effectively abstracting the underlying rules. While the models achieve high accuracy in next-state prediction, their performance declines sharply in multi-step planning tasks without intermediate context. Our analysis reveals that including future states or rule prediction in the training loss enhances the models' ability to form internal representations of the rules, leading to improved performance in longer planning horizons and autoregressive generation. Furthermore, we confirm that increasing the model's depth plays a crucial role in extended sequential computations required for complex reasoning tasks. This highlights the potential to improve LLMs by including longer horizons in the loss function, as well as by incorporating recurrence and adaptive computation time for dynamic control of model depth.
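
For readers unfamiliar with the data behind this abstract, the sketch below shows how state sequences of an elementary cellular automaton can be generated from a local rule. It assumes the standard Wolfram rule numbering (0 to 255) and periodic boundaries; both are illustrative choices, since the abstract does not specify the paper's exact setup, and the function names `eca_step` and `eca_rollout` are hypothetical.

```python
def eca_step(state, rule):
    """Apply one step of an elementary cellular automaton.

    state: list of 0/1 cells.
    rule: Wolfram rule number in [0, 255].
    Boundaries wrap around (an assumption made for this sketch).
    """
    n = len(state)
    nxt = []
    for i in range(n):
        left, center, right = state[i - 1], state[i], state[(i + 1) % n]
        neighborhood = (left << 2) | (center << 1) | right  # value in 0..7
        nxt.append((rule >> neighborhood) & 1)  # look up the rule's output bit
    return nxt


def eca_rollout(state, rule, steps):
    """Return the list of states [s_0, s_1, ..., s_steps]."""
    states = [list(state)]
    for _ in range(steps):
        states.append(eca_step(states[-1], rule))
    return states


# Rule 90 sets each cell to the XOR of its two neighbors.
for row in eca_rollout([0, 0, 0, 1, 0, 0, 0], rule=90, steps=2):
    print(row)
```

Next-state prediction corresponds to learning `eca_step` from examples, while the multi-step planning setting mentioned in the abstract corresponds to predicting a later row of the rollout without seeing the rows in between.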

AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents

Jul 05, 2024

Petr Anokhin, Nikita Semenov, Artyom Sorokin, Dmitry Evseev, Mikhail Burtsev, Evgeny Burnaev

Abstract:Advancements in generative AI have broadened the potential applications of Large Language Models (LLMs) in the development of autonomous agents. Achieving true autonomy requires accumulating and updating knowledge gained from interactions with the environment and effectively utilizing it. Current LLM-based approaches leverage past experiences using a full history of observations, summarization or retrieval augmentation. However, these unstructured memory representations do not facilitate the reasoning and planning essential for complex decision-making. In our study, we introduce AriGraph, a novel method wherein the agent constructs a memory graph that integrates semantic and episodic memories while exploring the environment. This graph structure facilitates efficient associative retrieval of interconnected concepts, relevant to the agent's current state and goals, thus serving as an effective environmental model that enhances the agent's exploratory and planning capabilities. We demonstrate that our Ariadne LLM agent, equipped with this proposed memory architecture augmented with planning and decision-making, effectively handles complex tasks on a zero-shot basis in the TextWorld environment. Our approach markedly outperforms established methods such as full-history, summarization, and Retrieval-Augmented Generation in various tasks, including the cooking challenge from the First TextWorld Problems competition and novel tasks like house cleaning and puzzle Treasure Hunting.

* Code for this work is available at https://github.com/AIRI-Institute/AriGraph

Associative Recurrent Memory Transformer

Jul 05, 2024

Ivan Rodkin, Yuri Kuratov, Aydar Bulatov, Mikhail Burtsev

Abstract:This paper addresses the challenge of creating a neural architecture for very long sequences that requires constant time for processing new information at each time step. Our approach, Associative Recurrent Memory Transformer (ARMT), is based on transformer self-attention for local context and segment-level recurrence for storage of task-specific information distributed over a long context. We demonstrate that ARMT outperforms existing alternatives in associative retrieval tasks and sets a new performance record in the recent BABILong multi-task long-context benchmark by answering single-fact questions over 50 million tokens with an accuracy of 79.9%. The source code for training and evaluation is available on GitHub.

* ICML 2024 Next Generation of Sequence Modeling Architectures Workshop

Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task

Jun 20, 2024

Alsu Sagirova, Mikhail Burtsev

Abstract:Even though Transformers are extensively used for Natural Language Processing tasks, especially for machine translation, they lack an explicit memory to store key concepts of processed texts. This paper explores the properties of the content of symbolic working memory added to the Transformer model decoder. Such working memory enhances the quality of model predictions in the machine translation task and works as a neural-symbolic representation of information that is important for the model to make correct translations. The study of memory content revealed that translated text keywords are stored in the working memory, pointing to the relevance of memory content to the processed text. Also, the diversity of tokens and parts of speech stored in memory correlates with the complexity of the corpora for the machine translation task.

* Cognitive Systems Research, Volume 75, 2022, Pages 16-24, ISSN 1389-0417
* 18 pages, 6 figures. Published in the journal Cognitive Systems Research 3 June 2022: https://www.sciencedirect.com/science/article/abs/pii/S1389041722000274

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Jun 14, 2024

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

Abstract:In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.

In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss

Feb 21, 2024

Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

Abstract:This paper addresses the challenge of processing long documents using generative transformer models. To evaluate different approaches, we introduce BABILong, a new benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. Our evaluation, which includes benchmarks for GPT-4 and RAG, reveals that common methods are effective only for sequences up to $10^4$ elements. In contrast, fine-tuning GPT-2 with recurrent memory augmentations enables it to handle tasks involving up to $11\times 10^6$ elements. This achievement marks a substantial leap, as it is by far the longest input processed by any neural network model to date, demonstrating a significant improvement in the processing capabilities for long sequences.

* 11M tokens, fix qa3 min facts per task in Table 1
