Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Manlai Liang

ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models

Aug 25, 2025

Manlai Liang, Mandi Liu, Jiangzhou Ji, Huaijun Li, Haobo Yang, Yaohan He, Jinlong Li

Figure 1 for ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models

Figure 2 for ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models

Figure 3 for ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models

Figure 4 for ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models

Abstract:Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and full key cache in that specified layer. In particular, we propose a multi-pooling kernels allocating strategy in the token recalling process to maintain the completeness of semantics. Our approach not only reduces the prefilling complexity from $O(L^2)$ to $O(L)$, but also achieves performance comparable to or better than the full context in the long context scenarios. Without additional post training or operator development, ILRe can process a single $1M$ tokens request in less than half a minute (speedup $\approx 180\times$) and scores RULER-$1M$ benchmark of $\approx 79.8$ with model Llama-3.1-UltraLong-8B-1M-Instruct on a Huawei Ascend 910B NPU.

Via

Access Paper or Ask Questions

Lag-Relative Sparse Attention In Long Context Training

Jun 13, 2025

Manlai Liang, Wanyi Huang, Mandi Liu, Huaijun Li, Jinlong Li

Figure 1 for Lag-Relative Sparse Attention In Long Context Training

Figure 2 for Lag-Relative Sparse Attention In Long Context Training

Figure 3 for Lag-Relative Sparse Attention In Long Context Training

Figure 4 for Lag-Relative Sparse Attention In Long Context Training

Abstract:Large Language Models (LLMs) have made significant strides in natural language processing and generation, yet their ability to handle long-context input remains constrained by the quadratic complexity of attention computation and linear-increasing key-value memory footprint. To reduce computational costs and memory, key-value cache compression techniques are commonly applied at inference time, but this often leads to severe performance degradation, as models are not trained to handle compressed context. Although there are more sophisticated compression methods, they are typically unsuitable for post-training because of their incompatibility with gradient-based optimization or high computation overhead. To fill this gap with no additional parameter and little computation overhead, we propose Lag-Relative Sparse Attention(LRSA) anchored by the LagKV compression method for long context post-training. Our method performs chunk-by-chunk prefilling, which selects the top K most relevant key-value pairs in a fixed-size lagging window, allowing the model to focus on salient historical context while maintaining efficiency. Experimental results show that our approach significantly enhances the robustness of the LLM with key-value compression and achieves better fine-tuned results in the question-answer tuning task.

Via

Access Paper or Ask Questions

LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important

Apr 07, 2025

Manlai Liang, JiaMing Zhang, Xiong Li, Jinlong Li

Abstract:The increasing size of the Key-Value (KV) cache during the Large Language Models long-context inference is the main obstacle for its balance between the deployment cost and task accuracy. To reduce the KV cache size in such scenarios, most previous efforts leveraged on the attention weight to evict non-critical cache tokens. But there is a trade-off in those methods, they usually require major modifiation of the inference infrastructure and significant computation overhead. Base on the fact that the Large Lanuage models are autoregresssive models, we propose {\it LagKV}, a KV allocation strategy only relying on straight forward comparison among KV themself. It is a totally attention free method which offers easy integration to the main stream inference platform and comparable performance comparing to other complicated KV compression methods. Results on LongBench and PasskeyRetrieval show that, our approach achieves nearly zero loss when the ratio is $2\times$ and $\approx 90\%$ of the original model performance for $8\times$. Especially in the 64-digit passkey retrieval task, our mehod outperforms the attention weight based method $H_2O$ over $60\%$ with same compression ratios. Our code is available at \url{https://github.com/AI-Lab-China-Merchants-Bank/LagKV}.

Via

Access Paper or Ask Questions