Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jie Ye

PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference

Sep 04, 2025

Krishna Teja Chitty-Venkata, Jie Ye, Xian-He Sun, Anthony Kougkas, Murali Emani, Venkatram Vishwanath, Bogdan Nicolae

Figure 1 for PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference

Figure 2 for PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference

Figure 3 for PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference

Figure 4 for PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference

Abstract:KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured KV cache pruning strategy that enhances the memory efficiency of vLLM's PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction across Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating improved memory usage with better accuracy than baselines on long context tasks.

* Preprint

Via

Access Paper or Ask Questions

Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Oct 26, 2024

Avinash Maurya, Jie Ye, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

Figure 1 for Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Figure 2 for Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Figure 3 for Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Figure 4 for Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

Abstract:Transformers and large language models~(LLMs) have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is very expensive and often hits a ``memory wall'', i.e., even when using 3D parallelism (pipeline, tensor, data) and aggregating the memory of many GPUs, it is still not enough to hold the necessary data structures (model parameters, optimizer state, gradients, activations) in GPU memory. To compensate, state-of-the-art approaches offload the optimizer state, at least partially, to the host memory and perform hybrid CPU-GPU computations. However, the management of the combined host-GPU memory is often suboptimal and results in poor overlapping between data movements and computations. This leads to missed opportunities to simultaneously leverage the interconnect bandwidth and computational capabilities of CPUs and GPUs. In this paper, we leverage a key observation that the interleaving of the forward, backward and update phases generate fluctuations in the GPU memory utilization, which can be exploited to dynamically move a part of the optimizer state between the host and the GPU memory at each iteration. To this end, we design and implement \proj, a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU based on our proposed performance model that addresses the trade-off between data movement cost, acceleration on the GPUs vs the CPUs, and competition for shared resources. We integrate our approach with DeepSpeed and demonstrate 2.5$\times$ faster iterations over state-of-the-art approaches using extensive experiments.

Via

Access Paper or Ask Questions