Title: PagedEviction: Structured Block-wise KV Cache Pruning for Efficient Large Language Model Inference

Sep 04, 2025

Krishna Teja Chitty-Venkata, Jie Ye, Xian-He Sun, Anthony Kougkas, Murali Emani, Venkatram Vishwanath, Bogdan Nicolae

Abstract: KV caching significantly improves the efficiency of Large Language Model (LLM) inference by storing attention states from previously processed tokens, enabling faster generation of subsequent tokens. However, as sequence length increases, the KV cache quickly becomes a major memory bottleneck. To address this, we propose PagedEviction, a novel fine-grained, structured KV cache pruning strategy that enhances the memory efficiency of vLLM's PagedAttention. Unlike existing approaches that rely on attention-based token importance or evict tokens across different vLLM pages, PagedEviction introduces an efficient block-wise eviction algorithm tailored for paged memory layouts. Our method integrates seamlessly with PagedAttention without requiring any modifications to its CUDA attention kernels. We evaluate PagedEviction across Llama-3.1-8B-Instruct, Llama-3.2-1B-Instruct, and Llama-3.2-3B-Instruct models on the LongBench benchmark suite, demonstrating reduced memory usage with better accuracy than baselines on long-context tasks.
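To make the idea of block-wise eviction in a paged layout concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the scoring rule or the eviction schedule, so the per-token `score` (a stand-in for any importance proxy that does not need attention weights), the mean-score block ranking, the `BLOCK_SIZE` value, and the `Block` and `PagedCache` names are all assumptions made for illustration. The one property taken from the abstract is that eviction removes a whole page at a time instead of scattering holes across pages.

```python
from dataclasses import dataclass, field
from typing import List

BLOCK_SIZE = 4  # tokens per page; illustrative value, not taken from the paper


@dataclass
class Block:
    """One fixed-size page of the KV cache (token ids stand in for K/V tensors)."""
    token_ids: List[int] = field(default_factory=list)
    scores: List[float] = field(default_factory=list)  # hypothetical importance proxy

    def is_full(self) -> bool:
        return len(self.token_ids) == BLOCK_SIZE

    def mean_score(self) -> float:
        return sum(self.scores) / len(self.scores)


class PagedCache:
    """Toy paged cache that keeps at most `max_blocks` pages (assumed budget)."""

    def __init__(self, max_blocks: int):
        self.max_blocks = max_blocks
        self.blocks: List[Block] = []

    def append(self, token_id: int, score: float) -> None:
        # Open a new page only when the last one is full.
        if not self.blocks or self.blocks[-1].is_full():
            self.blocks.append(Block())
            if len(self.blocks) > self.max_blocks:
                self._evict_one_block()
        self.blocks[-1].token_ids.append(token_id)
        self.blocks[-1].scores.append(score)

    def _evict_one_block(self) -> None:
        # Only full pages are candidates; the freshly opened last page is kept.
        candidates = self.blocks[:-1]
        victim = min(range(len(candidates)), key=lambda i: candidates[i].mean_score())
        del self.blocks[victim]  # drop the whole page, leaving no partial pages

    def token_ids(self) -> List[int]:
        return [t for b in self.blocks for t in b.token_ids]


cache = PagedCache(max_blocks=2)
for t in range(4):
    cache.append(t, 1.0)      # low-scoring page
for t in range(4, 8):
    cache.append(t, 5.0)      # high-scoring page
cache.append(8, 3.0)          # exceeds the budget, so one full page is evicted
print(cache.token_ids())
```

Because a whole page is dropped at once, every surviving page stays either full or is the single partially filled tail page. That is the kind of layout a paged attention kernel already expects, which is consistent with the abstract's claim that no kernel changes are needed.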

* Preprint
