Susav Shrestha

Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity

May 20, 2025

Susav Shrestha, Brad Settlemyer, Nikoli Dryden, Narasimha Reddy

Abstract: Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes because the union of active neurons quickly approaches dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to Attention layers as we scale batch size and sequence length. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly expensive at scale, while its head sparsity remains stable and batch-invariant. We develop hardware-efficient, sparsity-aware GPU kernels for selective MLP and Attention computations, delivering up to \(2.2\times\) end-to-end speedups for models like OPT and LLaMA-2 & 3 across various batch sizes and sequence lengths without compromising accuracy. To our knowledge, this is the first work to demonstrate that contextual sparsity can scale effectively to large batch sizes, delivering substantial inference acceleration with minimal changes and making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems. Our code is available at: https://github.com/susavlsh10/Polar-Sparsity.
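The claim that the union of active neurons approaches dense computation can be illustrated with a small back-of-the-envelope sketch. It is not taken from the paper: it assumes, purely for illustration, that each token activates any given MLP neuron independently with the same probability. The function name `union_active_fraction` and the 10% per-token density are hypothetical, and real activations are correlated, so measured values will differ.

```python
def union_active_fraction(per_token_density: float, batch_size: int) -> float:
    """Expected fraction of neurons active for at least one token in a batch.

    Assumption (illustrative only): each token activates a neuron
    independently with probability `per_token_density`.
    """
    if not 0.0 <= per_token_density <= 1.0:
        raise ValueError("per_token_density must be in [0, 1]")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # A neuron stays inactive only if every token in the batch skips it.
    return 1.0 - (1.0 - per_token_density) ** batch_size


if __name__ == "__main__":
    # Hypothetical 10% per-token density across growing batch sizes.
    for batch_size in (1, 4, 16, 64, 256):
        fraction = union_active_fraction(0.10, batch_size)
        print(f"batch={batch_size:4d}  union active fraction={fraction:.4f}")
```

Under this simplified model the union grows quickly with batch size, which is consistent with the abstract's observation that MLP sparsity vanishes under batching.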

ESPN: Memory-Efficient Multi-Vector Information Retrieval

Dec 09, 2023

Susav Shrestha, Narasimha Reddy, Zongwang Li

Abstract: Recent advances in large language models have demonstrated remarkable effectiveness in information retrieval (IR) tasks. While many neural IR systems encode queries and documents into single-vector representations, multi-vector models improve retrieval quality by producing multi-vector representations and enabling similarity searches at the granularity of individual tokens. However, these models increase the memory and storage requirements of retrieval indices by an order of magnitude. This growth in index size makes scaling multi-vector IR models increasingly difficult because of their substantial memory demands. We introduce Embedding from Storage Pipelined Network (ESPN), in which we offload the entire re-ranking embedding tables to SSDs and reduce the memory requirements by 5-16x. We design a software prefetcher with hit rates exceeding 90%, improving SSD-based retrieval by up to 6.4x, and demonstrate that we can maintain near-memory levels of query latency even for large query batch sizes.
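A rough sketch of why token-level representations inflate the index is given below. All numbers and names in it are hypothetical (the corpus size, the 50 vectors per document, the 128-dimensional 2-byte embeddings, and the helper `index_size_bytes`); none of them come from the paper.

```python
def index_size_bytes(num_docs: int, vectors_per_doc: int, dim: int,
                     bytes_per_value: int = 2) -> int:
    """Size of a flat embedding table: one `dim`-wide vector per stored unit."""
    return num_docs * vectors_per_doc * dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical corpus and embedding shape, chosen only for illustration.
    num_docs, dim = 1_000_000, 128
    single = index_size_bytes(num_docs, vectors_per_doc=1, dim=dim)
    multi = index_size_bytes(num_docs, vectors_per_doc=50, dim=dim)
    print(f"single-vector index: {single / 1e9:.2f} GB")
    print(f"multi-vector index:  {multi / 1e9:.2f} GB")
    print(f"growth factor:       {multi // single}x")
```

With one vector per token instead of one per document, the table grows linearly with the number of stored vectors per document, which is the kind of growth that motivates moving the embedding tables off main memory.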

* 10 pages, 10 figures
