Department of CSE, IIT Bhilai, India
Abstract: Attention is the dominant source of latency during long-context LLM inference, an increasingly common workload with reasoning models and RAG. We propose Kascade, a training-free sparse attention method that leverages two known observations: 1) post-softmax attention is intrinsically sparse, and 2) the identity of high-weight keys is stable across nearby layers. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in the intermediate reuse layers. The anchor layers are selected algorithmically, via a dynamic-programming objective that maximizes cross-layer similarity over a development set, allowing easy deployment across models. The method incorporates efficient implementation constraints (e.g., tile-level operations) in both prefill and decode attention. Top-k selection and reuse in Kascade are head-aware, and our experiments show that this is critical for high accuracy. Kascade achieves up to a 4.1x speedup in decode attention and a 2.2x speedup in prefill attention over a FlashAttention-3 baseline on H100 GPUs, while closely matching dense attention accuracy on long-context benchmarks such as LongBench and AIME-24.
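
As a rough illustration of the anchor/reuse mechanism described in the abstract (not Kascade's actual kernels, which operate at tile granularity), the Python sketch below shows head-aware Top-k selection in an anchor layer and index reuse in a later layer for a single decode step. The function and variable names (decode_attention, anchor_layers, topk_cache, k) are illustrative assumptions, not the paper's API.

```python
# Minimal sketch of head-aware Top-k selection in anchor layers and index
# reuse in later layers, for one decode step. Illustrative only.
import torch
import torch.nn.functional as F

def decode_attention(q, K, V, layer_idx, anchor_layers, topk_cache, k=64):
    """q: (heads, d); K, V: (heads, seq, d). Assumes the first layer is an anchor."""
    d = q.shape[-1]
    if layer_idx in anchor_layers:
        # Anchor layer: full scores, exact per-head Top-k indices, cache them.
        scores = torch.einsum("hd,hsd->hs", q, K) / d ** 0.5
        idx = scores.topk(min(k, K.shape[1]), dim=-1).indices        # (heads, k)
        topk_cache[layer_idx] = idx
        sel_scores = scores.gather(-1, idx)
    else:
        # Reuse layer: only score the keys selected by the nearest earlier anchor.
        idx = topk_cache[max(a for a in anchor_layers if a < layer_idx)]
        sel_K = K.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))     # (heads, k, d)
        sel_scores = torch.einsum("hd,hkd->hk", q, sel_K) / d ** 0.5
    # Sparse attention restricted to the selected keys, per head.
    probs = F.softmax(sel_scores, dim=-1)
    sel_V = V.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
    return torch.einsum("hk,hkd->hd", probs, sel_V)
```

The savings in reuse layers come from scoring only the k cached keys per head instead of the full sequence; anchor layers pay the full cost but are few by construction.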




Abstract: Several distributed frameworks have been developed to scale Graph Neural Networks (GNNs) to billion-node graphs. On several benchmarks, we observe that the graph partitions generated by these frameworks exhibit heterogeneous data distributions and class imbalance, which hurts convergence and results in lower performance than centralized implementations. We address these challenges holistically and develop techniques that reduce training time and improve accuracy. We develop an Edge-Weighted partitioning technique that improves the micro-averaged F1 score (accuracy) by minimizing the total entropy. Furthermore, we add an asynchronous personalization phase that adapts each compute host's model to its local data distribution. We also design a class-balanced sampler that considerably speeds up convergence. We implemented our algorithms on the DistDGL framework and observed that our training techniques scale much better than the existing training approach: we achieved a 2-3x speedup in training time and a 4% average improvement in micro-F1 score on 5 large graph benchmarks compared to the standard baselines.
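
As a rough illustration of the class-balanced sampling idea mentioned in the abstract (not the paper's DistDGL implementation), the sketch below draws seed-node mini-batches with near-equal per-class counts from a partition's local training nodes, oversampling minority classes by wrapping around. The function name class_balanced_batches and its arguments are assumptions for illustration.

```python
# Minimal sketch of a class-balanced seed-node sampler for one partition.
# Illustrative only; not DistDGL's actual sampler API.
import torch

def class_balanced_batches(labels: torch.Tensor, batch_size: int):
    """Yield seed-node index batches with near-equal per-class counts.

    labels: 1-D tensor of class ids for the local partition's training nodes.
    """
    classes = labels.unique()
    per_class = max(1, batch_size // len(classes))
    # Shuffle node indices within each class once per epoch.
    buckets = [torch.nonzero(labels == c, as_tuple=True)[0] for c in classes]
    buckets = [b[torch.randperm(len(b))] for b in buckets]
    n_batches = max(len(b) for b in buckets) // per_class
    for i in range(n_batches):
        batch = []
        for b in buckets:
            # Wrap around for minority classes (i.e., oversample them).
            sel = torch.arange(i * per_class, (i + 1) * per_class) % len(b)
            batch.append(b[sel])
        yield torch.cat(batch)
```

Batches produced this way would feed a standard neighborhood sampler; equalizing per-class seed counts is one simple way to counter the class imbalance introduced by partitioning.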