Mark Kurtz

An Interpretable Latency Model for Speculative Decoding in LLM Serving

May 14, 2026

Linghao Kong, Megan Flynn, Michael Peng, Nir Shavit, Mark Kurtz, Alexandre Marques

Abstract:Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the behavior of SD in production serving systems remains poorly understood: request load varies over time, and effective batch size emerges from the serving system rather than being directly controlled or observed. In this work, we develop a simple and interpretable latency model for SD in LLM serving. We infer effective batch size from request rate using Little's Law and decompose per-request demand into load-independent and load-dependent components for prefill, drafting, and verification. We validate our model using extensive measurements from vLLM across verifier and drafter model sizes, prefill and decode lengths, request rates, draft lengths, and acceptance probabilities. The model accurately describes observed latency, explains why speedups often diminish as server load increases, and characterizes how draft length, acceptance rate, and verifier-drafter size shape latency across serving conditions, with implications for configuring SD in deployed systems. We further show how the framework extends to mixture of experts models, where sparse expert activation changes the effective service costs across load regimes. Together, our results provide a structured framework for understanding SD in real LLM serving systems.

* 10 pages, 8 figures
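
The abstract above infers effective batch size from request rate using Little's Law. As a minimal sketch of that one step only (not the paper's full latency model), Little's Law states that the mean number of requests in the system equals the arrival rate times the mean time each request spends in the system. The function name and the numbers below are hypothetical and chosen purely for illustration.

```python
def effective_batch_size(request_rate: float, mean_latency: float) -> float:
    """Little's Law, L = lambda * W.

    request_rate: mean arrival rate in requests per second (lambda).
    mean_latency: mean time a request spends in the system, in seconds (W).
    Returns the mean number of in-flight requests (L), read here as an
    estimate of the effective batch size. That reading is an assumption
    of this sketch.
    """
    if request_rate < 0 or mean_latency < 0:
        raise ValueError("request_rate and mean_latency must be non-negative")
    return request_rate * mean_latency


# Hypothetical load: 4 requests/s, each taking 2.5 s end to end.
print(effective_batch_size(request_rate=4.0, mean_latency=2.5))  # 10.0
```

Because latency itself tends to grow with batch size, a real model would likely need to solve for the two quantities jointly; this sketch treats mean latency as a measured input.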

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Nov 04, 2024

Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh

Abstract:Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

May 06, 2024

Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh (+2 more)

Abstract:Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning method and sparse pretraining of those models on a subset of the SlimPajama dataset mixed with a Python subset of The Stack dataset. We exhibit training acceleration due to sparsity on Cerebras CS-3 chips that closely matches theoretical scaling. In addition, we establish inference acceleration of up to 3x on CPUs by utilizing Neural Magic's DeepSparse engine and 1.7x on GPUs through Neural Magic's nm-vllm engine. The above gains are realized via sparsity alone, thus enabling further gains through additional use of quantization. Specifically, we show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x. We demonstrate these results across diverse, challenging tasks, including chat, instruction following, code generation, arithmetic reasoning, and summarization to prove their generality. This work paves the way for rapidly creating smaller and faster LLMs without sacrificing accuracy.

oBERTa: Improving Sparse Transfer Learning via improved initialization, distillation, and pruning regimes

Apr 04, 2023

Daniel Campos, Alexandre Marques, Mark Kurtz, ChengXiang Zhai

Abstract:In this paper, we introduce the range of oBERTa language models, an easy-to-use set of language models which allows Natural Language Processing (NLP) practitioners to obtain models that are between 3.8 and 24.3 times faster without expertise in model compression. Specifically, oBERTa extends existing work on pruning, knowledge distillation, and quantization and leverages frozen embeddings, improved distillation, and model initialization to deliver higher accuracy on a broad range of transfer tasks. In generating oBERTa, we explore how the highly optimized RoBERTa differs from BERT for pruning during pre-training and fine-tuning. We find RoBERTa less amenable to compression during fine-tuning. We explore the use of oBERTa on seven representative NLP tasks and find that the improved compression techniques allow a pruned oBERTa model to match the performance of BERTbase and exceed the performance of Prune OFA Large on the SQuAD v1.1 Question Answering dataset, despite being 8x and 2x faster in inference, respectively. We release our code, training regimes, and associated model to encourage broad usage and experimentation.

Sparse*BERT: Sparse Models are Robust

May 25, 2022

Daniel Campos, Alexandre Marques, Tuan Nguyen, Mark Kurtz, ChengXiang Zhai

Abstract:Large Language Models have become the core architecture upon which most modern natural language processing (NLP) systems build. These models can consistently deliver impressive accuracy and robustness across tasks and domains, but their high computational overhead can make inference difficult and expensive. To make the usage of these models less costly, recent work has explored leveraging structured and unstructured pruning, quantization, and distillation as ways to improve inference speed and decrease size. This paper studies how models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks. Our experimentation shows that models that are pruned during pretraining using general domain masked language models can transfer to novel domains and tasks without extensive hyperparameter exploration or specialized approaches. We demonstrate that our general sparse model Sparse*BERT can become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text. Moreover, we show that SparseBioBERT can match the quality of BioBERT with only 10% of the parameters.

The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

Mar 14, 2022

Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, Dan Alistarh

Abstract:Pre-trained Transformer-based language models have become a key building block for natural language processing (NLP) tasks. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, and structured and unstructured pruning, are known to be applicable to decrease model size and increase inference speed. In this context, this paper's contributions are two-fold. First, we present an in-depth study of the accuracy-compression trade-off for unstructured weight pruning in the context of BERT models, and introduce Optimal BERT Surgeon (O-BERT-S), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in terms of the compression/accuracy trade-off. Specifically, Optimal BERT Surgeon extends existing work on second-order pruning by allowing for pruning blocks of weights, and by being applicable at BERT scale. Second, we investigate the impact of this pruning method when compounding compression approaches for Transformer-based models, which allows us to combine state-of-the-art structured and unstructured pruning together with quantization, in order to obtain highly compressed, but accurate models. The resulting compression framework is powerful, yet general and efficient: we apply it to both the fine-tuning and pre-training stages of language tasks, to obtain state-of-the-art results on the accuracy-compression trade-off with relatively simple compression recipes. For example, we obtain 10x model size compression with < 1% relative drop in accuracy to the dense BERT-base, 10x end-to-end CPU-inference speedup with < 2% relative drop in accuracy, and 29x inference speedups with < 7.5% relative accuracy drop.

How Well Do Sparse Imagenet Models Transfer?

Dec 23, 2021

Eugenia Iofinova, Alexandra Peste, Mark Kurtz, Dan Alistarh

Abstract:Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream," specialized datasets. Generally, it is understood that more accurate models on the "upstream" dataset will provide better transfer accuracy "downstream". In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset, which have been pruned - that is, compressed by sparsifying their connections. Specifically, we consider transfer using unstructured pruned models obtained by applying several state-of-the-art pruning methods, including magnitude-based, second-order, re-growth and regularization approaches, in the context of twelve standard transfer tasks. In a nutshell, our study shows that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities, and, while doing so, can lead to significant inference and even training speedups. At the same time, we observe and analyze significant differences in the behaviour of different pruning methods.

* 19 pages, 8 figures. Latest update is a correction to Fig. 2
