
Ion Stoica

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Nov 25, 2024

MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

Nov 18, 2024

Pie: Pooling CPU Memory for LLM Inference

Nov 14, 2024

SkyServe: Serving AI Models across Regions and Clouds with Spot Instances

Nov 03, 2024

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

Nov 02, 2024

Managing Bandwidth: The Key to Cloud-Assisted Autonomous Driving

Oct 21, 2024

How to Evaluate Reward Models for RLHF

Oct 18, 2024

JudgeBench: A Benchmark for Evaluating LLM-based Judges

Oct 16, 2024

Efficient LLM Scheduling by Learning to Rank

Aug 28, 2024

Post-Training Sparse Attention with Double Sparsity

Aug 11, 2024