Xupeng Miao

AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding

Jan 21, 2025

Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, Qinghan Chen, Shuhuai Lin, April Yang, Zhihao Zhang, Zhuoming Chen (+3 more)

Abstract: This paper introduces AdaServe, the first LLM serving system to support SLO customization through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to predict the speculative accuracy of tokens and employs a theoretically optimal algorithm to construct token trees for verification. To accommodate diverse SLO requirements without compromising throughput, AdaServe employs a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints while optimizing throughput. Comprehensive evaluations demonstrate that AdaServe achieves up to 73% higher SLO attainment and 74% higher goodput compared to state-of-the-art systems. These results underscore AdaServe's potential to enhance the efficiency and adaptability of LLM deployments across varied application scenarios.
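To make the idea of building a token tree from draft-model probabilities more concrete, here is a minimal sketch. It is not AdaServe's algorithm: it is a generic best-first expansion that keeps the `budget` candidate paths with the highest path probability, where the path probability (the product of draft probabilities along the path) stands in for the predicted chance that the path is accepted. The names `build_token_tree`, `TOY_DRAFT`, and `toy_next_token_probs`, and all probabilities, are hypothetical.

```python
import heapq

def build_token_tree(next_token_probs, budget):
    """Pick up to `budget` token paths with the highest path probability.

    `next_token_probs(prefix)` is an assumed callback that returns a dict
    {token: probability} for the draft model's next token after `prefix`.
    """
    # heapq is a min-heap, so store negated probabilities to pop the largest.
    heap = [(-p, (tok,)) for tok, p in next_token_probs(()).items()]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < budget:
        neg_prob, path = heapq.heappop(heap)
        selected.append((path, -neg_prob))
        # A child's path probability is the parent's times its own probability,
        # so a child can never be selected before its parent.
        for tok, p in next_token_probs(path).items():
            heapq.heappush(heap, (neg_prob * p, path + (tok,)))
    return selected

# Hypothetical draft-model outputs for a two-level toy vocabulary.
TOY_DRAFT = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("a",): {"cat": 0.9, "dog": 0.1},
}

def toy_next_token_probs(prefix):
    return TOY_DRAFT.get(prefix, {})

tree = build_token_tree(toy_next_token_probs, budget=3)
paths = [path for path, _ in tree]
print(paths)
```

Because every selected node's parent is selected first, the chosen paths always form a connected tree rooted at the current prefix, which is what a verifier would need.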

Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management

Sep 05, 2024

Yujie Wang, Shenhan Zhu, Fangcheng Fu, Xupeng Miao, Jie Zhang, Juan Zhu, Fan Hong, Yong Li, Bin Cui

Abstract: Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with a unified base model structure and several specialized model components. However, the development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems. Due to the sophisticated model architecture and the heterogeneous workloads of different ML tasks and data modalities, training these models usually requires massive GPU resources and suffers from sub-optimal system efficiency. In this paper, we investigate how to achieve high-performance training of large-scale MT MM models through data heterogeneity-aware model management optimization. The key idea is to decompose the model execution into stages and address the joint optimization problem sequentially, including both heterogeneity-aware workload parallelization and dependency-driven execution scheduling. Based on this, we build a prototype system and evaluate it on various large MT MM models. Experiments demonstrate the superior performance and efficiency of our system, with a speedup of up to 71% compared to state-of-the-art training systems.

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

Jun 24, 2024

Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao (+4 more)

Abstract: Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into multiple stages, which concurrently perform DNN training for different micro-batches in a pipeline fashion. However, existing pipeline-parallel approaches only consider sequential pipeline stages and thus ignore the topology of a DNN, resulting in missed model-parallel opportunities. This paper presents graph pipeline parallelism (GPP), a new pipeline-parallel scheme that partitions a DNN into pipeline stages whose dependencies are identified by a directed acyclic graph. GPP generalizes existing sequential pipeline parallelism and preserves the inherent topology of a DNN to enable concurrent execution of computationally independent operators, resulting in reduced memory requirements and improved GPU performance. In addition, we develop GraphPipe, a distributed system that exploits GPP strategies to enable performant and scalable DNN training. GraphPipe partitions a DNN into a graph of stages, optimizes micro-batch schedules for these stages, and parallelizes DNN training using the discovered GPP strategies. Evaluation on a variety of DNNs shows that GraphPipe outperforms existing pipeline-parallel systems such as PipeDream and Piper by up to 1.6X. GraphPipe also reduces the search time by 9-21X compared to PipeDream and Piper.
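The difference between a sequential pipeline and a DAG of stages can be illustrated with a small sketch. This is not GraphPipe's partitioner or scheduler; it only groups the stages of a hypothetical two-branch model into topological levels, so that stages in the same level have no dependency on each other and could run concurrently. The stage names and the `topological_levels` helper are assumptions made for illustration.

```python
from collections import defaultdict

def topological_levels(deps):
    """Group stages into levels; stages in one level are mutually independent.

    `deps` maps each stage to the list of stages it depends on. Every stage,
    including those without dependencies, must appear as a key.
    """
    indegree = {stage: len(parents) for stage, parents in deps.items()}
    children = defaultdict(list)
    for stage, parents in deps.items():
        for parent in parents:
            children[parent].append(stage)

    level = [stage for stage, n in indegree.items() if n == 0]
    levels = []
    while level:
        levels.append(sorted(level))
        next_level = []
        for stage in level:
            for child in children[stage]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_level.append(child)
        level = next_level
    return levels

# Hypothetical model: two independent encoder branches feed a fusion stage.
stage_deps = {
    "text_encoder": [],
    "image_encoder": [],
    "fusion": ["text_encoder", "image_encoder"],
    "head": ["fusion"],
}

levels = topological_levels(stage_deps)
print(levels)
```

A strictly sequential pipeline would order these four stages one after another, giving a pipeline depth of four; the DAG view exposes that the two encoders are independent, so the depth in this toy case is three.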

Optimal Kernel Orchestration for Tensor Programs with Korch

Jun 13, 2024

Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai

Abstract: Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel orchestration. This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7x on V100 GPUs and up to 1.6x on A100 GPUs. Korch is publicly available at https://github.com/humuyan/Korch.

* Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 3 (2024) 755-769
* Fix some typos in the ASPLOS version

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

Jun 03, 2024

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, Rashmi Vinayak

Abstract: This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem for a directed, weighted graph, whose nodes represent GPU instances and edges capture both GPU and network heterogeneity through their capacities. Helix then uses a mixed integer linear programming (MILP) algorithm to discover highly optimized strategies to serve LLMs. This approach allows Helix to jointly optimize model placement and request scheduling, two highly entangled tasks in heterogeneous LLM serving. Our evaluation on several heterogeneous cluster settings ranging from 24 to 42 GPU nodes shows that Helix improves serving throughput by up to 2.7× and reduces prompting and decoding latency by up to 2.8× and 1.3×, respectively, compared to the best existing approaches.
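As a rough illustration of the max-flow view described in this abstract, the sketch below runs a textbook Edmonds-Karp max-flow on a tiny capacity graph. It is not Helix's graph construction or its MILP-based planner: the node names, the capacities (which could be read as tokens per second), and the `max_flow` helper are all hypothetical.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp max-flow on a dict-of-dicts capacity graph."""
    # Residual graph: copy forward edges and add zero-capacity reverse edges.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {})
            residual[v].setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total

        # Find the bottleneck capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push flow along the path and update the residual capacities.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

# Hypothetical cluster: edge capacities combine GPU and network limits.
cluster = {
    "source": {"gpu_a": 10, "gpu_b": 5},
    "gpu_a": {"gpu_c": 4, "gpu_d": 8},
    "gpu_b": {"gpu_c": 5},
    "gpu_c": {"sink": 7},
    "gpu_d": {"sink": 6},
}

throughput = max_flow(cluster, "source", "sink")
print(throughput)
```

In this toy graph the two edges into the sink form the limiting cut, so adding capacity upstream would not raise the achievable flow.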

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

Feb 29, 2024

Xupeng Miao, Gabriele Oliaro, Xinhao Cheng, Mengdi Wu, Colin Unger, Zhihao Jia

Abstract: Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutilized, leading to inefficiencies. To address this problem, we present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration. Our system leverages the complementary nature of these two tasks and utilizes shared GPU resources to run them jointly, using a method called co-serving. To achieve this, FlexLLM introduces a novel token-level finetuning mechanism, which breaks down the finetuning computation of a sequence into smaller token-level computations and uses dependent parallelization and graph pruning, two static compilation optimizations, to minimize the memory overhead and latency for co-serving. Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36% while maintaining a low inference latency and improving finetuning throughput. For example, under a heavy inference workload, FlexLLM can still preserve more than 80% of the peak finetuning throughput, whereas existing systems cannot make any progress with finetuning. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow.

Generative Dense Retrieval: Memory Can Be a Burden

Jan 19, 2024

Peiwen Yuan, Xinglin Wang, Shaoxiong Feng, Boyuan Pan, Yiwei Li, Heda Wang, Xupeng Miao, Kan Li

Abstract: Generative Retrieval (GR), which autoregressively decodes relevant document identifiers given a query, has been shown to perform well on small-scale corpora. By memorizing the document corpus with model parameters, GR implicitly achieves deep interaction between query and document. However, such a memorizing mechanism faces three drawbacks: (1) poor memory accuracy for fine-grained features of documents; (2) memory confusion that gets worse as the corpus size increases; (3) huge memory update costs for new documents. To alleviate these problems, we propose the Generative Dense Retrieval (GDR) paradigm. Specifically, GDR first uses the limited memory volume to achieve inter-cluster matching from query to relevant document clusters. A memorizing-free matching mechanism from Dense Retrieval (DR) is then introduced to conduct fine-grained intra-cluster matching from clusters to relevant documents. The coarse-to-fine process maximizes the advantages of GR's deep interaction and DR's scalability. In addition, we design a cluster identifier construction strategy to facilitate corpus memory and a cluster-adaptive negative sampling strategy to enhance the intra-cluster mapping ability. Empirical results show that GDR obtains an average improvement of 3.0 R@100 on the NQ dataset under multiple settings and has better scalability.

* EACL 2024 main

Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

Jan 13, 2024

Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong Jiang, Zhihao Jia

Abstract: Finetuning large language models (LLMs) has been empirically effective on a variety of downstream tasks. Existing approaches to finetuning an LLM either focus on parameter-efficient finetuning, which only updates a small number of trainable parameters, or attempt to reduce the memory footprint during the training phase of the finetuning. Typically, the memory footprint during finetuning stems from three contributors: model weights, optimizer states, and intermediate activations. However, existing works still require considerable memory, and none can simultaneously mitigate the memory footprint of all three sources. In this paper, we present Quantized Side Tuning (QST), which enables memory-efficient and fast finetuning of LLMs by operating through a dual-stage process. First, QST quantizes an LLM's model weights to 4 bits to reduce the memory footprint of the LLM's original weights; QST also introduces a side network separated from the LLM, which utilizes the hidden states of the LLM to make task-specific predictions. Using a separate side network avoids performing backpropagation through the LLM, thus reducing the memory requirement of the intermediate activations. Furthermore, QST leverages several low-rank adaptors and gradient-free downsample modules to significantly reduce the trainable parameters, so as to save the memory footprint of the optimizer states. Experiments show that QST can reduce the total memory footprint by up to 2.3× and speed up the finetuning process by up to 3× while achieving competitive performance compared with the state-of-the-art. When it comes to full finetuning, QST can reduce the total memory footprint by up to 7×.
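The memory contributors named in this abstract can be put into rough numbers with a back-of-the-envelope sketch. The figures below are illustrative assumptions, not results from the paper: a hypothetical 7-billion-parameter model, a 16-bit baseline for weights, a hypothetical side network with 1% as many trainable parameters, and an Adam-style optimizer that keeps two 32-bit values per trainable parameter. Quantization metadata such as scales and zero points, and activation memory, are ignored.

```python
GIB = 2**30

def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB

def adam_state_gib(trainable_params, bytes_per_value=4):
    """Adam keeps two moment estimates per trainable parameter."""
    return trainable_params * 2 * bytes_per_value / GIB

num_params = 7_000_000_000   # hypothetical model size
side_params = 70_000_000     # hypothetical side network (1% of the model)

weights_16bit = weight_memory_gib(num_params, 16)
weights_4bit = weight_memory_gib(num_params, 4)
adam_full = adam_state_gib(num_params)    # optimizer states, full finetuning
adam_side = adam_state_gib(side_params)   # optimizer states, side network only

print(f"weights: {weights_16bit:.1f} GiB at 16 bits, {weights_4bit:.1f} GiB at 4 bits")
print(f"Adam states: {adam_full:.1f} GiB (full) vs {adam_side:.2f} GiB (side network)")
```

Under these assumptions, 4-bit storage cuts weight memory by a factor of four, and optimizer-state memory scales directly with the number of trainable parameters, which is why shrinking the trainable part matters.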

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

Dec 23, 2023

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia

Abstract: In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby reshaping the future of AI.

Experimental Analysis of Large-scale Learnable Vector Storage Compression

Nov 27, 2023

Hailin Zhang, Penghao Zhao, Xupeng Miao, Yingxia Shao, Zirui Liu, Tong Yang, Bin Cui

Abstract: Learnable embedding vectors are among the most important building blocks in machine learning and are widely used in various database-related domains. However, the high dimensionality of sparse data in recommendation tasks and the huge volume of corpus in retrieval-related tasks lead to a large memory consumption of the embedding table, which poses a great challenge to the training and deployment of models. Recent research has proposed various methods to compress the embeddings at the cost of a slight decrease in model quality or the introduction of other overheads. Nevertheless, the relative performance of these methods remains unclear. Existing experimental comparisons only cover a subset of these methods and focus on limited metrics. In this paper, we perform a comprehensive comparative analysis and experimental evaluation of embedding compression. We introduce a new taxonomy that categorizes these techniques based on their characteristics and methodologies, and further develop a modular benchmarking framework that integrates 14 representative methods. Under a uniform test environment, our benchmark fairly evaluates each approach, presents their strengths and weaknesses under different memory budgets, and recommends the best method based on the use case. In addition to providing useful guidelines, our study also uncovers the limitations of current methods and suggests potential directions for future research.
