Zhihao Jia

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

Jun 24, 2024

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

Jun 04, 2024

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

Jun 03, 2024

A Multi-Level Superoptimizer for Tensor Programs

May 09, 2024

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

Feb 29, 2024

Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

Feb 29, 2024

Accelerating Retrieval-Augmented Language Model Serving with Speculation

Jan 25, 2024

Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

Jan 13, 2024

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

Dec 23, 2023

SpotServe: Serving Generative Large Language Models on Preemptible Instances

Nov 27, 2023