Ion Stoica

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Nov 25, 2024
Via arXiv

MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

Nov 18, 2024
Via arXiv

Pie: Pooling CPU Memory for LLM Inference

Nov 14, 2024
Via arXiv

SkyServe: Serving AI Models across Regions and Clouds with Spot Instances

Nov 03, 2024
Via arXiv

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

Nov 02, 2024
Via arXiv

Managing Bandwidth: The Key to Cloud-Assisted Autonomous Driving

Oct 21, 2024
Via arXiv

How to Evaluate Reward Models for RLHF

Oct 18, 2024
Via arXiv

JudgeBench: A Benchmark for Evaluating LLM-based Judges

Oct 16, 2024
Via arXiv

Efficient LLM Scheduling by Learning to Rank

Aug 28, 2024
Via arXiv

Post-Training Sparse Attention with Double Sparsity

Aug 11, 2024
Via arXiv