
Ion Stoica

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

Nov 25, 2024

MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

Nov 18, 2024

Pie: Pooling CPU Memory for LLM Inference

Nov 14, 2024

SkyServe: Serving AI Models across Regions and Clouds with Spot Instances

Nov 03, 2024

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

Nov 02, 2024

Managing Bandwidth: The Key to Cloud-Assisted Autonomous Driving

Oct 21, 2024

How to Evaluate Reward Models for RLHF

Oct 18, 2024

JudgeBench: A Benchmark for Evaluating LLM-based Judges

Oct 16, 2024

Efficient LLM Scheduling by Learning to Rank

Aug 28, 2024

Post-Training Sparse Attention with Double Sparsity

Aug 11, 2024