Shiyi Cao

Prism: Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving

May 06, 2025

WorldModelBench: Judging Video Generation Models As World Models

Feb 28, 2025

S*: Test Time Scaling for Code Generation

Feb 20, 2025

LLMs Can Easily Learn to Reason from Demonstrations; Structure, not content, is what matters!

Feb 11, 2025

Locality-aware Fair Scheduling in LLM Serving

Jan 24, 2025

NVILA: Efficient Frontier Visual Language Models

Dec 05, 2024

MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

Nov 18, 2024

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

Nov 02, 2024

GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism

Jun 24, 2024

Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models

Jun 06, 2024