Zhihao Jia

Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs
Dec 22, 2025

Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning
Aug 09, 2025

Nexus: Taming Throughput-Latency Tradeoff in LLM Serving via Efficient GPU Sharing
Jul 09, 2025

DDO: Dual-Decision Optimization via Multi-Agent Collaboration for LLM-Based Medical Consultation
May 24, 2025

SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning
Apr 10, 2025

AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding
Jan 21, 2025

Communication Bounds for the Distributed Experts Problem
Jan 06, 2025

SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference
Nov 07, 2024

MagicPIG: LSH Sampling for Efficient LLM Generation
Oct 21, 2024

TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
Oct 07, 2024