
Shouyi Yin

Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling

Dec 24, 2025

cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution

Dec 23, 2025

PADE: A Predictor-Free Sparse Attention Accelerator via Unified Execution and Stage Fusion

Dec 16, 2025

WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip

Dec 13, 2025

MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts

Oct 14, 2025

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

Add code
Jun 12, 2025
Viaarxiv icon

Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels

Dec 24, 2024

Catch-Up Distillation: You Only Need to Train Once for Accelerating Sampling

May 21, 2023

HQNAS: Auto CNN deployment framework for joint quantization and architecture search

Oct 16, 2022

FAQS: Communication-efficient Federate DNN Architecture and Quantization Co-Search for personalized Hardware-aware Preferences

Oct 16, 2022