Wulong Liu

Faster and Better LLMs via Latency-Aware Test-Time Scaling

May 26, 2025

PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval

May 23, 2025

TrimR: Verifier-based Training-Free Thinking Compression for Efficient Test-Time Scaling

May 22, 2025

Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

May 07, 2025

Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs

Apr 10, 2025

FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression

Feb 22, 2025

KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference

Feb 06, 2025

AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference

Feb 06, 2025

CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference

Feb 06, 2025

MixPE: Quantization and Hardware Co-design for Efficient LLM Inference

Nov 25, 2024