Grouped Query Attention
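This page collects recent papers on grouped-query attention (GQA) and related KV-cache-efficient attention variants. For orientation: GQA shares each key/value head across a group of query heads, shrinking the KV cache by the group factor relative to standard multi-head attention. The sketch below is a minimal illustration only, not the method of any paper listed here; the function name, tensor shapes, and `n_kv_heads` parameter are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_kv_heads):
    # q: (batch, n_q_heads, seq, head_dim)
    # k, v: (batch, n_kv_heads, seq, head_dim); requires n_q_heads % n_kv_heads == 0
    b, n_q_heads, s, d = q.shape
    group = n_q_heads // n_kv_heads  # query heads sharing one KV head
    # Broadcast each KV head across its group of query heads.
    k = k.repeat_interleave(group, dim=1)  # -> (batch, n_q_heads, seq, head_dim)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

# 8 query heads sharing 2 KV heads: the KV cache is 4x smaller than full MHA.
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 2, 16, 64)
v = torch.randn(2, 2, 16, 64)
out = grouped_query_attention(q, k, v, n_kv_heads=2)  # (2, 8, 16, 64)
```

Setting `n_kv_heads` equal to the number of query heads recovers standard multi-head attention, while `n_kv_heads = 1` gives multi-query attention; GQA interpolates between the two.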


Focus Directions Make Your Language Models Pay More Attention to Relevant Contexts

Mar 30, 2025

Cost-Optimal Grouped-Query Attention for Long-Context LLMs

Mar 12, 2025

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

Mar 03, 2025

CDKFormer: Contextual Deviation Knowledge-Based Transformer for Long-Tail Trajectory Prediction

Mar 16, 2025

From Slices to Sequences: Autoregressive Tracking Transformer for Cohesive and Consistent 3D Lymph Node Detection in CT Scans

Mar 11, 2025

Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs

Feb 20, 2025

RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

Feb 19, 2025

TransMLA: Multi-head Latent Attention Is All You Need

Feb 11, 2025

FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation

Feb 03, 2025

What's in a Query: Polarity-Aware Distribution-Based Fair Ranking

Feb 17, 2025