Renji Zhang

KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference

Mar 17, 2025

Huan Yang, Renji Zhang, Deyu Zhang

Abstract: This paper presents KVShare, a multi-user Key-Value (KV) cache sharing technique based on semantic similarity, designed to improve the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Existing prefix caching requires strict text prefix matching, and semantic caching sacrifices response diversity; KVShare addresses both limitations by achieving fine-grained KV cache reuse through semantic alignment algorithms and differential editing operations. Experiments on real-world user conversation datasets show that KVShare improves KV cache hit rates by over 60% while maintaining output quality comparable to full computation (no significant degradation in BLEU and ROUGE-L metrics). The approach reduces GPU resource consumption and is applicable to scenarios with repetitive queries, such as healthcare and education.
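
To make the idea of "differential editing" concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a new prompt has already been paired with a similar cached prompt, and it uses exact token matching from Python's standard `difflib` as a stand-in for the paper's semantic alignment. The function name `plan_reuse` and the example prompts are hypothetical. It also ignores positional encodings and cross-token attention effects, which a real KV cache reuse scheme would have to account for.

```python
from difflib import SequenceMatcher


def plan_reuse(cached_tokens, new_tokens):
    """Split the new prompt into positions that could reuse cached KV
    entries and positions that must be recomputed.

    Returns (reuse, recompute), where reuse maps a new-token index to the
    index of the matching cached token, and recompute lists new-token
    indices with no match. Hypothetical helper for illustration only.
    """
    matcher = SequenceMatcher(a=cached_tokens, b=new_tokens, autojunk=False)
    reuse = {}
    recompute = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching run: pair each new position with its cached position.
            for offset in range(j2 - j1):
                reuse[j1 + offset] = i1 + offset
        elif tag in ("replace", "insert"):
            # Tokens that differ or are new need fresh KV computation.
            recompute.extend(range(j1, j2))
        # "delete" covers tokens present only in the cached prompt; skip them.
    return reuse, recompute


# Assumed example prompts, split on whitespace instead of a real tokenizer.
cached = "what are the symptoms of the flu in adults".split()
new = "what are the symptoms of a cold in adults".split()

reuse, recompute = plan_reuse(cached, new)
hit_rate = len(reuse) / len(new)
print(f"reusable positions: {sorted(reuse)}")
print(f"positions to recompute: {recompute}")
print(f"token-level hit rate: {hit_rate:.2f}")
```

Unlike strict prefix matching, which would stop reusing at the first differing token, an edit-based plan like this one can also reuse the matching run after the changed span, which is the kind of fine-grained reuse the abstract describes.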
