Jonah Yi

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization

May 07, 2024

Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava

Abstract: Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments reveal that CQ outperforms or is competitive with existing baselines in preserving model quality. Furthermore, we demonstrate that CQ can preserve model quality with KV cache quantized down to 1-bit.
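
The entropy observation in this abstract can be illustrated with a small sketch. It is not the authors' code or data: the two "channels" are synthetic correlated Gaussians, and the 16-bin discretization and the helper names `entropy_bits` and `discretize` are assumptions made for illustration only.

```python
import numpy as np

def entropy_bits(labels):
    """Empirical Shannon entropy (in bits) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def discretize(x, bins=16):
    """Map a real-valued channel to integer bin ids in [0, bins - 1]."""
    edges = np.linspace(x.min(), x.max(), bins + 1)[1:-1]  # interior edges only
    return np.digitize(x, edges)

rng = np.random.default_rng(0)
n = 20000

# Synthetic, strongly correlated channels (an assumption; not real KV activations).
base = rng.normal(size=n)
ch0 = base + 0.1 * rng.normal(size=n)
ch1 = base + 0.1 * rng.normal(size=n)

q0, q1 = discretize(ch0), discretize(ch1)

# Cost of coding each channel separately vs. coding the pair jointly.
sum_marginal = entropy_bits(q0) + entropy_bits(q1)
joint = entropy_bits(q0 * 16 + q1)  # one unique code per (q0, q1) pair

print(f"sum of marginal entropies: {sum_marginal:.3f} bits")
print(f"joint entropy:             {joint:.3f} bits")
```

For dependent channels the joint entropy falls below the sum of the marginals, which is the gap that coupling channels together aims to exploit.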

CAPS: A Practical Partition Index for Filtered Similarity Search

Aug 29, 2023

Gaurav Gupta, Jonah Yi, Benjamin Coleman, Chen Luo, Vihan Lakshman, Anshumali Shrivastava

Abstract: With the surging popularity of approximate near-neighbor search (ANNS), driven by advances in neural representation learning, the ability to serve queries accompanied by a set of constraints has become an area of intense interest. While the community has recently proposed several algorithms for constrained ANNS, almost all of these methods focus on integration with graph-based indexes, the predominant class of algorithms achieving state-of-the-art performance in latency-recall tradeoffs. In this work, we take a different approach and focus on developing a constrained ANNS algorithm via space partitioning as opposed to graphs. To that end, we introduce Constrained Approximate Partitioned Search (CAPS), an index for ANNS with filters via space partitions that not only retains the benefits of a partition-based algorithm but also outperforms state-of-the-art graph-based constrained search techniques in recall-latency tradeoffs, with only 10% of the index size.
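
As a rough illustration of filtered search over space partitions in general, the sketch below assigns vectors to the nearest of a few randomly chosen centroids, probes the partitions closest to the query, and keeps only candidates whose attribute matches the filter. This is a generic toy, not the CAPS index: the abstract does not specify the algorithm, and the function names, the random-centroid partitioning, and the single integer label per vector are all assumptions.

```python
import numpy as np

def build_index(vectors, n_parts, rng):
    """Pick random data points as centroids; assign each vector to its nearest one."""
    centroids = vectors[rng.choice(len(vectors), n_parts, replace=False)]
    dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dists.argmin(axis=1)

def filtered_search(query, required_label, vectors, labels,
                    centroids, assign, n_probe=2, k=3):
    """Return indices of up to k nearest vectors with the required label,
    looking only inside the n_probe partitions closest to the query."""
    probe = ((centroids - query) ** 2).sum(-1).argsort()[:n_probe]
    cand = np.flatnonzero(np.isin(assign, probe) & (labels == required_label))
    if cand.size == 0:
        return cand
    dists = ((vectors[cand] - query) ** 2).sum(-1)
    return cand[dists.argsort()[:k]]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 8))    # toy embeddings
labels = rng.integers(0, 4, size=500)  # toy attribute used as the filter
centroids, assign = build_index(vectors, n_parts=8, rng=rng)

query = rng.normal(size=8)
approx = filtered_search(query, 2, vectors, labels, centroids, assign, n_probe=2)
exact = filtered_search(query, 2, vectors, labels, centroids, assign, n_probe=8)
print("probing 2 partitions:", approx)
print("probing all 8 partitions:", exact)
```

Probing every partition reduces to an exact filtered scan; probing fewer trades recall for latency, which is the tradeoff the abstract refers to.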

* 14 pages
