Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shay Vargaftik

VMware Research

HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference

Feb 05, 2025

Zeyu Zhang, Haiying Shen, Shay Vargaftik, Ran Ben Basat, Michael Mitzenmacher, Minlan Yu

Figure 1 for HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference

Figure 2 for HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference

Figure 3 for HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference

Figure 4 for HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference

Abstract:Disaggregated Large Language Model (LLM) inference has gained popularity as it separates the computation-intensive prefill stage from the memory-intensive decode stage, avoiding the prefill-decode interference and improving resource utilization. However, transmitting Key-Value (KV) data between the two stages can be a bottleneck, especially for long prompts. Additionally, the computation time overhead for prefill and decode is key for optimizing Job Completion Time (JCT), and KV data size can become prohibitive for long prompts and sequences. Existing KV quantization methods can alleviate the transmission bottleneck and reduce memory requirements, but they introduce significant dequantization overhead, exacerbating the computation time. We propose Homomorphic Acceleration via Compression of the KV cache (HACK) for disaggregated LLM inference. HACK eliminates the heavy KV dequantization step, and directly performs computations on quantized KV data to approximate and reduce the cost of the expensive matrix-multiplication step. Extensive trace-driven experiments show that HACK reduces JCT by up to 70.9% compared to disaggregated LLM inference baseline and by up to 52.3% compared to state-of-the-art KV quantization methods.

Via

Access Paper or Ask Questions

Lucy: Think and Reason to Solve Text-to-SQL

Jul 06, 2024

Nina Narodytska, Shay Vargaftik

Figure 1 for Lucy: Think and Reason to Solve Text-to-SQL

Figure 2 for Lucy: Think and Reason to Solve Text-to-SQL

Figure 3 for Lucy: Think and Reason to Solve Text-to-SQL

Figure 4 for Lucy: Think and Reason to Solve Text-to-SQL

Abstract:Large Language Models (LLMs) have made significant progress in assisting users to query databases in natural language. While LLM-based techniques provide state-of-the-art results on many standard benchmarks, their performance significantly drops when applied to large enterprise databases. The reason is that these databases have a large number of tables with complex relationships that are challenging for LLMs to reason about. We analyze challenges that LLMs face in these settings and propose a new solution that combines the power of LLMs in understanding questions with automated reasoning techniques to handle complex database constraints. Based on these ideas, we have developed a new framework that outperforms state-of-the-art techniques in zero-shot text-to-SQL on complex benchmarks

Via

Access Paper or Ask Questions

Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression

Jul 01, 2024

Wenchen Han, Shay Vargaftik, Michael Mitzenmacher, Brad Karp, Ran Ben Basat

Figure 1 for Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression

Figure 2 for Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression

Figure 3 for Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression

Figure 4 for Beyond Throughput and Compression Ratios: Towards High End-to-end Utility of Gradient Compression

Abstract:Gradient aggregation has long been identified as a major bottleneck in today's large-scale distributed machine learning training systems. One promising solution to mitigate such bottlenecks is gradient compression, directly reducing communicated gradient data volume. However, in practice, many gradient compression schemes do not achieve acceleration of the training process while also preserving accuracy. In this work, we identify several common issues in previous gradient compression systems and evaluation methods. These issues include excessive computational overheads; incompatibility with all-reduce; and inappropriate evaluation metrics, such as not using an end-to-end metric or using a 32-bit baseline instead of a 16-bit baseline. We propose several general design and evaluation techniques to address these issues and provide guidelines for future work. Our preliminary evaluation shows that our techniques enhance the system's performance and provide a clearer understanding of the end-to-end utility of gradient compression methods.

* 9 pages, 3 figures

Via

Access Paper or Ask Questions

Optimal and Near-Optimal Adaptive Vector Quantization

Feb 05, 2024

Ran Ben-Basat, Yaniv Ben-Itzhak, Michael Mitzenmacher, Shay Vargaftik

Figure 1 for Optimal and Near-Optimal Adaptive Vector Quantization

Figure 2 for Optimal and Near-Optimal Adaptive Vector Quantization

Figure 3 for Optimal and Near-Optimal Adaptive Vector Quantization

Figure 4 for Optimal and Near-Optimal Adaptive Vector Quantization

Abstract:Quantization is a fundamental optimization for many machine-learning use cases, including compressing gradients, model weights and activations, and datasets. The most accurate form of quantization is \emph{adaptive}, where the error is minimized with respect to a given input, rather than optimizing for the worst case. However, optimal adaptive quantization methods are considered infeasible in terms of both their runtime and memory requirements. We revisit the Adaptive Vector Quantization (AVQ) problem and present algorithms that find optimal solutions with asymptotically improved time and space complexity. We also present an even faster near-optimal algorithm for large inputs. Our experiments show our algorithms may open the door to using AVQ more extensively in a variety of machine learning applications.

Via

Access Paper or Ask Questions

THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression

Feb 16, 2023

Minghao Li, Ran Ben Basat, Shay Vargaftik, ChonLam Lao, Kevin Xu, Xinran Tang, Michael Mitzenmacher, Minlan Yu

Figure 1 for THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression

Figure 2 for THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression

Figure 3 for THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression

Figure 4 for THC: Accelerating Distributed Deep Learning Using Tensor Homomorphic Compression

Abstract:Deep neural networks (DNNs) are the de-facto standard for essential use cases, such as image classification, computer vision, and natural language processing. As DNNs and datasets get larger, they require distributed training on increasingly larger clusters. A main bottleneck is then the resulting communication overhead where workers exchange model updates (i.e., gradients) on a per-round basis. To address this bottleneck and accelerate training, a widely-deployed approach is compression. However, previous deployments often apply bi-directional compression schemes by simply using a uni-directional gradient compression scheme in each direction. This results in significant computational overheads at the parameter server and increased compression error, leading to longer training and lower accuracy. We introduce Tensor Homomorphic Compression (THC), a novel bi-directional compression framework that enables the direct aggregation of compressed values while optimizing the bandwidth to accuracy tradeoff, thus eliminating the aforementioned overheads. Moreover, THC is compatible with in-network aggregation (INA), which allows for further acceleration. Evaluation over a testbed shows that THC improves time-to-accuracy in comparison to alternatives by up to 1.32x with a software PS and up to 1.51x using INA. Finally, we demonstrate that THC is scalable and tolerant for acceptable packet-loss rates.

* 13 pages body, 18 pages total

Via

Access Paper or Ask Questions

$\texttt{DoCoFL}$: Downlink Compression for Cross-Device Federated Learning

Feb 01, 2023

Ron Dorfman, Shay Vargaftik, Yaniv Ben-Itzhak, Kfir Y. Levy

$Figure 1 for $\texttt{DoCoFL}$: Downlink Compression for Cross-Device Federated Learning$

$Figure 2 for $\texttt{DoCoFL}$: Downlink Compression for Cross-Device Federated Learning$

$Figure 3 for $\texttt{DoCoFL}$: Downlink Compression for Cross-Device Federated Learning$

$Figure 4 for $\texttt{DoCoFL}$: Downlink Compression for Cross-Device Federated Learning$

Abstract:Many compression techniques have been proposed to reduce the communication overhead of Federated Learning training procedures. However, these are typically designed for compressing model updates, which are expected to decay throughout training. As a result, such methods are inapplicable to downlink (i.e., from the parameter server to clients) compression in the cross-device setting, where heterogeneous clients $\textit{may appear only once}$ during training and thus must download the model parameters. In this paper, we propose a new framework ($\texttt{DoCoFL}$) for downlink compression in the cross-device federated learning setting. Importantly, $\texttt{DoCoFL}$ can be seamlessly combined with many uplink compression schemes, rendering it suitable for bi-directional compression. Through extensive evaluation, we demonstrate that $\texttt{DoCoFL}$ offers significant bi-directional bandwidth reduction while achieving competitive accuracy to that of $\texttt{FedAvg}$ without compression.

Via

Access Paper or Ask Questions

ScionFL: Secure Quantized Aggregation for Federated Learning

Oct 13, 2022

Yaniv Ben-Itzhak, Helen Möllering, Benny Pinkas, Thomas Schneider, Ajith Suresh, Oleksandr Tkachenko, Shay Vargaftik, Christian Weinert, Hossein Yalame, Avishay Yanai

Figure 1 for ScionFL: Secure Quantized Aggregation for Federated Learning

Figure 2 for ScionFL: Secure Quantized Aggregation for Federated Learning

Figure 3 for ScionFL: Secure Quantized Aggregation for Federated Learning

Figure 4 for ScionFL: Secure Quantized Aggregation for Federated Learning

Abstract:Privacy concerns in federated learning (FL) are commonly addressed with secure aggregation schemes that prevent a central party from observing plaintext client updates. However, most such schemes neglect orthogonal FL research that aims at reducing communication between clients and the aggregator and is instrumental in facilitating cross-device FL with thousands and even millions of (mobile) participants. In particular, quantization techniques can typically reduce client-server communication by a factor of 32x. In this paper, we unite both research directions by introducing an efficient secure aggregation framework based on outsourced multi-party computation (MPC) that supports any linear quantization scheme. Specifically, we design a novel approximate version of an MPC-based secure aggregation protocol with support for multiple stochastic quantization schemes, including ones that utilize the randomized Hadamard transform and Kashin's representation. In our empirical performance evaluation, we show that with no additional overhead for clients and moderate inter-server communication, we achieve similar training accuracy as insecure schemes for standard FL benchmarks. Beyond this, we present an efficient extension to our secure quantized aggregation framework that effectively defends against state-of-the-art untargeted poisoning attacks.

Via

Access Paper or Ask Questions

QUIC-FL: Quick Unbiased Compression for Federated Learning

May 28, 2022

Ran Ben Basat, Shay Vargaftik, Amit Portnoy, Gil Einziger, Yaniv Ben-Itzhak, Michael Mitzenmacher

Figure 1 for QUIC-FL: Quick Unbiased Compression for Federated Learning

Figure 2 for QUIC-FL: Quick Unbiased Compression for Federated Learning

Figure 3 for QUIC-FL: Quick Unbiased Compression for Federated Learning

Figure 4 for QUIC-FL: Quick Unbiased Compression for Federated Learning

Abstract:Distributed Mean Estimation (DME) is a fundamental building block in communication efficient federated learning. In DME, clients communicate their lossily compressed gradients to the parameter server, which estimates the average and updates the model. State of the art DME techniques apply either unbiased quantization methods, resulting in large estimation errors, or biased quantization methods, where unbiasing the result requires that the server decodes each gradient individually, which markedly slows the aggregation time. In this paper, we propose QUIC-FL, a DME algorithm that achieves the best of all worlds. QUIC-FL is unbiased, offers fast aggregation time, and is competitive with the most accurate (slow aggregation) DME techniques. To achieve this, we formalize the problem in a novel way that allows us to use standard solvers to design near-optimal unbiased quantization schemes.

Via

Access Paper or Ask Questions

Automating In-Network Machine Learning

May 18, 2022

Changgang Zheng, Mingyuan Zang, Xinpeng Hong, Riyad Bensoussane, Shay Vargaftik, Yaniv Ben-Itzhak, Noa Zilberman

Figure 1 for Automating In-Network Machine Learning

Figure 2 for Automating In-Network Machine Learning

Figure 3 for Automating In-Network Machine Learning

Figure 4 for Automating In-Network Machine Learning

Abstract:Using programmable network devices to aid in-network machine learning has been the focus of significant research. However, most of the research was of a limited scope, providing a proof of concept or describing a closed-source algorithm. To date, no general solution has been provided for mapping machine learning algorithms to programmable network devices. In this paper, we present Planter, an open-source, modular framework for mapping trained machine learning models to programmable devices. Planter supports a wide range of machine learning models, multiple targets and can be easily extended. The evaluation of Planter compares different mapping approaches, and demonstrates the feasibility, performance, and resource efficiency for applications such as anomaly detection, financial transactions, and quality of experience. The results show that Planter-based in-network machine learning algorithms can run at line rate, have a negligible effect on latency, coexist with standard switching functionality, and have no or minor accuracy trade-offs.

* (13 pages body, 19 pages total, 18 figures)

Via

Access Paper or Ask Questions

IIsy: Practical In-Network Classification

May 17, 2022

Changgang Zheng, Zhaoqi Xiong, Thanh T Bui, Siim Kaupmees, Riyad Bensoussane, Antoine Bernabeu, Shay Vargaftik, Yaniv Ben-Itzhak, Noa Zilberman

Figure 1 for IIsy: Practical In-Network Classification

Figure 2 for IIsy: Practical In-Network Classification

Figure 3 for IIsy: Practical In-Network Classification

Figure 4 for IIsy: Practical In-Network Classification

Abstract:The rat race between user-generated data and data-processing systems is currently won by data. The increased use of machine learning leads to further increase in processing requirements, while data volume keeps growing. To win the race, machine learning needs to be applied to the data as it goes through the network. In-network classification of data can reduce the load on servers, reduce response time and increase scalability. In this paper, we introduce IIsy, implementing machine learning classification models in a hybrid fashion using off-the-shelf network devices. IIsy targets three main challenges of in-network classification: (i) mapping classification models to network devices (ii) extracting the required features and (iii) addressing resource and functionality constraints. IIsy supports a range of traditional and ensemble machine learning models, scaling independently of the number of stages in a switch pipeline. Moreover, we demonstrate the use of IIsy for hybrid classification, where a small model is implemented on a switch and a large model at the backend, achieving near optimal classification results, while significantly reducing latency and load on the servers.

* (14 pages body, 19 pages total, 19 figures)

Via

Access Paper or Ask Questions