Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rachid Guerraoui

Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland

Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores

Oct 16, 2022

Arsany Guirguis, Diana Petrescu, Florin Dinu, Do Le Quoc, Javier Picorel, Rachid Guerraoui

Figure 1 for Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores

Figure 2 for Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores

Figure 3 for Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores

Figure 4 for Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores

Abstract:Storage disaggregation is fundamental to today's cloud due to cost and scalability benefits. Unfortunately, this design must cope with an inherent network bottleneck between the storage and the compute tiers. The widely deployed mitigation strategy is to provide computational resources next to storage to push down a part of an application and thus reduce the amount of data transferred to the compute tier. Overall, users of disaggregated storage need to consider two main constraints: the network may remain a bottleneck, and the storage-side computational resources are limited. This paper identifies transfer learning (TL) as a natural fit for the disaggregated cloud. TL, famously described as the next driver of ML commercial success, is widely popular and has broad-range applications. We show how to leverage the unique structure of TL's fine-tuning phase (i.e., a combination of feature extraction and training) to flexibly address the aforementioned constraints and improve both user and operator-centric metrics. The key to improving user-perceived performance is to mitigate the network bottleneck by carefully splitting the TL deep neural network (DNN) such that feature extraction is, partially or entirely, executed next to storage. Crucially, such splitting enables decoupling the batch size of feature extraction from the training batch size, facilitating efficient storage-side batch size adaptation to increase concurrency in the storage tier while avoiding out-of-memory errors. Guided by these insights, we present HAPI, a processing system for TL that spans the compute and storage tiers while remaining transparent to the user. Our evaluation with several DNNs, such as ResNet, VGG, and Transformer, shows up to 11x improvement in application runtime and up to 8.3x reduction in the data transferred from the storage to the compute tier compared to running the computation in the compute tier.

* 14 pages, 14 figures, 5 tables

Via

Access Paper or Ask Questions

SoK: On the Impossible Security of Very Large Foundation Models

Sep 30, 2022

El-Mahdi El-Mhamdi, Sadegh Farhadkhani, Rachid Guerraoui, Nirupam Gupta, Lê-Nguyên Hoang, Rafael Pinot, John Stephan

Abstract:Large machine learning models, or so-called foundation models, aim to serve as base-models for application-oriented machine learning. Although these models showcase impressive performance, they have been empirically found to pose serious security and privacy issues. We may however wonder if this is a limitation of the current models, or if these issues stem from a fundamental intrinsic impossibility of the foundation model learning problem itself. This paper aims to systematize our knowledge supporting the latter. More precisely, we identify several key features of today's foundation model learning problem which, given the current understanding in adversarial machine learning, suggest incompatibility of high accuracy with both security and privacy. We begin by observing that high accuracy seems to require (1) very high-dimensional models and (2) huge amounts of data that can only be procured through user-generated datasets. Moreover, such data is fundamentally heterogeneous, as users generally have very specific (easily identifiable) data-generating habits. More importantly, users' data is filled with highly sensitive information, and maybe heavily polluted by fake users. We then survey lower bounds on accuracy in privacy-preserving and Byzantine-resilient heterogeneous learning that, we argue, constitute a compelling case against the possibility of designing a secure and privacy-preserving high-accuracy foundation model. We further stress that our analysis also applies to other high-stake machine learning applications, including content recommendation. We conclude by calling for measures to prioritize security and privacy, and to slow down the race for ever larger models.

* 13 pages

Via

Access Paper or Ask Questions

Making Byzantine Decentralized Learning Efficient

Sep 22, 2022

Sadegh Farhadkhani, Rachid Guerraoui, Nirupam Gupta, Lê Nguyên Hoang, Rafael Pinot, John Stephan

Figure 1 for Making Byzantine Decentralized Learning Efficient

Figure 2 for Making Byzantine Decentralized Learning Efficient

Figure 3 for Making Byzantine Decentralized Learning Efficient

Figure 4 for Making Byzantine Decentralized Learning Efficient

Abstract:Decentralized-SGD (D-SGD) distributes heavy learning tasks across multiple machines (a.k.a., {\em nodes}), effectively dividing the workload per node by the size of the system. However, a handful of \emph{Byzantine} (i.e., misbehaving) nodes can jeopardize the entire learning procedure. This vulnerability is further amplified when the system is \emph{asynchronous}. Although approaches that confer Byzantine resilience to D-SGD have been proposed, these significantly impact the efficiency of the process to the point of even negating the benefit of decentralization. This naturally raises the question: \emph{can decentralized learning simultaneously enjoy Byzantine resilience and reduced workload per node?} We answer positively by proposing \newalgorithm{} that ensures Byzantine resilience without losing the computational efficiency of D-SGD. Essentially, \newalgorithm{} weakens the impact of Byzantine nodes by reducing the variance in local updates using \emph{Polyak's momentum}. Then, by establishing coordination between nodes via {\em signed echo broadcast} and a {\em nearest-neighbor averaging} scheme, we effectively tolerate Byzantine nodes whilst distributing the overhead amongst the non-Byzantine nodes. To demonstrate the correctness of our algorithm, we introduce and analyze a novel {\em Lyapunov function} that accounts for the {\em non-Markovian model drift} arising from the use of momentum. We also demonstrate the efficiency of \newalgorithm{} through experiments on several image classification tasks.

* 63 pages,5 figures

Via

Access Paper or Ask Questions

Byzantine Machine Learning Made Easy by Resilient Averaging of Momentums

May 24, 2022

Sadegh Farhadkhani, Rachid Guerraoui, Nirupam Gupta, Rafael Pinot, John Stephan

Figure 1 for Byzantine Machine Learning Made Easy by Resilient Averaging of Momentums

Figure 2 for Byzantine Machine Learning Made Easy by Resilient Averaging of Momentums

Figure 3 for Byzantine Machine Learning Made Easy by Resilient Averaging of Momentums

Figure 4 for Byzantine Machine Learning Made Easy by Resilient Averaging of Momentums

Abstract:Byzantine resilience emerged as a prominent topic within the distributed machine learning community. Essentially, the goal is to enhance distributed optimization algorithms, such as distributed SGD, in a way that guarantees convergence despite the presence of some misbehaving (a.k.a., {\em Byzantine}) workers. Although a myriad of techniques addressing the problem have been proposed, the field arguably rests on fragile foundations. These techniques are hard to prove correct and rely on assumptions that are (a) quite unrealistic, i.e., often violated in practice, and (b) heterogeneous, i.e., making it difficult to compare approaches. We present \emph{RESAM (RESilient Averaging of Momentums)}, a unified framework that makes it simple to establish optimal Byzantine resilience, relying only on standard machine learning assumptions. Our framework is mainly composed of two operators: \emph{resilient averaging} at the server and \emph{distributed momentum} at the workers. We prove a general theorem stating the convergence of distributed SGD under RESAM. Interestingly, demonstrating and comparing the convergence of many existing techniques become direct corollaries of our theorem, without resorting to stringent assumptions. We also present an empirical evaluation of the practical relevance of RESAM.

* Accepted at ICML 2022

Via

Access Paper or Ask Questions

An Equivalence Between Data Poisoning and Byzantine Gradient Attacks

Feb 17, 2022

Sadegh Farhadkhani, Rachid Guerraoui, Lê-Nguyên Hoang, Oscar Villemaud

Figure 1 for An Equivalence Between Data Poisoning and Byzantine Gradient Attacks

Figure 2 for An Equivalence Between Data Poisoning and Byzantine Gradient Attacks

Figure 3 for An Equivalence Between Data Poisoning and Byzantine Gradient Attacks

Figure 4 for An Equivalence Between Data Poisoning and Byzantine Gradient Attacks

Abstract:To study the resilience of distributed learning, the "Byzantine" literature considers a strong threat model where workers can report arbitrary gradients to the parameter server. Whereas this model helped obtain several fundamental results, it has sometimes been considered unrealistic, when the workers are mostly trustworthy machines. In this paper, we show a surprising equivalence between this model and data poisoning, a threat considered much more realistic. More specifically, we prove that every gradient attack can be reduced to data poisoning, in any personalized federated learning system with PAC guarantees (which we show are both desirable and realistic). This equivalence makes it possible to obtain new impossibility results on the resilience to data poisoning as corollaries of existing impossibility theorems on Byzantine machine learning. Moreover, using our equivalence, we derive a practical attack that we show (theoretically and empirically) can be very effective against classical personalized federated learning models.

* arXiv admin note: text overlap with arXiv:2106.02398

Via

Access Paper or Ask Questions

Combining Differential Privacy and Byzantine Resilience in Distributed SGD

Oct 26, 2021

Rachid Guerraoui, Nirupam Gupta, Rafael Pinot, Sebastien Rouault, John Stephan

Figure 1 for Combining Differential Privacy and Byzantine Resilience in Distributed SGD

Figure 2 for Combining Differential Privacy and Byzantine Resilience in Distributed SGD

Figure 3 for Combining Differential Privacy and Byzantine Resilience in Distributed SGD

Figure 4 for Combining Differential Privacy and Byzantine Resilience in Distributed SGD

Abstract:Privacy and Byzantine resilience (BR) are two crucial requirements of modern-day distributed machine learning. The two concepts have been extensively studied individually but the question of how to combine them effectively remains unanswered. This paper contributes to addressing this question by studying the extent to which the distributed SGD algorithm, in the standard parameter-server architecture, can learn an accurate model despite (a) a fraction of the workers being malicious (Byzantine), and (b) the other fraction, whilst being honest, providing noisy information to the server to ensure differential privacy (DP). We first observe that the integration of standard practices in DP and BR is not straightforward. In fact, we show that many existing results on the convergence of distributed SGD under Byzantine faults, especially those relying on $(\alpha,f)$-Byzantine resilience, are rendered invalid when honest workers enforce DP. To circumvent this shortcoming, we revisit the theory of $(\alpha,f)$-BR to obtain an approximate convergence guarantee. Our analysis provides key insights on how to improve this guarantee through hyperparameter optimization. Essentially, our theoretical and empirical results show that (1) an imprudent combination of standard approaches to DP and BR might be fruitless, but (2) by carefully re-tuning the learning algorithm, we can obtain reasonable learning accuracy while simultaneously guaranteeing DP and BR.

Via

Access Paper or Ask Questions

Strategyproof Learning: Building Trustworthy User-Generated Datasets

Jun 04, 2021

Sadegh Farhadkhani, Rachid Guerraoui, Lê-Nguyên Hoang

Figure 1 for Strategyproof Learning: Building Trustworthy User-Generated Datasets

Abstract:Today's large-scale machine learning algorithms harness massive amounts of user-generated data to train large models. However, especially in the context of content recommendation with enormous social, economical and political incentives to promote specific views, products or ideologies, strategic users might be tempted to fabricate or mislabel data in order to bias algorithms in their favor. Unfortunately, today's learning schemes strongly incentivize such strategic data misreporting. This is a major concern, as it endangers the trustworthiness of the entire training datasets, and questions the safety of any algorithm trained on such datasets. In this paper, we show that, perhaps surprisingly, incentivizing data misreporting is not a fatality. We propose the first personalized collaborative learning framework, Licchavi, with provable strategyproofness guarantees through a careful design of the underlying loss function. Interestingly, we also prove that Licchavi is Byzantine resilient: it tolerates a minority of users that provide arbitrary data.

* 31 pages

Via

Access Paper or Ask Questions

Differential Privacy and Byzantine Resilience in SGD: Do They Add Up?

Feb 16, 2021

Rachid Guerraoui, Nirupam Gupta, Rafaël Pinot, Sébastien Rouault, John Stephan

Figure 1 for Differential Privacy and Byzantine Resilience in SGD: Do They Add Up?

Figure 2 for Differential Privacy and Byzantine Resilience in SGD: Do They Add Up?

Figure 3 for Differential Privacy and Byzantine Resilience in SGD: Do They Add Up?

Figure 4 for Differential Privacy and Byzantine Resilience in SGD: Do They Add Up?

Abstract:This paper addresses the problem of combining Byzantine resilience with privacy in machine learning (ML). Specifically, we study whether a distributed implementation of the renowned Stochastic Gradient Descent (SGD) learning algorithm is feasible with both differential privacy (DP) and $(\alpha,f)$-Byzantine resilience. To the best of our knowledge, this is the first work to tackle this problem from a theoretical point of view. A key finding of our analyses is that the classical approaches to these two (seemingly) orthogonal issues are incompatible. More precisely, we show that a direct composition of these techniques makes the guarantees of the resulting SGD algorithm depend unfavourably upon the number of parameters in the ML model, making the training of large models practically infeasible. We validate our theoretical results through numerical experiments on publicly-available datasets; showing that it is impractical to ensure DP and Byzantine resilience simultaneously.

Via

Access Paper or Ask Questions

Garfield: System Support for Byzantine Machine Learning

Oct 12, 2020

El-Mahdi El-Mhamdi, Rachid Guerraoui, Arsany Guirguis, Sébastien Rouault

Figure 1 for Garfield: System Support for Byzantine Machine Learning

Figure 2 for Garfield: System Support for Byzantine Machine Learning

Figure 3 for Garfield: System Support for Byzantine Machine Learning

Figure 4 for Garfield: System Support for Byzantine Machine Learning

Abstract:Byzantine Machine Learning (ML) systems are nowadays vulnerable for they require trusted machines and/or a synchronous network. We present Garfield, a system that provably achieves Byzantine resilience in ML applications without assuming any trusted component nor any bound on communication or computation delays. Garfield leverages ML specificities to make progress despite consensus being impossible in such an asynchronous, Byzantine environment. Following the classical server/worker architecture, Garfield replicates the parameter server while relying on the statistical properties of stochastic gradient descent to keep the models on the correct servers close to each other. On the other hand, Garfield uses statistically-robust gradient aggregation rules (GARs) to achieve resilience against Byzantine workers. We integrate Garfield with two widely-used ML frameworks, TensorFlow and PyTorch, while achieving transparency: applications developed with either framework do not need to change their interfaces to be made Byzantine resilient. Our implementation supports full-stack computations on both CPUs and GPUs. We report on our evaluation of Garfield with different (a) baselines, (b) ML models (e.g., ResNet-50 and VGG), and (c) hardware infrastructures (CPUs and GPUs). Our evaluation highlights several interesting facts about the cost of Byzantine resilience. In particular, (a) Byzantine resilience, unlike crash resilience, induces an accuracy loss, and (b) the throughput overhead comes much more from communication (70%) than from aggregation.

* 32 pages; 16 figures; 2 tables

Via

Access Paper or Ask Questions

Collaborative Learning as an Agreement Problem

Aug 04, 2020

El-Mahdi El-Mhamdi, Rachid Guerraoui, Arsany Guirguis, Lê Nguyên Hoang, Sébastien Rouault

Figure 1 for Collaborative Learning as an Agreement Problem

Abstract:We address the problem of Byzantine collaborative learning: a set of $n$ nodes try to collectively learn from data, whose distributions may vary from one node to another. None of them is trusted and $f < n$ can behave arbitrarily. We show that collaborative learning is equivalent to a new form of agreement, which we call averaging agreement. In this problem, nodes start each with an initial vector and seek to approximately agree on a common vector, while guaranteeing that this common vector remains within a constant (also called averaging constant) of the maximum distance between the original vectors. Essentially, the smaller the averaging constant, the better the learning. We present three asynchronous solutions to averaging agreement, each interesting in its own right. The first, based on the minimum volume ellipsoid, achieves asymptotically the best-possible averaging constant but requires $ n \geq 6f+1$. The second, based on reliable broadcast, achieves optimal Byzantine resilience, i.e., $n \geq 3f+1$, but requires signatures and induces a large number of communication rounds. The third, based on coordinate-wise trimmed mean, is faster and achieves optimal Byzantine resilience, i.e., $n \geq 4f+1$, within standard form algorithms that do not use signatures.

Via

Access Paper or Ask Questions