As the network scale increases, existing fully distributed solutions start to lag behind the real-world challenges such as (1) slow information propagation, (2) network communication failures, and (3) external adversarial attacks. In this paper, we focus on hierarchical system architecture and address the problem of non-Bayesian learning over networks that are vulnerable to communication failures and adversarial attacks. On network communication, we consider packet-dropping link failures. We first propose a hierarchical robust push-sum algorithm that can achieve average consensus despite frequent packet-dropping link failures. We provide a sparse information fusion rule between the parameter server and arbitrarily selected network representatives. Then, interleaving the consensus update step with a dual averaging update with Kullback-Leibler (KL) divergence as the proximal function, we obtain a packet-dropping fault-tolerant non-Bayesian learning algorithm with provable convergence guarantees. On external adversarial attacks, we consider Byzantine attacks in which the compromised agents can send maliciously calibrated messages to others (including both the agents and the parameter server). To avoid the curse of dimensionality of Byzantine consensus, we solve the non-Bayesian learning problem via running multiple dynamics, each of which only involves Byzantine consensus with scalar inputs. To facilitate resilient information propagation across sub-networks, we use a novel Byzantine-resilient gossiping-type rule at the parameter server.
Fully distributed estimation and tracking solutions to large-scale multi-agent networks suffer slow convergence and are vulnerable to network failures. In this paper, we aim to speed up the convergence and enhance the resilience of state estimation and tracking using a simple hierarchical system architecture wherein agents are clusters into smaller networks, and a parameter server exists to aid the information exchanges among networks. The information exchange among networks is expensive and occurs only once in a while. We propose two consensus + innovation algorithms for the state estimation and tracking problems, respectively. In both algorithms, we use a novel hierarchical push-sum consensus component. For the state estimation, we use dual averaging as the local innovation component. State tracking is much harder to tackle in the presence of dropping-link failures and the standard integration of the consensus and innovation approaches are no longer applicable. Moreover, dual averaging is no longer feasible. Our algorithm introduces a pair of additional variables per link and ensure the relevant local variables evolve according to the state dynamics, and use projected local gradient descent as the local innovation component. We also characterize the convergence rates of both of the algorithms under linear local observation model and minimal technical assumptions. We numerically validate our algorithm through simulation of both state estimation and tracking problems.
Federated learning (FL) is a decentralized learning framework wherein a parameter server (PS) and a collection of clients collaboratively train a model via minimizing a global objective. Communication bandwidth is a scarce resource; in each round, the PS aggregates the updates from a subset of clients only. In this paper, we focus on non-convex minimization that is vulnerable to non-uniform and time-varying communication failures between the PS and the clients. Specifically, in each round $t$, the link between the PS and client $i$ is active with probability $p_i^t$, which is $\textit{unknown}$ to both the PS and the clients. This arises when the channel conditions are heterogeneous across clients and are changing over time. We show that when the $p_i^t$'s are not uniform, $\textit{Federated Average}$ (FedAvg) -- the most widely adopted FL algorithm -- fails to minimize the global objective. Observing this, we propose $\textit{Federated Postponed Broadcast}$ (FedPBC) which is a simple variant of FedAvg. It differs from FedAvg in that the PS postpones broadcasting the global model till the end of each round. We show that FedPBC converges to a stationary point of the original objective. The introduced staleness is mild and there is no noticeable slowdown. Both theoretical analysis and numerical results are provided. On the technical front, postponing the global model broadcasts enables implicit gossiping among the clients with active links at round $t$. Despite $p_i^t$'s are time-varying, we are able to bound the perturbation of the global model dynamics via the techniques of controlling the gossip-type information mixing errors.
Federated learning is a decentralized machine learning framework wherein not all clients are able to participate in each round. An emerging line of research is devoted to tackling arbitrary client unavailability. Existing theoretical analysis imposes restrictive structural assumptions on the unavailability patterns, and their proposed algorithms were tailored to those assumptions. In this paper, we relax those assumptions and consider adversarial client unavailability. To quantify the degrees of client unavailability, we use the notion of {\em $\epsilon$-adversary dropout fraction}. For both non-convex and strongly-convex global objectives, we show that simple variants of FedAvg or FedProx, albeit completely agnostic to $\epsilon$, converge to an estimation error on the order of $\epsilon (G^2 + \sigma^2)$, where $G$ is a heterogeneity parameter and $\sigma^2$ is the noise level. We prove that this estimation error is minimax-optimal. We also show that the variants of FedAvg or FedProx have convergence speeds $O(1/\sqrt{T})$ for non-convex objectives and $O(1/T)$ for strongly-convex objectives, both of which are the best possible for any first-order method that only has access to noisy gradients. Our proofs build upon a tight analysis of the selection bias that persists in the entire learning process. We validate our theoretical prediction through numerical experiments on synthetic and real-world datasets.
Deep learning is the method of choice for trajectory prediction for autonomous vehicles. Unfortunately, its data-hungry nature implicitly requires the availability of sufficiently rich and high-quality centralized datasets, which easily leads to privacy leakage. Besides, uncertainty-awareness becomes increasingly important for safety-crucial cyber physical systems whose prediction module heavily relies on machine learning tools. In this paper, we relax the data collection requirement and enhance uncertainty-awareness by using Federated Learning on Connected Autonomous Vehicles with an uncertainty-aware global objective. We name our algorithm as FLTP. We further introduce ALFLTP which boosts FLTP via using active learning techniques in adaptatively selecting participating clients. We consider both negative log-likelihood (NLL) and aleatoric uncertainty (AU) as client selection metrics. Experiments on Argoverse dataset show that FLTP significantly outperforms the model trained on local data. In addition, ALFLTP-AU converges faster in training regression loss and performs better in terms of NLL, minADE and MR than FLTP in most rounds, and has more stable round-wise performance than ALFLTP-NLL.
Federated Learning (FL) is a nascent privacy-preserving learning framework under which the local data of participating clients is kept locally throughout model training. Scarce communication resources and data heterogeneity are two defining characteristics of FL. Besides, a FL system is often implemented in a harsh environment -- leaving the clients vulnerable to Byzantine attacks. To the best of our knowledge, no gradient compressors simultaneously achieve quantitative Byzantine resilience and privacy preservation. In this paper, we fill this gap via revisiting the stochastic sign SGD \cite{jin 2020}. We propose $\beta$-stochastic sign SGD, which contains a gradient compressor that encodes a client's gradient information in sign bits subject to the privacy budget $\beta>0$. We show that as long as $\beta>0$, $\beta$-stochastic sign SGD converges in the presence of partial client participation and mobile Byzantine faults, showing that it achieves quantifiable Byzantine-resilience and differential privacy simultaneously. In sharp contrast, when $\beta=0$, the compressor is not differentially private. Notably, for the special case when each of the stochastic gradients involved is bounded with known bounds, our gradient compressor with $\beta=0$ coincides with the compressor proposed in \cite{jin 2020}. As a byproduct, we show that when the clients report sign messages, the popular information aggregation rules simple mean, trimmed mean, median and majority vote are identical in terms of the output signs. Our theories are corroborated by experiments on MNIST and CIFAR-10 datasets.
This paper studies the problem of model training under Federated Learning when clients exhibit cluster structure. We contextualize this problem in mixed regression, where each client has limited local data generated from one of $k$ unknown regression models. We design an algorithm that achieves global convergence from any initialization, and works even when local data volume is highly unbalanced -- there could exist clients that contain $O(1)$ data points only. Our algorithm first runs moment descent on a few anchor clients (each with $\tilde{\Omega}(k)$ data points) to obtain coarse model estimates. Then each client alternately estimates its cluster labels and refines the model estimates based on FedAvg or FedProx. A key innovation in our analysis is a uniform estimate on the clustering errors, which we prove by bounding the VC dimension of general polynomial concept classes based on the theory of algebraic geometry.
Federated Learning (FL) is a promising framework that has great potentials in privacy preservation and in lowering the computation load at the cloud. FedAvg and FedProx are two widely adopted algorithms. However, recent work raised concerns on these two methods: (1) their fixed points do not correspond to the stationary points of the original optimization problem, and (2) the common model found might not generalize well locally. In this paper, we alleviate these concerns. Towards this, we adopt the statistical learning perspective yet allow the distributions to be heterogeneous and the local data to be unbalanced. We show, in the general kernel regression setting, that both FedAvg and FedProx converge to the minimax-optimal error rates. Moreover, when the kernel function has a finite rank, the convergence is exponentially fast. Our results further analytically quantify the impact of the model heterogeneity and characterize the federation gain - the reduction of the estimation error for a worker to join the federated learning compared to the best local estimator. To the best of our knowledge, we are the first to show the achievability of minimax error rates under FedAvg and FedProx, and the first to characterize the gains in joining FL. Numerical experiments further corroborate our theoretical findings on the statistical optimality of FedAvg and FedProx and the federation gains.
We consider training over-parameterized two-layer neural networks with Rectified Linear Unit (ReLU) using gradient descent (GD) method. Inspired by a recent line of work, we study the evolutions of the network prediction errors across GD iterations, which can be neatly described in a matrix form. It turns out that when the network is sufficiently over-parameterized, these matrices individually approximate an integral operator which is determined by the feature vector distribution $\rho$ only. Consequently, GD method can be viewed as approximately apply the powers of this integral operator on the underlying/target function $f^*$ that generates the responses/labels. We show that if $f^*$ admits a low-rank approximation with respect to the eigenspaces of this integral operator, then, even with constant stepsize, the empirical risk decreases to this low-rank approximation error at a linear rate in iteration $t$. In addition, this linear rate is determined by $f^*$ and $\rho$ only. Furthermore, if $f^*$ has zero low-rank approximation error, then $\Omega(n^2)$ network over-parameterization is enough, and the empirical risk decreases to $\Theta(1/\sqrt{n})$. We provide an application of our general results to the setting where $\rho$ is the uniform distribution on the spheres and $f^*$ is a polynomial.
Standard distributed machine learning frameworks require collecting the training data from data providers and storing it in a datacenter. To ease privacy concerns, alternative distributed machine learning frameworks (such as {\em Federated Learning}) have been proposed, wherein the training data is kept confidential by its providers from the learner, and the learner learns the model by communicating with data providers. However, such frameworks suffer from serious security risks, as data providers are vulnerable to adversarial attacks and the learner lacks of enough administrative power. We assume in each communication round, up to $q$ out of the $m$ data providers/workers suffer Byzantine faults. Each worker keeps a local sample of size $n$ and the total sample size is $N=nm$. Of particular interest is the high-dimensional regime, where the local sample size $n$ is much smaller than the model dimension $d$. We propose a secured variant of the gradient descent method and show that it tolerates up to a constant fraction of Byzantine workers. Moreover, we show the statistical estimation error of the iterates converges in $O(\log N)$ rounds to $O(\sqrt{q/N} + \sqrt{d/N})$, which is larger than the minimax-optimal error rate $O(\sqrt{d/N})$ in the failure-free setting by at most an additive term $O(\sqrt{q/N})$. As long as $q=O(d)$, our proposed algorithm achieves the optimal error rate $O(\sqrt{d/N})$. The core of our method is a robust gradient aggregator based on the iterative filtering algorithm proposed by Steinhardt et al. We establish a {\em uniform} concentration of the sample covariance matrix of gradients, and show that the aggregated gradient, as a function of model parameter, converges uniformly to the true gradient function. As a by-product, we develop a new concentration inequality for sample covariance matrices, which might be of independent interest.