
Lili Su


Network Fault-tolerant and Byzantine-resilient Social Learning via Collaborative Hierarchical Non-Bayesian Learning

Jul 27, 2023
Connor Mclaughlin, Matthew Ding, Deniz Erdogmus, Lili Su


As the network scale increases, existing fully distributed solutions start to lag behind real-world challenges such as (1) slow information propagation, (2) network communication failures, and (3) external adversarial attacks. In this paper, we focus on a hierarchical system architecture and address the problem of non-Bayesian learning over networks that are vulnerable to communication failures and adversarial attacks. On network communication, we consider packet-dropping link failures. We first propose a hierarchical robust push-sum algorithm that can achieve average consensus despite frequent packet-dropping link failures. We provide a sparse information fusion rule between the parameter server and arbitrarily selected network representatives. Then, interleaving the consensus update step with a dual averaging update that uses the Kullback-Leibler (KL) divergence as the proximal function, we obtain a packet-dropping fault-tolerant non-Bayesian learning algorithm with provable convergence guarantees. On external adversarial attacks, we consider Byzantine attacks in which compromised agents can send maliciously calibrated messages to others (including both the agents and the parameter server). To avoid the curse of dimensionality of Byzantine consensus, we solve the non-Bayesian learning problem by running multiple dynamics, each of which only involves Byzantine consensus with scalar inputs. To facilitate resilient information propagation across sub-networks, we use a novel Byzantine-resilient gossiping-type rule at the parameter server.

* 11 pages, 1 figure 
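The hierarchical algorithm and its fusion rule are the paper's contribution; as a point of reference, below is a minimal Python sketch of the flat robust push-sum (ratio-consensus) primitive it builds on, in which running-sum corrections let mass lost to dropped packets be recovered later. The toy graph, drop probability, and all variable names are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_drop, T = 8, 0.3, 600

x0 = rng.normal(size=n)
target = x0.mean()

# A fixed, strongly connected digraph: a directed ring plus one shortcut.
out_nbrs = [[(i + 1) % n] for i in range(n)]
out_nbrs[0].append(n // 2)
deg = np.array([len(nb) + 1 for nb in out_nbrs], dtype=float)  # +1: self share

x, y = x0.copy(), np.ones(n)
sig_x, sig_y = np.zeros(n), np.zeros(n)   # running sums of mass sent per link
rho_x = np.zeros((n, n))                  # rho_x[j, i]: last sig_x[i] seen by j
rho_y = np.zeros((n, n))

for _ in range(T):
    share_x, share_y = x / deg, y / deg   # split mass over out-links + self
    sig_x, sig_y = sig_x + share_x, sig_y + share_y
    new_x, new_y = share_x.copy(), share_y.copy()  # self share always arrives
    for i in range(n):
        for j in out_nbrs[i]:
            if rng.random() > p_drop:     # link i -> j delivered this round
                new_x[j] += sig_x[i] - rho_x[j, i]  # backlog recovered too
                new_y[j] += sig_y[i] - rho_y[j, i]
                rho_x[j, i], rho_y[j, i] = sig_x[i], sig_y[i]
    x, y = new_x, new_y

print("max |x_i / y_i - average| =", np.abs(x / y - target).max())
```

Each node tracks the cumulative mass it has broadcast (sig) and the last cumulative value it has seen from each in-neighbor (rho), so a successful delivery automatically replays whatever earlier drops withheld.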

Fast and Robust State Estimation and Tracking via Hierarchical Learning

Jun 29, 2023
Connor Mclaughlin, Matthew Ding, Deniz Erdogmus, Lili Su


Fully distributed estimation and tracking solutions for large-scale multi-agent networks suffer from slow convergence and are vulnerable to network failures. In this paper, we aim to speed up convergence and enhance the resilience of state estimation and tracking using a simple hierarchical system architecture wherein agents are clustered into smaller networks and a parameter server exists to aid information exchange among the networks. Information exchange among networks is expensive and occurs only once in a while. We propose two consensus + innovation algorithms, for the state estimation and tracking problems respectively. Both algorithms use a novel hierarchical push-sum consensus component. For state estimation, we use dual averaging as the local innovation component. State tracking is much harder to tackle in the presence of packet-dropping link failures: the standard integration of the consensus and innovation approaches is no longer applicable, and dual averaging is no longer feasible. Our tracking algorithm instead introduces a pair of additional variables per link, ensures that the relevant local variables evolve according to the state dynamics, and uses projected local gradient descent as the local innovation component. We also characterize the convergence rates of both algorithms under a linear local observation model and minimal technical assumptions. We numerically validate our algorithms through simulations of both the state estimation and tracking problems.

* 14 pages, 5 figures 
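For intuition, here is a minimal flat (non-hierarchical) consensus + innovation sketch under a linear local observation model, with a plain doubly stochastic mixing step standing in for the paper's hierarchical push-sum component; the ring topology, noise level, and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, T, alpha = 10, 3, 300, 0.1
theta = rng.normal(size=d)                 # unknown static state

# Agent i only sees a scalar projection: y_i = H_i theta + noise.
H = [rng.normal(size=(1, d)) for _ in range(n)]
y = [H[i] @ theta + 0.05 * rng.normal(size=1) for i in range(n)]

# Doubly stochastic mixing weights on an undirected ring.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3
    W[i, i] = 1 / 3

X = np.zeros((n, d))                       # local state estimates
for _ in range(T):
    X = W @ X                              # consensus step
    for i in range(n):                     # local innovation step
        X[i] += alpha * (H[i].T @ (y[i] - H[i] @ X[i])).ravel()

print("max estimation error:", np.abs(X - theta).max())
```

No single agent can identify the state from its scalar observation alone; the consensus step pools information across agents, so the local estimates approach the true state up to the observation noise.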

Towards Bias Correction of FedAvg over Nonuniform and Time-Varying Communications

Jun 01, 2023
Ming Xiang, Stratis Ioannidis, Edmund Yeh, Carlee Joe-Wong, Lili Su


Federated learning (FL) is a decentralized learning framework wherein a parameter server (PS) and a collection of clients collaboratively train a model by minimizing a global objective. Communication bandwidth is a scarce resource; in each round, the PS aggregates the updates from a subset of clients only. In this paper, we focus on non-convex minimization that is vulnerable to non-uniform and time-varying communication failures between the PS and the clients. Specifically, in each round $t$, the link between the PS and client $i$ is active with probability $p_i^t$, which is $\textit{unknown}$ to both the PS and the clients. This arises when the channel conditions are heterogeneous across clients and are changing over time. We show that when the $p_i^t$'s are not uniform, $\textit{Federated Average}$ (FedAvg) -- the most widely adopted FL algorithm -- fails to minimize the global objective. Observing this, we propose $\textit{Federated Postponed Broadcast}$ (FedPBC), a simple variant of FedAvg. It differs from FedAvg in that the PS postpones broadcasting the global model until the end of each round. We show that FedPBC converges to a stationary point of the original objective. The introduced staleness is mild and causes no noticeable slowdown. Both theoretical analysis and numerical results are provided. On the technical front, postponing the global model broadcasts enables implicit gossiping among the clients with active links at round $t$. Although the $p_i^t$'s are time-varying, we are able to bound the perturbation of the global model dynamics via techniques for controlling gossip-type information mixing errors.
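A schematic toy simulation of the mechanism (not a faithful reproduction of FedPBC or its analysis): clients hold quadratic objectives, links fail with non-uniform, unknown probabilities, and the PS averages the active clients' models but broadcasts only at the end of the round, so stale local models act as an implicit gossip buffer. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, T, lr = 20, 5, 2000, 0.05

# Client i holds the quadratic f_i(w) = 0.5 * ||w - C[i]||^2, so the
# global objective (1/m) * sum_i f_i is minimized at the mean of the C[i].
C = rng.normal(size=(m, d))
w_star = C.mean(axis=0)

# Non-uniform link probabilities, unknown to both the PS and the clients.
p = rng.uniform(0.2, 1.0, size=m)
fedavg_target = (p[:, None] * C).sum(axis=0) / p.sum()  # p-weighted minimizer

w_global = np.zeros(d)
w_local = np.zeros((m, d))      # each client's current (possibly stale) model

for t in range(T):
    active = rng.random(m) < p               # link i is up with prob p_i^t
    w_local -= lr * (w_local - C)            # every client takes a local step
    if active.any():
        w_global = w_local[active].mean(axis=0)  # PS aggregates active clients
        w_local[active] = w_global           # broadcast postponed to round end

print("dist to unweighted minimizer:", np.linalg.norm(w_global - w_star))
print("dist to p-weighted target   :", np.linalg.norm(w_global - fedavg_target))
```

Clients with rarely active links accumulate proportionally larger local drift between successful rounds, which offsets their lower appearance frequency in the average -- the implicit-gossiping intuition in miniature.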


Federated Learning in the Presence of Adversarial Client Unavailability

May 31, 2023
Lili Su, Jiaming Xu, Pengkun Yang


Federated learning is a decentralized machine learning framework in which not all clients are able to participate in each round. An emerging line of research is devoted to tackling arbitrary client unavailability. Existing theoretical analyses impose restrictive structural assumptions on the unavailability patterns, and the proposed algorithms are tailored to those assumptions. In this paper, we relax those assumptions and consider adversarial client unavailability. To quantify the degree of client unavailability, we use the notion of {\em $\epsilon$-adversary dropout fraction}. For both non-convex and strongly-convex global objectives, we show that simple variants of FedAvg and FedProx, albeit completely agnostic to $\epsilon$, converge to an estimation error on the order of $\epsilon (G^2 + \sigma^2)$, where $G$ is a heterogeneity parameter and $\sigma^2$ is the noise level. We prove that this estimation error is minimax-optimal. We also show that the variants of FedAvg and FedProx have convergence speeds $O(1/\sqrt{T})$ for non-convex objectives and $O(1/T)$ for strongly-convex objectives, both of which are the best possible for any first-order method that has access only to noisy gradients. Our proofs build upon a tight analysis of the selection bias that persists throughout the entire learning process. We validate our theoretical predictions through numerical experiments on synthetic and real-world datasets.
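The following toy illustrates why an $\epsilon$ fraction of adversarially hidden clients induces an irreducible bias on the order of $\epsilon G$: the adversary hides, at every round, the clients whose gradients would most move the iterate, and the $\epsilon$-agnostic averaged-gradient iteration settles away from the true minimizer. The quadratic losses and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
m, T, eps, lr = 100, 2000, 0.1, 0.05

# Quadratics f_i(w) = 0.5 * (w - mu[i])^2; the global minimizer is mu.mean().
mu = rng.normal(size=m)
target = mu.mean()

w = 0.0
for t in range(T):
    # The adversary hides the eps fraction of clients whose gradients would
    # pull the iterate upward the most at the current point.
    order = np.argsort(mu - w)              # clients with the largest pull last
    avail = order[: m - int(eps * m)]
    w -= lr * (w - mu[avail].mean())        # agnostic averaged-gradient step

print(f"target {target:+.4f}, learned {w:+.4f}, bias {abs(w - target):.4f}")
```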


Privacy-preserving and Uncertainty-aware Federated Trajectory Prediction for Connected Autonomous Vehicles

Mar 08, 2023
Muzi Peng, Jiangwei Wang, Dongjin Song, Fei Miao, Lili Su


Deep learning is the method of choice for trajectory prediction for autonomous vehicles. Unfortunately, its data-hungry nature implicitly requires the availability of sufficiently rich and high-quality centralized datasets, which easily leads to privacy leakage. Besides, uncertainty-awareness becomes increasingly important for safety-critical cyber-physical systems whose prediction modules rely heavily on machine learning tools. In this paper, we relax the data collection requirement and enhance uncertainty-awareness by using Federated Learning on Connected Autonomous Vehicles with an uncertainty-aware global objective. We name our algorithm FLTP. We further introduce ALFLTP, which boosts FLTP by using active learning techniques to adaptively select participating clients. We consider both negative log-likelihood (NLL) and aleatoric uncertainty (AU) as client selection metrics. Experiments on the Argoverse dataset show that FLTP significantly outperforms the model trained on local data. In addition, ALFLTP-AU converges faster in training regression loss and performs better in terms of NLL, minADE, and MR than FLTP in most rounds, and has more stable round-wise performance than ALFLTP-NLL.
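One plausible way to implement uncertainty-driven client selection of the kind described here (the exact ALFLTP criterion is specified in the paper; this softmax rule is an illustrative stand-in): sample the next round's participants with probability increasing in their last reported score (e.g., validation NLL or aleatoric uncertainty).

```python
import numpy as np

def select_clients(scores, k, temperature=1.0, rng=None):
    """Sample k distinct clients, favoring those with higher scores
    (e.g., NLL or aleatoric uncertainty reported in the previous round)."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(scores, dtype=float) / temperature
    probs = np.exp(logits - logits.max())   # stable softmax
    probs /= probs.sum()
    return rng.choice(len(scores), size=k, replace=False, p=probs)

# Clients that are most uncertain are prioritized for the next round.
nll = [0.8, 2.3, 1.1, 0.4, 3.0, 1.7]
print(select_clients(nll, k=3, rng=np.random.default_rng(4)))
```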


$\beta$-Stochastic Sign SGD: A Byzantine Resilient and Differentially Private Gradient Compressor for Federated Learning

Oct 03, 2022
Ming Xiang, Lili Su


Federated Learning (FL) is a nascent privacy-preserving learning framework under which the local data of participating clients is kept locally throughout model training. Scarce communication resources and data heterogeneity are two defining characteristics of FL. Moreover, an FL system is often implemented in a harsh environment, leaving the clients vulnerable to Byzantine attacks. To the best of our knowledge, no gradient compressors simultaneously achieve quantitative Byzantine resilience and privacy preservation. In this paper, we fill this gap by revisiting stochastic sign SGD \cite{jin2020}. We propose $\beta$-stochastic sign SGD, which contains a gradient compressor that encodes a client's gradient information in sign bits subject to the privacy budget $\beta>0$. We show that as long as $\beta>0$, $\beta$-stochastic sign SGD converges in the presence of partial client participation and mobile Byzantine faults, establishing that it achieves quantifiable Byzantine resilience and differential privacy simultaneously. In sharp contrast, when $\beta=0$, the compressor is not differentially private. Notably, in the special case where each of the stochastic gradients involved is bounded with known bounds, our gradient compressor with $\beta=0$ coincides with the compressor proposed in \cite{jin2020}. As a byproduct, we show that when the clients report sign messages, the popular information aggregation rules -- simple mean, trimmed mean, median, and majority vote -- are identical in terms of the output signs. Our theory is corroborated by experiments on the MNIST and CIFAR-10 datasets.
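A sketch of what such a sign-bit compressor can look like. The exact probabilities used by $\beta$-stochastic sign SGD are given in the paper; the randomized-response form below is an illustrative reconstruction with the two properties the abstract highlights: for $\beta>0$ each sign bit is released with probabilities bounded away from 0 and 1 (hence differentially private), while $\beta=0$ reduces to a plain stochastic sign compressor for gradients with a known bound $B$.

```python
import numpy as np

def beta_stochastic_sign(g, B, beta, rng):
    """Encode each gradient coordinate as a +/-1 bit.

    Illustrative reconstruction, not the paper's exact formula: coordinate
    g in [-B, B] is mapped to +1 with probability (B + beta + g) / (2B + 2beta).
    For beta > 0 this probability lies in [beta/(2B+2beta), 1 - beta/(2B+2beta)],
    i.e., randomized response; for beta = 0 it is the plain stochastic sign."""
    g = np.clip(np.asarray(g, dtype=float), -B, B)
    p_plus = (B + beta + g) / (2.0 * (B + beta))
    return np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)

rng = np.random.default_rng(5)
g = np.array([0.5, -1.2, 0.0, 2.0])
print(beta_stochastic_sign(g, B=2.0, beta=1.0, rng=rng))
```

Note that $\mathbb{E}[\text{bit}] = g/(B+\beta)$, so the compressed message still carries the gradient direction in expectation while $\beta$ trades accuracy for privacy.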


Global Convergence of Federated Learning for Mixed Regression

Jun 15, 2022
Lili Su, Jiaming Xu, Pengkun Yang

This paper studies the problem of model training under Federated Learning when clients exhibit cluster structure. We contextualize this problem in mixed regression, where each client has limited local data generated from one of $k$ unknown regression models. We design an algorithm that achieves global convergence from any initialization and works even when the local data volume is highly unbalanced -- there could exist clients that contain only $O(1)$ data points. Our algorithm first runs moment descent on a few anchor clients (each with $\tilde{\Omega}(k)$ data points) to obtain coarse model estimates. Then each client alternately estimates its cluster label and refines the model estimates based on FedAvg or FedProx. A key innovation in our analysis is a uniform estimate of the clustering errors, which we prove by bounding the VC dimension of general polynomial concept classes based on the theory of algebraic geometry.
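A toy sketch of the alternating refinement stage for mixed linear regression (the paper's moment-descent initialization on anchor clients is replaced here by a coarse perturbation of the truth; all sizes are illustrative): each client picks the model with the smallest local loss, and clients sharing a label take an averaged, FedAvg-style gradient step.

```python
import numpy as np

rng = np.random.default_rng(6)
k, d, m, n_i = 2, 3, 60, 5            # clusters, dims, clients, points/client

true = rng.normal(size=(k, d))        # the k unknown regression models
labels = rng.integers(k, size=m)      # each client's hidden cluster
X = rng.normal(size=(m, n_i, d))
Y = np.einsum('mnd,md->mn', X, true[labels]) + 0.01 * rng.normal(size=(m, n_i))

# Stand-in for the moment-descent initialization on anchor clients.
models = true + 0.5 * rng.normal(size=(k, d))

for _ in range(20):
    # Each client picks the model with the smallest local empirical loss ...
    loss = np.stack([((Y - X @ models[j]) ** 2).mean(axis=1) for j in range(k)])
    assign = loss.argmin(axis=0)
    # ... then the clients sharing a label refine it with one averaged,
    # FedAvg-style gradient step.
    for j in range(k):
        idx = np.where(assign == j)[0]
        if len(idx):
            res = X[idx] @ models[j] - Y[idx]             # (s, n_i) residuals
            grads = np.einsum('snd,sn->sd', X[idx], res) / n_i
            models[j] -= 0.1 * grads.mean(axis=0)

acc = max((assign == labels).mean(), (assign == 1 - labels).mean())
print("cluster recovery accuracy (up to label swap):", acc)
```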


Achieving Statistical Optimality of Federated Learning: Beyond Stationary Points

Jun 29, 2021
Lili Su, Jiaming Xu, Pengkun Yang


Federated Learning (FL) is a promising framework with great potential for privacy preservation and for lowering the computation load at the cloud. FedAvg and FedProx are two widely adopted algorithms. However, recent work has raised concerns about these two methods: (1) their fixed points do not correspond to the stationary points of the original optimization problem, and (2) the common model found might not generalize well locally. In this paper, we alleviate these concerns. Toward this, we adopt the statistical learning perspective yet allow the distributions to be heterogeneous and the local data to be unbalanced. We show, in the general kernel regression setting, that both FedAvg and FedProx converge to the minimax-optimal error rates. Moreover, when the kernel function has a finite rank, the convergence is exponentially fast. Our results further analytically quantify the impact of model heterogeneity and characterize the federation gain -- the reduction in estimation error that a worker obtains by joining federated learning compared with using the best local estimator. To the best of our knowledge, we are the first to show the achievability of minimax error rates under FedAvg and FedProx, and the first to characterize the gains of joining FL. Numerical experiments further corroborate our theoretical findings on the statistical optimality of FedAvg and FedProx and on the federation gains.
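One natural way to formalize the federation gain described above (an illustrative formalization; the paper's precise definition may differ): for worker $i$ with local risk $R_i$, define $\mathrm{gain}_i = \min_{\hat{f}_{\mathrm{loc}}} \mathbb{E}\, R_i(\hat{f}_{\mathrm{loc}}) - \mathbb{E}\, R_i(\hat{f}_{\mathrm{FL}})$, where $\hat{f}_{\mathrm{loc}}$ ranges over estimators computed from worker $i$'s data alone and $\hat{f}_{\mathrm{FL}}$ is the FedAvg/FedProx output; a positive gain means federation strictly reduces worker $i$'s estimation error.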


On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective

May 26, 2019
Lili Su, Pengkun Yang


We consider training over-parameterized two-layer neural networks with Rectified Linear Units (ReLU) using the gradient descent (GD) method. Inspired by a recent line of work, we study the evolution of the network prediction errors across GD iterations, which can be neatly described in a matrix form. It turns out that when the network is sufficiently over-parameterized, these matrices individually approximate an integral operator that is determined by the feature vector distribution $\rho$ only. Consequently, the GD method can be viewed as approximately applying the powers of this integral operator to the underlying/target function $f^*$ that generates the responses/labels. We show that if $f^*$ admits a low-rank approximation with respect to the eigenspaces of this integral operator, then, even with a constant stepsize, the empirical risk decreases to this low-rank approximation error at a linear rate in the iteration number $t$. In addition, this linear rate is determined by $f^*$ and $\rho$ only. Furthermore, if $f^*$ has zero low-rank approximation error, then $\Omega(n^2)$ network over-parameterization is enough, and the empirical risk decreases to $\Theta(1/\sqrt{n})$. We provide an application of our general results to the setting where $\rho$ is the uniform distribution on the sphere and $f^*$ is a polynomial.
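The matrix picture in this abstract can be reproduced in a few lines. Below, a Monte-Carlo Gram matrix stands in for the integral operator restricted to a finite sample (in the style of Gram-matrix analyses of wide ReLU networks; sizes and constants are illustrative), and the prediction error is driven through the induced linear iteration.

```python
import numpy as np

rng = np.random.default_rng(7)
n, width = 50, 5000

# Unit-norm inputs and a wide random ReLU feature layer.
X = rng.normal(size=(n, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.normal(size=(width, 3))

# Monte-Carlo surrogate of the integral operator on the sample:
# K_ij = E_w[ 1{w.x_i > 0} 1{w.x_j > 0} ] * <x_i, x_j>.
gates = (X @ W.T > 0).astype(float)        # (n, width) activation patterns
K = (gates @ gates.T / width) * (X @ X.T)

lam = np.linalg.eigvalsh(K)
eta = 1.0 / lam[-1]                        # stable constant step (< 2/lam_max)

# GD on the over-parameterized network approximately drives the prediction
# error u through u <- (I - eta*K) u, so the component of u along the j-th
# eigenvector decays at its own linear rate (1 - eta*lam_j)^t.
u = rng.normal(size=n)
u0 = np.linalg.norm(u)
for _ in range(200):
    u = u - eta * (K @ u)

print("error norm: %.3f -> %.3f" % (u0, np.linalg.norm(u)))
```

Components of the error along large-eigenvalue directions decay geometrically, while those along tiny eigenvalues barely move -- the residual floor corresponds to the low-rank approximation error of $f^*$.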


Securing Distributed Machine Learning in High Dimensions

Jun 08, 2018
Lili Su, Jiaming Xu

Standard distributed machine learning frameworks require collecting the training data from data providers and storing it in a datacenter. To ease privacy concerns, alternative distributed machine learning frameworks (such as {\em Federated Learning}) have been proposed, wherein the training data is kept confidential by its providers from the learner, and the learner learns the model by communicating with the data providers. However, such frameworks suffer from serious security risks, as data providers are vulnerable to adversarial attacks and the learner lacks sufficient administrative power. We assume that in each communication round, up to $q$ out of the $m$ data providers/workers suffer Byzantine faults. Each worker keeps a local sample of size $n$, and the total sample size is $N=nm$. Of particular interest is the high-dimensional regime, where the local sample size $n$ is much smaller than the model dimension $d$. We propose a secured variant of the gradient descent method and show that it tolerates up to a constant fraction of Byzantine workers. Moreover, we show that the statistical estimation error of the iterates converges in $O(\log N)$ rounds to $O(\sqrt{q/N} + \sqrt{d/N})$, which is larger than the minimax-optimal error rate $O(\sqrt{d/N})$ in the failure-free setting by at most an additive term $O(\sqrt{q/N})$. As long as $q=O(d)$, our proposed algorithm achieves the optimal error rate $O(\sqrt{d/N})$. The core of our method is a robust gradient aggregator based on the iterative filtering algorithm proposed by Steinhardt et al. We establish a {\em uniform} concentration of the sample covariance matrix of gradients, and show that the aggregated gradient, as a function of the model parameter, converges uniformly to the true gradient function. As a by-product, we develop a new concentration inequality for sample covariance matrices, which might be of independent interest.
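A simplified sketch of an iterative-filtering-style robust aggregator (illustrative, not the paper's exact procedure): repeatedly find the direction in which the reported gradients have the largest variance and down-weight the gradients that are extreme along it, until the weighted covariance has a small spectral norm.

```python
import numpy as np

def filtered_mean(G, sigma_bound, max_iter=50):
    """Robust mean of gradient rows G (m x d) via iterative filtering:
    down-weight points that are extreme along the top-variance direction
    until the weighted covariance's spectral norm falls below sigma_bound.
    Simplified, Steinhardt-et-al.-style filtering; illustrative only."""
    m = G.shape[0]
    c = np.ones(m)                           # per-worker trust weights
    for _ in range(max_iter):
        w = c / c.sum()
        mu = w @ G
        Z = (G - mu) * np.sqrt(w)[:, None]   # Z.T @ Z = weighted covariance
        _, s, Vt = np.linalg.svd(Z, full_matrices=False)
        if s[0] ** 2 <= sigma_bound:         # covariance already small
            return mu
        v = Vt[0]                            # top-variance direction
        tau = ((G - mu) @ v) ** 2
        c = c * (1.0 - tau / tau.max())      # down-weight outliers along v
    return (c / c.sum()) @ G

rng = np.random.default_rng(8)
m, d, q = 50, 20, 10
G = 0.1 * rng.normal(size=(m, d))            # honest gradients around 0
G[:q] += 5.0                                 # q Byzantine rows, shifted
print("robust mean norm:", np.linalg.norm(filtered_mean(G, sigma_bound=0.05)))
print("naive mean norm :", np.linalg.norm(G.mean(axis=0)))
```

The Byzantine rows create a large-variance direction, so the filter strips them out; the naive mean, by contrast, is dragged far from the honest gradients.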
