Data heterogeneity is an inherent challenge that hinders the performance of federated learning (FL). Recent studies have identified the biased classifiers of local models as the key bottleneck. Previous attempts have used classifier calibration after FL training, but this approach falls short in improving the poor feature representations caused by training-time classifier biases. Resolving the classifier bias dilemma in FL requires a full understanding of the mechanisms behind the classifier. Recent advances in neural collapse have shown that the classifiers and feature prototypes under perfect training scenarios collapse into an optimal structure called simplex equiangular tight frame (ETF). Building on this neural collapse insight, we propose a solution to the FL's classifier bias problem by utilizing a synthetic and fixed ETF classifier during training. The optimal classifier structure enables all clients to learn unified and optimal feature representations even under extremely heterogeneous data. We devise several effective modules to better adapt the ETF structure in FL, achieving both high generalization and personalization. Extensive experiments demonstrate that our method achieves state-of-the-art performances on CIFAR-10, CIFAR-100, and Tiny-ImageNet.
In federated learning (FL), weighted aggregation of local models is conducted to generate a global model, and the aggregation weights are normalized (the sum of weights is 1) and proportional to the local data sizes. In this paper, we revisit the weighted aggregation process and gain new insights into the training dynamics of FL. First, we find that the sum of weights can be smaller than 1, causing global weight shrinking effect (analogous to weight decay) and improving generalization. We explore how the optimal shrinking factor is affected by clients' data heterogeneity and local epochs. Second, we dive into the relative aggregation weights among clients to depict the clients' importance. We develop client coherence to study the learning dynamics and find a critical point that exists. Before entering the critical point, more coherent clients play more essential roles in generalization. Based on the above insights, we propose an effective method for Federated Learning with Learnable Aggregation Weights, named as FedLAW. Extensive experiments verify that our method can improve the generalization of the global model by a large margin on different datasets and models.
The classic Bayesian persuasion model assumes a Bayesian and best-responding receiver. We study a relaxation of the Bayesian persuasion model where the receiver can approximately best respond to the sender's signaling scheme. We show that, under natural assumptions, (1) the sender can find a signaling scheme that guarantees itself an expected utility almost as good as its optimal utility in the classic model, no matter what approximately best-responding strategy the receiver uses; (2) on the other hand, there is no signaling scheme that gives the sender much more utility than its optimal utility in the classic model, even if the receiver uses the approximately best-responding strategy that is best for the sender. Together, (1) and (2) imply that the approximately best-responding behavior of the receiver does not affect the sender's maximal achievable utility a lot in the Bayesian persuasion problem. The proofs of both results rely on the idea of robustification of a Bayesian persuasion scheme: given a pair of the sender's signaling scheme and the receiver's strategy, we can construct another signaling scheme such that the receiver prefers to use that strategy in the new scheme more than in the original scheme, and the two schemes give the sender similar utilities. As an application of our main result (1), we show that, in a repeated Bayesian persuasion model where the receiver learns to respond to the sender by some algorithms, the sender can do almost as well as in the classic model. Interestingly, unlike (2), with a learning receiver the sender can sometimes do much better than in the classic model.
Federated Learning (FL) is a machine learning paradigm that protects privacy by keeping client data on edge devices. However, optimizing FL in practice can be challenging due to the diversity and heterogeneity of the learning system. Recent research efforts have aimed to improve the optimization of FL with distribution shifts, but it is still an open problem how to train FL models when multiple types of distribution shifts, i.e., feature distribution skew, label distribution skew, and concept shift occur simultaneously. To address this challenge, we propose a novel algorithm framework, FedConceptEM, for handling diverse distribution shifts in FL. FedConceptEM automatically assigns clients with concept shifts to different models, avoiding the performance drop caused by these shifts. At the same time, clients without concept shifts, even with feature or label skew, are assigned to the same model, improving the robustness of the trained models. Extensive experiments demonstrate that FedConceptEM outperforms other state-of-the-art cluster-based FL methods by a significant margin.
Gradient tracking (GT) is an algorithm designed for solving decentralized optimization problems over a network (such as training a machine learning model). A key feature of GT is a tracking mechanism that allows to overcome data heterogeneity between nodes. We develop a novel decentralized tracking mechanism, $K$-GT, that enables communication-efficient local updates in GT while inheriting the data-independence property of GT. We prove a convergence rate for $K$-GT on smooth non-convex functions and prove that it reduces the communication overhead asymptotically by a linear factor $K$, where $K$ denotes the number of local steps. We illustrate the robustness and effectiveness of this heterogeneity correction on convex and non-convex benchmark problems and on a non-convex neural network training task with the MNIST dataset.
While many classical notions of learnability (e.g., PAC learnability) are distribution-free, utilizing the specific structures of an input distribution may improve learning performance. For example, a product distribution on a multi-dimensional input space has a much simpler structure than a correlated distribution. A recent paper [GHTZ21] shows that the sample complexity of a general learning problem on product distributions is polynomial in the input dimension, which is exponentially smaller than that on correlated distributions. However, the learning algorithm they use is not the standard Empirical Risk Minimization (ERM) algorithm. In this note, we characterize the sample complexity of ERM in a general learning problem on product distributions. We show that, even though product distributions are simpler than correlated distributions, ERM still needs an exponential number of samples to learn on product distributions, instead of a polynomial. This leads to the conclusion that a product distribution by itself does not make a learning problem easier -- an algorithm designed specifically for product distributions is needed.
We consider a Bayesian forecast aggregation model where $n$ experts, after observing private signals about an unknown binary event, report their posterior beliefs about the event to a principal, who then aggregates the reports into a single prediction for the event. The signals of the experts and the outcome of the event follow a joint distribution that is unknown to the principal, but the principal has access to i.i.d. "samples" from the distribution, where each sample is a tuple of experts' reports (not signals) and the realization of the event. Using these samples, the principal aims to find an $\varepsilon$-approximately optimal (Bayesian) aggregator. We study the sample complexity of this problem. We show that, for arbitrary discrete distributions, the number of samples must be at least $\tilde \Omega(m^{n-2} / \varepsilon)$, where $m$ is the size of each expert's signal space. This sample complexity grows exponentially in the number of experts $n$. But if experts' signals are independent conditioned on the realization of the event, then the sample complexity is significantly reduced, to $\tilde O(1 / \varepsilon^2)$, which does not depend on $n$.
Federated learning (FL) is a distributed machine learning paradigm that selects a subset of clients to participate in training to reduce communication burdens. However, partial client participation in FL causes \emph{objective inconsistency}, which can hinder the convergence, while this objective inconsistency has not been analyzed in existing studies on sampling methods. To tackle this issue, we propose an improved analysis method that focuses on the convergence behavior of the practical participated client's objective. Moreover, based on our convergence analysis, we give a novel unbiased sampling strategy, i.e., FedSRC-D, whose sampling probability is proportional to the client's gradient diversity and local variance. FedSRC-D is provable the optimal unbiased sampling in non-convex settings for non-IID FL with respect to the given bounds. Specifically, FedSRC-D achieves $\mathop{O}(\frac{G^2}{\epsilon^2}+\frac{1}{\epsilon^{2/3}})$ higher than SOTA convergence rate of FedAvg, and $\mathop{O}(\frac{G^2}{\epsilon^2})$ higher than other unbiased sampling methods. We corroborate our results with experiments on both synthetic and real data sets.
Federated Learning (FL) is a machine learning paradigm that learns from data kept locally to safeguard the privacy of clients, whereas local SGD is typically employed on the clients' devices to improve communication efficiency. However, such a scheme is currently constrained by the slow and unstable convergence induced by clients' heterogeneous data. In this work, we identify three under-explored phenomena of the biased local learning that may explain these challenges caused by local updates in supervised FL. As a remedy, we propose FedAug, a novel unified algorithm that reduces the local learning bias on features and classifiers to tackle these challenges. FedAug consists of two components: AugMean and AugCA. AugMean alleviates the bias in the local classifiers by balancing the output distribution of models. AugCA learns client invariant features that are close to global features but considerably distinct from those learned from other input distributions. In a series of experiments, we show that FedAug consistently outperforms other SOTA FL and domain generalization (DG) baselines, in which both two components (i.e., AugMean and AugCA) have individual performance gains.
Federated Learning (FL) is a machine learning paradigm where many clients collaboratively learn a shared global model with decentralized training data. Personalization on FL model additionally adapts the global model to different clients, achieving promising results on consistent local training & test distributions. However, for real-world personalized FL applications, it is crucial to go one step further: robustifying FL models under evolving local test set during deployment, where various types of distribution shifts can arise. In this work, we identify the pitfalls of existing works under test-time distribution shifts and propose a novel test-time robust personalization method, namely Federated Test-time Head Ensemble plus tuning (FedTHE+). We illustrate the advancement of FedTHE+ (and its degraded computationally efficient variant FedTHE) over strong competitors, for training various neural architectures (CNN, ResNet, and Transformer) on CIFAR10 and ImageNet and evaluating on diverse test distributions. Along with this, we build a benchmark for assessing performance and robustness of personalized FL methods during deployment.