Shuheng Shen

Joint Local Relational Augmentation and Global Nash Equilibrium for Federated Learning with Non-IID Data

Aug 17, 2023
Xinting Liao, Chaochao Chen, Weiming Liu, Pengyang Zhou, Huabin Zhu, Shuheng Shen, Weiqiang Wang, Mengling Hu, Yanchao Tan, Xiaolin Zheng

Federated learning (FL) is a distributed machine learning paradigm that requires collaboration between a server and a set of clients holding decentralized data. To make FL effective in real-world applications, existing work focuses on improving the modeling of decentralized data with non-independent and identically distributed (non-IID) characteristics. In non-IID settings, there is intra-client inconsistency arising from imbalanced data, and inter-client inconsistency among heterogeneous client distributions; together these not only hinder sufficient representation of minority data but also introduce discrepant model deviations. However, previous work overlooks tackling these two coupled inconsistencies jointly. In this work, we propose FedRANE, which consists of two main modules, local relational augmentation (LRA) and global Nash equilibrium (GNE), to resolve intra- and inter-client inconsistency simultaneously. Specifically, on each client, LRA mines the similarity relations among data samples and enhances minority sample representations with their neighbors using attentive message passing. On the server, GNE reaches an agreement among the inconsistent and discrepant model deviations sent from clients, encouraging the global model to update toward the global optimum without disrupting the clients' optimization toward their local optima. We conduct extensive experiments on four benchmark datasets to show the superiority of FedRANE in enhancing the performance of FL with non-IID data.
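
As an illustration of the attentive message-passing idea behind LRA, the PyTorch sketch below augments each sample's embedding with an attention-weighted average of its most similar in-batch neighbors. The function name lra_augment, the use of cosine similarity, the neighbor count k, and the mixing weight alpha are illustrative assumptions rather than the paper's actual formulation; the GNE module is not sketched here.

```python
import torch
import torch.nn.functional as F

def lra_augment(features: torch.Tensor, k: int = 5, alpha: float = 0.5) -> torch.Tensor:
    """Mix each sample's embedding with an attention-weighted average of its
    k most similar in-batch neighbors (a toy stand-in for LRA)."""
    normed = F.normalize(features, dim=1)          # project onto the unit sphere
    sim = normed @ normed.t()                      # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))              # exclude self-matches
    topk_sim, topk_idx = sim.topk(k, dim=1)        # k nearest neighbors per sample
    attn = F.softmax(topk_sim, dim=1)              # attention weights over neighbors
    neighbor_msg = (attn.unsqueeze(-1) * features[topk_idx]).sum(dim=1)
    return alpha * features + (1 - alpha) * neighbor_msg

# Example: augment a local batch of 32 embeddings of dimension 128.
x = torch.randn(32, 128)
x_aug = lra_augment(x)          # minority samples borrow signal from their neighbors
```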

* To appear in ACM International Conference on Multimedia (ACM MM23) 

Differentially Private Learning with Per-Sample Adaptive Clipping

Dec 01, 2022
Tianyu Xia, Shuheng Shen, Su Yao, Xinyi Fu, Ke Xu, Xiaolong Xu, Xing Fu, Weiqiang Wang

Privacy in AI has drawn growing attention from researchers and the general public in recent years. As one way to implement privacy-preserving AI, differentially private learning is a framework that enables AI models to use differential privacy (DP). To achieve DP in the learning process, existing algorithms typically limit the magnitude of gradients with a constant clipping threshold, which requires careful tuning due to its significant impact on model performance. To address this issue, recent works NSGD and Auto-S propose using normalization instead of clipping to avoid hyperparameter tuning. However, normalization-based approaches like NSGD and Auto-S rely on a monotonic weight function, which places excessive weight on samples with small gradients and introduces extra deviation into the update. In this paper, we propose the Differentially Private Per-Sample Adaptive Clipping (DP-PSAC) algorithm, based on a non-monotonic adaptive weight function, which guarantees privacy without the hyperparameter tuning typically required by constant clipping while significantly reducing the deviation between the update and the true batch-averaged gradient. We provide a rigorous theoretical convergence analysis and show that, at the same order of convergence rate, the proposed algorithm achieves a lower non-vanishing bound, maintained over training iterations, than NSGD/Auto-S. In addition, through extensive experimental evaluation, we show that DP-PSAC outperforms or matches state-of-the-art methods on multiple mainstream vision and language tasks.
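
To make the contrast with constant clipping concrete, here is a minimal NumPy sketch of one differentially private step that re-weights each per-sample gradient with a non-monotonic function before adding noise. The specific weight clip / (norm + r / norm) and all parameter names are illustrative assumptions, not necessarily the exact DP-PSAC weight function; the key point is that every weighted per-sample gradient has norm at most clip, which bounds the sensitivity.

```python
import numpy as np

def dp_weighted_step(per_sample_grads, lr, clip=1.0, r=0.01,
                     noise_multiplier=1.0, rng=None):
    """One DP update from per-sample gradients of shape (batch, dim)."""
    rng = rng or np.random.default_rng()
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True) + 1e-12
    # Non-monotonic per-sample weight: unlike pure normalization (clip / norm),
    # it does not blow up the contribution of samples with tiny gradients.
    weights = clip / (norms + r / norms)
    weighted = per_sample_grads * weights          # each row now has norm <= clip
    noise = noise_multiplier * clip * rng.standard_normal(per_sample_grads.shape[1])
    update = (weighted.sum(axis=0) + noise) / per_sample_grads.shape[0]
    return -lr * update                            # parameter increment

# Toy usage with random per-sample gradients.
grads = np.random.default_rng(0).standard_normal((64, 10))
delta_w = dp_weighted_step(grads, lr=0.1)
```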

* To appear in AAAI 2023 

Deep Unified Representation for Heterogeneous Recommendation

Jan 26, 2022
Chengqiang Lu, Mingyang Yin, Shuheng Shen, Luo Ji, Qi Liu, Hongxia Yang

Recommendation has been a widely studied task in both academia and industry. Previous work mainly focuses on homogeneous recommendation, and little progress has been made for heterogeneous recommender systems. However, heterogeneous recommendation, e.g., recommending different types of items including products, videos, celebrity shopping notes, and many others, is dominant nowadays. State-of-the-art methods are incapable of leveraging attributes from different types of items and thus suffer from data sparsity; it is also quite challenging to represent items with different feature spaces jointly. To tackle this problem, we propose a kernel-based neural network, namely deep unified representation (DURation) for heterogeneous recommendation, to jointly model unified representations of heterogeneous items while preserving the topology of their original feature spaces. Theoretically, we prove the representation ability of the proposed model. We further conduct extensive experiments on real-world datasets. Experimental results demonstrate that with the unified representation, our model achieves remarkable improvements (e.g., a 4.1% ~ 34.9% lift in AUC and a 3.7% lift in online CTR) over existing state-of-the-art models.
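
The core idea of projecting items with different raw feature spaces into one shared space can be sketched as follows. This is a plain multi-layer-perceptron illustration under assumed item types and dimensions, not the paper's kernel-based architecture or its topology-preserving objective.

```python
import torch
import torch.nn as nn

class UnifiedItemEncoder(nn.Module):
    """Map heterogeneous item types with different raw feature spaces
    into one shared representation space (illustrative only)."""
    def __init__(self, feature_dims: dict, unified_dim: int = 64):
        super().__init__()
        # One projection head per item type (e.g. product, video, shopping note).
        self.encoders = nn.ModuleDict({
            item_type: nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                     nn.Linear(128, unified_dim))
            for item_type, dim in feature_dims.items()
        })

    def forward(self, item_type: str, features: torch.Tensor) -> torch.Tensor:
        return self.encoders[item_type](features)

# Items of both types end up as comparable 64-d vectors.
encoder = UnifiedItemEncoder({"product": 32, "video": 300})
product_emb = encoder("product", torch.randn(8, 32))
video_emb = encoder("video", torch.randn(8, 300))
```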

* 12 pages, 4 figures, accepted by the ACM Web Conference 2022 (WWW '22) 

STL-SGD: Speeding Up Local SGD with Stagewise Communication Period

Jun 11, 2020
Shuheng Shen, Yifei Cheng, Jingchang Liu, Linli Xu

Distributed parallel stochastic gradient descent algorithms are workhorses for large-scale machine learning tasks. Among them, local stochastic gradient descent (Local SGD) has attracted significant attention due to its low communication complexity. Previous studies prove that the communication complexity of Local SGD with a fixed or an adaptive communication period is on the order of $O (N^{\frac{3}{2}} T^{\frac{1}{2}})$ and $O (N^{\frac{3}{4}} T^{\frac{3}{4}})$ when the data distributions on clients are identical (IID) or otherwise (non-IID). In this paper, to accelerate convergence by reducing the communication complexity, we propose STagewise Local SGD (STL-SGD), which gradually increases the communication period as the learning rate decreases. We prove that STL-SGD retains the same convergence rate and linear speedup as mini-batch SGD. In addition, owing to the increasing communication period, when the objective is strongly convex or satisfies the Polyak-Łojasiewicz condition, the communication complexity of STL-SGD is $O (N \log{T})$ and $O (N^{\frac{1}{2}} T^{\frac{1}{2}})$ for the IID case and the non-IID case respectively, a significant improvement over Local SGD. Experiments on both convex and non-convex problems demonstrate the superior performance of STL-SGD.
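
The schedule itself is easy to picture: each stage lengthens the communication period while shrinking the learning rate. The single-process simulation below uses assumed doubling/halving factors and a toy quadratic objective purely to illustrate the schedule, not the constants analyzed in the paper.

```python
import numpy as np

def stl_sgd_simulation(num_workers=4, dim=10, stages=5,
                       base_period=2, base_lr=0.1, steps_per_stage=100, seed=0):
    rng = np.random.default_rng(seed)
    targets = rng.standard_normal((num_workers, dim))   # worker-local optima (non-IID)
    x = np.zeros((num_workers, dim))                    # one model copy per worker
    for s in range(stages):
        period = base_period * 2 ** s                   # communication period grows ...
        lr = base_lr / 2 ** s                           # ... as the learning rate decays
        for t in range(steps_per_stage):
            grads = x - targets + 0.1 * rng.standard_normal(x.shape)
            x -= lr * grads                             # local SGD step on every worker
            if (t + 1) % period == 0:
                x[:] = x.mean(axis=0)                   # periodic model averaging
    return x.mean(axis=0)

print(stl_sgd_simulation())                             # approaches the mean of the local optima
```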

* 27 pages, 16 figures 

Variance Reduced Local SGD with Lower Communication Complexity

Dec 30, 2019
Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, Yifei Cheng

To accelerate the training of machine learning models, distributed stochastic gradient descent (SGD) and its variants have been widely adopted, employing multiple workers in parallel to speed up training. Among them, Local SGD has gained much attention due to its lower communication cost. Nevertheless, when the data distributions on workers are non-identical, Local SGD requires $O(T^{\frac{3}{4}} N^{\frac{3}{4}})$ communications to maintain its linear iteration speedup property, where $T$ is the total number of iterations and $N$ is the number of workers. In this paper, we propose Variance Reduced Local SGD (VRL-SGD) to further reduce the communication complexity. By eliminating the dependency on the gradient variance among workers, we theoretically prove that VRL-SGD achieves a linear iteration speedup with a lower communication complexity $O(T^{\frac{1}{2}} N^{\frac{3}{2}})$ even if workers access non-identical datasets. We conduct experiments on three machine learning tasks, and the results demonstrate that VRL-SGD significantly outperforms Local SGD when the data across workers are highly diverse.
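
A drift-corrected local update can be sketched in a few lines. The control-variate recursion below (essentially a SCAFFOLD-style correction) is an illustrative assumption standing in for the VRL-SGD update, used only to show how a per-worker correction term removes the dependence on cross-worker gradient variance.

```python
import numpy as np

def variance_reduced_local_sgd(num_workers=4, dim=10, rounds=50,
                               local_steps=5, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    targets = rng.standard_normal((num_workers, dim))   # non-identical local objectives
    x_global = np.zeros(dim)
    c_local = np.zeros((num_workers, dim))              # per-worker correction terms
    c_global = np.zeros(dim)
    for _ in range(rounds):
        y = np.tile(x_global, (num_workers, 1))         # start from the shared model
        for _ in range(local_steps):
            grads = y - targets + 0.1 * rng.standard_normal(y.shape)
            y -= lr * (grads - c_local + c_global)      # drift-corrected local step
        # Refresh the corrections from the drift accumulated this round.
        c_new = c_local - c_global + (x_global - y) / (lr * local_steps)
        c_global = c_global + (c_new - c_local).mean(axis=0)
        c_local = c_new
        x_global = y.mean(axis=0)                       # periodic averaging
    return x_global

print(variance_reduced_local_sgd())                     # close to the mean of the local optima
```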

* 25 pages, 6 figures. The paper presents a novel variance reduction algorithm for Local SGD 

Faster Distributed Deep Net Training: Computation and Communication Decoupled Stochastic Gradient Descent

Jun 28, 2019
Shuheng Shen, Linli Xu, Jingchang Liu, Xianfeng Liang, Yifei Cheng

With the growth of data volumes and model sizes, distributed parallel training has become an important and successful technique for addressing the resulting optimization challenges. Nevertheless, although distributed stochastic gradient descent (SGD) algorithms can achieve a linear iteration speedup, in practice they are limited significantly by the communication cost, making it difficult to achieve a linear time speedup. In this paper, we propose a computation and communication decoupled stochastic gradient descent (CoCoD-SGD) algorithm that runs computation and communication in parallel to reduce the communication cost. We prove that CoCoD-SGD has a linear iteration speedup with respect to the total computation capability of the hardware resources. In addition, it has lower communication complexity and better time speedup compared with traditional distributed SGD algorithms. Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD: when training ResNet18 and VGG16 with 16 GeForce GTX 1080Ti GPUs, CoCoD-SGD is up to 2-3$\times$ faster than traditional synchronous SGD.
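
The decoupling idea can be illustrated with a toy round in which the averaging of a model snapshot runs in a background thread while local computation keeps going; the averaged snapshot plus the progress made in the meantime becomes the new model. The threading setup and the quadratic toy objective are assumptions for illustration, not the paper's implementation.

```python
import threading
import numpy as np

def average_models(models, out):
    out[:] = models.mean(axis=0)                        # stands in for an all-reduce

def cocod_round(x_local, targets, lr=0.05, local_steps=5, rng=None):
    rng = rng or np.random.default_rng()
    snapshot = x_local.copy()
    averaged = np.empty(x_local.shape[1])
    # Start "communication" on the snapshot ...
    comm = threading.Thread(target=average_models, args=(snapshot, averaged))
    comm.start()
    # ... while computation continues in parallel on the local copies.
    for _ in range(local_steps):
        grads = x_local - targets + 0.1 * rng.standard_normal(x_local.shape)
        x_local = x_local - lr * grads
    comm.join()
    # New local models = averaged snapshot + progress made while communicating.
    return averaged + (x_local - snapshot)

rng = np.random.default_rng(0)
workers, dim = 4, 10
x = np.zeros((workers, dim))
targets = rng.standard_normal((workers, dim))
for _ in range(30):
    x = cocod_round(x, targets, rng=rng)
```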

* IJCAI2019, 20 pages, 21 figures 

Asynchronous Stochastic Composition Optimization with Variance Reduction

Nov 15, 2018
Shuheng Shen, Linli Xu, Jingchang Liu, Junliang Guo, Qing Ling

Composition optimization has drawn a lot of attention in a wide variety of machine learning domains, from risk management to reinforcement learning. Existing methods for the composition optimization problem often work in a sequential, single-machine manner, which limits their applicability to large-scale problems. To address this issue, this paper proposes two asynchronous parallel variance-reduced stochastic compositional gradient (AsyVRSC) algorithms suited to large-scale datasets: AsyVRSC-Shared for the shared-memory architecture and AsyVRSC-Distributed for the master-worker architecture. The embedded variance-reduction techniques enable the algorithms to achieve linear convergence rates. Furthermore, AsyVRSC-Shared and AsyVRSC-Distributed enjoy provable linear speedup when the time delays are bounded by the data dimensionality or by the sparsity ratio of the partial gradients, respectively. Extensive experiments verify the effectiveness of the proposed algorithms.
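
For intuition about variance-reduced compositional gradients, here is a compact synchronous sketch on the toy objective f(x) = 0.5 * ||mean_i(A_i x) - b||^2. The asynchronous shared-memory and master-worker machinery of AsyVRSC is omitted, and the SVRG-style estimator below is a generic construction chosen for illustration, not necessarily the exact AsyVRSC update.

```python
import numpy as np

def vr_compositional_sgd(A, b, epochs=10, inner_iters=200, lr=0.05, seed=0):
    """SVRG-style estimator for f(x) = 0.5 * ||mean_i(A_i x) - b||^2."""
    rng = np.random.default_rng(seed)
    n, _, d = A.shape
    A_bar = A.mean(axis=0)
    x = np.zeros(d)
    for _ in range(epochs):
        x_ref = x.copy()                           # snapshot point
        inner_ref = A_bar @ x_ref                  # full inner value G(x_ref)
        grad_ref = A_bar.T @ (inner_ref - b)       # full gradient at the snapshot
        for _ in range(inner_iters):
            i, j = rng.integers(n, size=2)
            # Variance-reduced estimate of the inner function G(x).
            inner_est = A[i] @ x - A[i] @ x_ref + inner_ref
            # Variance-reduced compositional gradient estimate.
            grad_est = (A[j].T @ (inner_est - b)
                        - A[j].T @ (inner_ref - b)
                        + grad_ref)
            x -= lr * grad_est
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5, 5))                # 20 inner component maps G_i(x) = A_i x
b = rng.standard_normal(5)
print(vr_compositional_sgd(A, b))
```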

* 30 pages, 19 figures 