Cong Xie

LEMON: Lossless model expansion

Oct 12, 2023
Yite Wang, Jiahao Su, Hanlin Lu, Cong Xie, Tianyi Liu, Jianbo Yuan, Haibin Lin, Ruoyu Sun, Hongxia Yang

Scaling of deep neural networks, especially Transformers, is pivotal for their surging performance and has further led to the emergence of sophisticated reasoning capabilities in foundation models. Such scaling generally requires training large models from scratch with random initialization, failing to leverage the knowledge acquired by their smaller counterparts, which are already resource-intensive to obtain. To tackle this inefficiency, we present $\textbf{L}$ossl$\textbf{E}$ss $\textbf{MO}$del Expansio$\textbf{N}$ (LEMON), a recipe to initialize scaled models using the weights of their smaller but pre-trained counterparts. This is followed by model training with an optimized learning rate scheduler tailored explicitly for the scaled models, substantially reducing the training time compared to training from scratch. Notably, LEMON is versatile, ensuring compatibility with various network structures, including models like Vision Transformers and BERT. Our empirical results demonstrate that LEMON reduces computational costs by 56.7% for Vision Transformers and 33.2% for BERT when compared to training from scratch.
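The "lossless" in LEMON means the expanded model computes exactly the same function as the pre-trained small model at initialization. The paper's expansion rules are architecture-specific; as a minimal numpy sketch of the underlying idea (not LEMON's actual recipe), a hidden layer can be widened by duplicating its neurons and halving their outgoing weights, so the duplicated contributions sum back to the original output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Small 2-layer MLP: 4 -> 3 -> 2, with a ReLU hidden activation.
W1 = rng.standard_normal((3, 4))   # hidden-layer weights
W2 = rng.standard_normal((2, 3))   # output-layer weights

def forward(x, w1, w2):
    return w2 @ np.maximum(w1 @ x, 0.0)

# Lossless width expansion: duplicate each hidden neuron (3 -> 6)
# and halve the outgoing weights, so each duplicated pair contributes
# exactly what the original neuron did.
W1_big = np.concatenate([W1, W1], axis=0)            # (6, 4)
W2_big = np.concatenate([W2 / 2, W2 / 2], axis=1)    # (2, 6)

x = rng.standard_normal(4)
assert np.allclose(forward(x, W1, W2), forward(x, W1_big, W2_big))
```

Because the widened model starts at the same function, training can resume from the small model's loss level rather than from random initialization.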

* Preprint 
Viaarxiv icon

Baechi: Fast Device Placement of Machine Learning Graphs

Jan 20, 2023
Beomyeol Jeon, Linda Cai, Chirag Shetty, Pallavi Srivastava, Jintao Jiang, Xiaolan Ke, Yitao Meng, Cong Xie, Indranil Gupta

Machine Learning graphs (or models) can be challenging or impossible to train when devices have limited memory or models are large. Learning-based approaches to splitting the model across devices remain popular; while they produce placements that train fast on data (i.e., low step times), learning-based model parallelism is itself time-consuming, taking many hours or days to create a placement plan of operators on devices. We present the Baechi system, the first to adopt an algorithmic approach to the placement problem for running machine learning training graphs on small clusters of memory-constrained devices. We integrate our implementation of Baechi into two popular open-source learning frameworks: TensorFlow and PyTorch. Our experimental results using GPUs show that: (i) Baechi generates placement plans 654x-206Kx faster than state-of-the-art learning-based approaches, and (ii) the step (training) time of Baechi-placed models is comparable to expert placements in PyTorch, and only up to 6.2% worse than expert placements in TensorFlow. We prove mathematically that our two algorithms are within a constant factor of the optimal. Our work shows that, compared to learning-based approaches, algorithmic approaches face different challenges in adapting to machine learning systems, but they also offer provable bounds and significant performance benefits.
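For intuition about what an algorithmic (as opposed to learned) placer does, a toy greedy scheme in the same spirit walks operators in topological order and assigns each to the device that can start it earliest under a memory budget. The names and cost model below are hypothetical illustrations, not Baechi's actual algorithms:

```python
# Hypothetical sketch of an algorithmic (non-learned) device placer:
# visit operators in topological order and put each on the device
# that can start it earliest while respecting a memory budget.

def place(ops, device_mem, op_cost, op_mem):
    """ops: list of (name, deps) in topological order."""
    free = dict(device_mem)               # remaining memory per device
    ready = {d: 0.0 for d in device_mem}  # time each device frees up
    finish, placement = {}, {}
    for name, deps in ops:
        candidates = [d for d in free if free[d] >= op_mem[name]]
        if not candidates:
            raise MemoryError(f"no device fits {name}")
        # earliest start = max(device ready time, dependencies' finishes)
        def start(d):
            return max([ready[d]] + [finish[p] for p in deps])
        d = min(candidates, key=start)
        placement[name] = d
        finish[name] = start(d) + op_cost[name]
        ready[d] = finish[name]
        free[d] -= op_mem[name]
    return placement

ops = [("a", []), ("b", ["a"]), ("c", ["a"]), ("d", ["b", "c"])]
p = place(ops,
          device_mem={"gpu0": 3, "gpu1": 3},
          op_cost={"a": 1, "b": 2, "c": 2, "d": 1},
          op_mem={"a": 1, "b": 1, "c": 1, "d": 1})
assert all(v in {"gpu0", "gpu1"} for v in p.values())
```

A placer like this runs in near-linear time in the graph size, which is the source of the orders-of-magnitude planning-time advantage over learned placers.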

* Extended version of SoCC 2020 paper: https://dl.acm.org/doi/10.1145/3419111.3421302 

Learning Shape Priors by Pairwise Comparison for Robust Semantic Segmentation

Apr 23, 2022
Cong Xie, Hualuo Liu, Shilei Cao, Dong Wei, Kai Ma, Liansheng Wang, Yefeng Zheng

Semantic segmentation is important in medical image analysis. Inspired by the strong ability of traditional image analysis techniques to capture shape priors and inter-subject similarity, many deep learning (DL) models have recently been proposed to exploit such prior information and have achieved robust performance. However, these two types of important prior information are usually studied separately in existing models. In this paper, we propose a novel DL model that captures both types of priors within a single framework. Specifically, we introduce an extra encoder into the classic encoder-decoder structure to form a Siamese structure for the encoders, where one takes a target image as input (the image-encoder) and the other takes a template image concatenated with its foreground regions as input (the template-encoder). The template-encoder encodes the shape priors and appearance characteristics of each foreground class in the template image. A cosine-similarity-based attention module is proposed to fuse the information from both encoders, utilizing both types of prior information encoded by the template-encoder and modeling the inter-subject similarity for each foreground class. Extensive experiments on two public datasets demonstrate that our proposed method outperforms competing methods.
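As a sketch of what a cosine-similarity-based attention between the two encoders' features might compute (shapes and details are illustrative assumptions, not the paper's exact module), each image location attends over template locations by cosine similarity, pulling template information into alignment with the target image:

```python
import numpy as np

# Illustrative cosine-similarity attention between two feature maps.
# Features are (C, N): C channels, N spatial locations (flattened).

def cosine_attention(f_img, f_tmp):
    a = f_img / np.linalg.norm(f_img, axis=0, keepdims=True)
    b = f_tmp / np.linalg.norm(f_tmp, axis=0, keepdims=True)
    sim = a.T @ b                             # (N, N) cosine similarities
    w = np.exp(sim)
    attn = w / w.sum(axis=1, keepdims=True)   # softmax over template locations
    fused = f_tmp @ attn.T                    # template features per image location
    return fused, attn

rng = np.random.default_rng(0)
fused, attn = cosine_attention(rng.standard_normal((4, 5)),
                               rng.standard_normal((4, 5)))
assert fused.shape == (4, 5)
assert np.allclose(attn.sum(axis=1), 1.0)
```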

* IEEE ISBI 2021 

RECIST-Net: Lesion detection via grouping keypoints on RECIST-based annotation

Jul 19, 2021
Cong Xie, Shilei Cao, Dong Wei, Hongyu Zhou, Kai Ma, Xianli Zhang, Buyue Qian, Liansheng Wang, Yefeng Zheng

Universal lesion detection in computed tomography (CT) images is an important yet challenging task due to the large variations in lesion type, size, shape, and appearance. Considering that data in clinical routine (such as the DeepLesion dataset) are usually annotated with a long and a short diameter following the Response Evaluation Criteria in Solid Tumors (RECIST) standard, we propose RECIST-Net, a new approach to lesion detection in which the four extreme points and the center point of the RECIST diameters are detected. By detecting a lesion as keypoints, we obtain a conceptually more straightforward formulation of detection and overcome several drawbacks of existing bounding-box-based methods (e.g., the extensive effort required to design data-appropriate anchors and the loss of shape information), while exploring a single-task, one-stage approach compared to other RECIST-based approaches. Experiments show that RECIST-Net achieves a sensitivity of 92.49% at four false positives per image, outperforming other recent methods, including those using multi-task learning.
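As a small illustration of the keypoint formulation (a hypothetical helper, not code from the paper), four detected extreme points determine an axis-aligned bounding box, and the detected center can serve as a sanity check when grouping keypoints into a lesion:

```python
# Illustrative helper (not from the paper): turn four detected extreme
# points of the RECIST diameters into an axis-aligned bounding box,
# optionally checking the grouping against the detected center point.

def keypoints_to_bbox(left, right, top, bottom, center=None, tol=None):
    """Each point is (x, y); returns (x_min, y_min, x_max, y_max)."""
    xs = [p[0] for p in (left, right, top, bottom)]
    ys = [p[1] for p in (left, right, top, bottom)]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    if center is not None and tol is not None:
        cx, cy = center
        # reject keypoint groupings whose center lies outside the box margin
        assert x_min - tol <= cx <= x_max + tol
        assert y_min - tol <= cy <= y_max + tol
    return x_min, y_min, x_max, y_max

box = keypoints_to_bbox((10, 50), (90, 55), (45, 20), (55, 80),
                        center=(50, 50), tol=5)
assert box == (10, 20, 90, 80)
```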

* 5 pages, 3 figures, IEEE ISBI 2021 

Compressed Communication for Distributed Training: Adaptive Methods and System

May 17, 2021
Yuchen Zhong, Cong Xie, Shuai Zheng, Haibin Lin

Communication overhead severely hinders the scalability of distributed machine learning systems. Recently, there has been growing interest in using gradient compression to reduce the communication overhead of distributed training. However, little is understood about applying gradient compression to adaptive gradient methods, and its performance benefits are often limited by non-negligible compression overhead. In this paper, we first introduce a novel adaptive gradient method with gradient compression. We show that the proposed method has a convergence rate of $\mathcal{O}(1/\sqrt{T})$ for non-convex problems. In addition, we develop a scalable system called BytePS-Compress for two-way compression, in which gradients are compressed in both directions between workers and parameter servers. BytePS-Compress pipelines compression and decompression on CPUs and achieves a high degree of parallelism. Empirical evaluations show that, with 25 Gb/s networking, we reduce the training time of ResNet50, VGG16, and BERT-base by 5.0%, 58.1%, and 23.3%, respectively, without any loss of accuracy. Furthermore, for training the BERT models, we achieve a compression rate of 333x compared to mixed-precision training.
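As a concrete example of the kind of compressor involved (illustrative only; this is not the BytePS-Compress API), top-k sparsification transmits only the k largest-magnitude gradient entries and reconstructs a sparse gradient on the other side:

```python
import numpy as np

# Minimal top-k sparsifier, the kind of compressor such systems apply
# to gradients in both directions (worker -> server, server -> worker).

def topk_compress(grad, k):
    idx = np.argsort(np.abs(grad))[-k:]   # indices of the k largest |g|
    return idx, grad[idx]                 # only this gets transmitted

def topk_decompress(idx, vals, size):
    out = np.zeros(size)
    out[idx] = vals
    return out

g = np.array([0.1, -2.0, 0.05, 3.0, -0.2])
idx, vals = topk_compress(g, k=2)
g_hat = topk_decompress(idx, vals, g.size)
assert np.allclose(g_hat, [0.0, -2.0, 0.0, 3.0, 0.0])
```

The compression overhead the abstract mentions is the cost of the `argsort`-style selection and of the reconstruction, which is why pipelining compression and decompression on CPUs matters.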


Visual Steering for One-Shot Deep Neural Network Synthesis

Sep 28, 2020
Anjul Tyagi, Cong Xie, Klaus Mueller

Recent advancements in deep learning have shown the effectiveness of very large neural networks in several applications. However, as these deep neural networks continue to grow in size, it becomes more and more difficult to configure their many parameters to obtain good results. Presently, analysts must experiment with many different configurations and parameter settings, which is labor-intensive and time-consuming. On the other hand, the capacity of fully automated techniques for neural network architecture search is limited without the domain knowledge of human experts. To address this problem, we formulate the task of neural network architecture optimization as graph space exploration, based on the one-shot architecture search technique. In this approach, a super-graph of all candidate architectures is trained in one shot and the optimal neural network is identified as a sub-graph. In this paper, we present a framework that allows analysts to effectively build the solution sub-graph space and guide the network search by injecting their domain knowledge. Starting with a network architecture space composed of basic neural network components, analysts are empowered to effectively select the most promising components via our one-shot search scheme. Applying this technique iteratively allows analysts to converge on the best-performing neural network architecture for a given application. During the exploration, analysts can use their domain knowledge, aided by cues from a scatterplot visualization of the search space, to edit different components and guide the search toward faster convergence. We designed our interface in collaboration with several deep learning researchers, and its effectiveness is evaluated with a user study and two case studies.

* 9 pages, submitted to IEEE Transactions on Visualization and Computer Graphics, 2020 

SMAP: A Joint Dimensionality Reduction Scheme for Secure Multi-Party Visualization

Jul 30, 2020
Jiazhi Xia, Tianxiang Chen, Lei Zhang, Wei Chen, Yang Chen, Xiaolong Zhang, Cong Xie, Tobias Schreck

Nowadays, as data becomes increasingly complex and distributed, data analyses often involve several related datasets that are stored on different servers and possibly owned by different stakeholders. While there is an emerging need to provide these stakeholders with a full picture of their data in a global context, conventional visual analytical methods, such as dimensionality reduction, can compromise data privacy when multi-party datasets are fused into a single site to build point-level relationships. In this paper, we reformulate the conventional t-SNE method from the single-site mode into a secure distributed infrastructure. We present a secure multi-party scheme for joint t-SNE computation, which minimizes the risk of data leakage. Aggregated visualization can optionally be employed to conceal point-level relationships. We build a prototype system based on our method, SMAP, to support the organization, computation, and exploration of secure joint embeddings. We demonstrate the effectiveness of our approach with three case studies, one of which is based on the deployment of our system in real-world applications.

* 12 pages, 10 figures. Conditionally accepted by VAST 2020 

CSER: Communication-efficient SGD with Error Reset

Jul 29, 2020
Cong Xie, Shuai Zheng, Oluwasanmi Koyejo, Indranil Gupta, Mu Li, Haibin Lin

The scalability of distributed Stochastic Gradient Descent (SGD) is today limited by communication bottlenecks. We propose a novel SGD variant: Communication-efficient SGD with Error Reset, or CSER. The first key idea in CSER is a new technique called "error reset" that adapts arbitrary compressors for SGD, producing bifurcated local models with periodic resets of the resulting local residual errors. Second, we introduce partial synchronization for both the gradients and the models, leveraging the advantages of both. We prove the convergence of CSER for smooth non-convex problems. Empirical results show that when combined with highly aggressive compressors, the CSER algorithms (i) cause no loss of accuracy and (ii) accelerate training by nearly $10\times$ on CIFAR-100 and by $4.5\times$ on ImageNet.
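The residual bookkeeping behind wrapping a compressor this way can be sketched for a single worker as follows. This is an illustrative sketch only: CSER's full algorithm also includes the partial synchronization of gradients and models across workers, which is omitted here, and the compressor is a stand-in top-1 sparsifier:

```python
import numpy as np

# Single-worker sketch: compress the update, keep what was dropped as
# a local residual, fold it into the next step, and periodically reset.

def compress(v, k=1):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]   # keep the k largest-magnitude entries
    out[idx] = v[idx]
    return out

def step(grad, residual, lr=0.1):
    v = grad + residual                # fold previously dropped error back in
    sent = compress(v)                 # what is actually communicated
    residual = v - sent                # error retained locally
    return -lr * sent, residual        # model delta, updated residual

residual = np.zeros(3)
delta, residual = step(np.array([1.0, 0.2, -0.1]), residual)
assert np.allclose(delta, [-0.1, 0.0, 0.0])
assert np.allclose(residual, [0.0, 0.2, -0.1])

# "Error reset": every H steps the accumulated residual is cleared.
residual[:] = 0.0
```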


Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates

Nov 20, 2019
Cong Xie, Oluwasanmi Koyejo, Indranil Gupta, Haibin Lin

Recent years have witnessed the growth of large-scale distributed machine learning algorithms, designed specifically to accelerate model training by distributing computation across multiple machines. When scaling distributed training in this way, the communication overhead is often the bottleneck. In this paper, we study the local distributed Stochastic Gradient Descent (SGD) algorithm, which reduces the communication overhead by decreasing the frequency of synchronization. While SGD with adaptive learning rates is a widely adopted strategy for training neural networks, it remains unknown how to implement adaptive learning rates in local SGD. To this end, we propose a novel SGD variant with reduced communication and adaptive learning rates, with provable convergence. Empirical results show that the proposed algorithm converges fast and efficiently reduces the communication overhead.
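The communication pattern of local SGD, which the paper builds on, can be sketched as follows. This shows plain local SGD with a fixed learning rate; Local AdaAlter's adaptive-learning-rate mechanism is not shown:

```python
import numpy as np

# Sketch of local SGD's communication pattern: each worker takes H
# local steps with no communication, then models are averaged, so
# synchronization happens once per H steps instead of every step.

def local_sgd(workers, grad_fn, lr=0.1, H=4, rounds=3):
    for _ in range(rounds):
        for w in workers:                  # H local steps, no communication
            for _ in range(H):
                w -= lr * grad_fn(w)
        avg = np.mean(workers, axis=0)     # one synchronization per round
        for w in workers:
            w[:] = avg
    return workers

grad = lambda w: 2 * w                     # gradient of ||w||^2
ws = [np.array([1.0, -1.0]), np.array([0.5, 2.0])]
ws = local_sgd(ws, grad)
assert np.allclose(ws[0], ws[1])           # workers agree after each sync
```

The open question the paper addresses is how to maintain the per-coordinate statistics of adaptive methods consistently across workers when synchronization is this infrequent.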
