Guanhua Wang

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

Jun 16, 2023
Guanhua Wang, Heyang Qin, Sam Ade Jacobs, Connor Holmes, Samyam Rajbhandari, Olatunji Ruwase, Feng Yan, Lei Yang, Yuxiong He

Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large language models on massive GPU clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, or at a scale that forces the per-GPU batch size to be small, ZeRO's effective throughput is limited by the high communication volume of gathering weights in the forward and backward passes and of averaging gradients. This paper introduces three communication volume reduction techniques, which we collectively refer to as ZeRO++, targeting each of the communication collectives in ZeRO. The first is block-quantization-based all-gather. The second is data remapping that trades off communication for more memory. The third is a novel all-to-all-based quantized gradient averaging paradigm that replaces the reduce-scatter collective and preserves accuracy despite communicating low-precision data. Collectively, ZeRO++ reduces the communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at 384-GPU scale.
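
As a rough illustration of the block-quantization idea mentioned in the abstract, the sketch below quantizes a list of floats to signed 8-bit codes with one scale per block. It is a minimal pure-Python sketch, not the ZeRO++ implementation: the function names, the block size, the 8-bit width, and the symmetric scaling rule are all assumptions made for the example.

```python
# Illustrative sketch only: symmetric block quantization in pure Python.
# Block size, bit width, and function names are assumptions, not ZeRO++ code.

def quantize_blocks(values, block_size=4, bits=8):
    """Quantize floats block by block, storing one scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit signed codes
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # Avoid division by zero for an all-zero block.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes = [round(v / scale) for v in block]
        blocks.append((scale, codes))
    return blocks


def dequantize_blocks(blocks):
    """Recover approximate floats from (scale, codes) pairs."""
    out = []
    for scale, codes in blocks:
        out.extend(scale * c for c in codes)
    return out


# Small and large magnitudes live in different blocks here.
weights = [0.01, -0.02, 0.015, 0.005, 3.0, -2.5, 1.0, 0.5]

per_block = dequantize_blocks(quantize_blocks(weights, block_size=4))
single_scale = dequantize_blocks(quantize_blocks(weights, block_size=8))

max_err = max(abs(a - b) for a, b in zip(weights, per_block))
print("max error with per-block scales:", max_err)
print("first weight with one shared scale:", single_scale[0])
```

With one scale per block, the small-magnitude block keeps its resolution; with a single scale shared across all eight values, the smallest entries round to zero, which is the motivation for quantizing in blocks.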

* 12 pages 

Adaptive Sampling for Linear Sensing Systems via Langevin Dynamics

Feb 27, 2023
Guanhua Wang, Douglas C. Noll, Jeffrey A. Fessler

Adaptive or dynamic signal sampling in sensing systems can adapt subsequent sampling strategies based on acquired signals, thereby potentially improving image quality and speed. This paper proposes a Bayesian method for adaptive sampling based on greedy variance reduction and stochastic gradient Langevin dynamics (SGLD). The image priors involved can be either analytical or neural network-based. Notably, the learned image priors generalize well to out-of-distribution test cases that have different statistics than the training dataset. As a real-world validation, the method is applied to accelerate the acquisition of magnetic resonance imaging (MRI). Compared to non-adaptive sampling, the proposed method effectively improved the image quality by 2-3 dB in PSNR, and improved the restoration of subtle details.
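
To make the Langevin-dynamics ingredient more concrete, here is a minimal sketch of a Langevin sampler on a toy problem. It is not the paper's method: it uses the exact gradient of a 1D Gaussian log-density (so it is plain unadjusted Langevin dynamics rather than SGLD with stochastic gradients), and the target, step size, and function names are assumptions chosen for illustration. The sample variance it estimates is the kind of uncertainty measure a greedy variance-reduction rule could act on.

```python
import math
import random

# Illustrative sketch only: Langevin sampling of a toy 1D Gaussian "posterior".
# The target (MU, SIGMA), step size, and names are assumptions for this example.

def langevin_samples(grad_log_p, x0, step, n_steps, rng):
    """Iterate x <- x + (step/2) * grad log p(x) + sqrt(step) * N(0, 1)."""
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step * grad_log_p(x) + math.sqrt(step) * noise
        samples.append(x)
    return samples


MU, SIGMA = 2.0, 1.0


def grad_log_gaussian(x):
    """Gradient of log N(x; MU, SIGMA^2) with respect to x."""
    return -(x - MU) / SIGMA ** 2


rng = random.Random(0)
samples = langevin_samples(grad_log_gaussian, x0=0.0, step=0.1,
                           n_steps=20000, rng=rng)
kept = samples[1000:]  # discard the initial burn-in portion

mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print("posterior mean estimate:", mean)
print("posterior variance estimate:", var)
```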

* 5 pages, 4 figures 

Stochastic Optimization of 3D Non-Cartesian Sampling Trajectory (SNOPY)

Sep 22, 2022
Guanhua Wang, Jon-Fredrik Nielsen, Jeffrey A. Fessler, Douglas C. Noll

Optimizing 3D k-space sampling trajectories for efficient MRI is important yet challenging. This work proposes a generalized framework for optimizing 3D non-Cartesian sampling patterns via data-driven optimization. We built a differentiable MRI system model to enable gradient-based methods for sampling trajectory optimization. By combining training losses, the algorithm can simultaneously optimize multiple properties of sampling patterns, including image quality, hardware constraints (maximum slew rate and gradient strength), reduced peripheral nerve stimulation (PNS), and parameter-weighted contrast. The proposed method can either optimize the gradient waveform (spline-based freeform optimization) or optimize properties of given sampling trajectories (such as the rotation angle of radial trajectories). Notably, the method optimizes sampling trajectories synergistically with either model-based or learning-based reconstruction methods. We proposed several strategies to alleviate the severe non-convexity and heavy computational demand posed by the high-dimensional optimization. The corresponding code is organized as an open-source, easy-to-use toolbox. We applied the optimized trajectories to multiple applications, including structural and functional imaging. In the simulation studies, the reconstruction PSNR of a 3D kooshball trajectory was increased by 4 dB with SNOPY optimization. In the prospective studies, by optimizing the rotation angles of a stack-of-stars (SOS) trajectory, SNOPY improved the PSNR by 1.4 dB compared to the best empirical method. Optimizing the gradient waveform of a rotational EPI trajectory improved subjects' rating of the PNS effect from 'strong' to 'mild.' In short, SNOPY provides an efficient data-driven and optimization-based method to tailor non-Cartesian sampling trajectories.

* 13 pages, 8 figures 

Efficient approximation of Jacobian matrices involving a non-uniform fast Fourier transform (NUFFT)

Nov 04, 2021
Guanhua Wang, Jeffrey A. Fessler

There is growing interest in learning k-space sampling patterns for MRI using optimization approaches. For non-Cartesian sampling patterns, reconstruction methods typically involve non-uniform FFT (NUFFT) operations. A typical NUFFT method contains frequency domain interpolation using Kaiser-Bessel kernel values that are retrieved by nearest neighbor look-up in a finely tabulated kernel. That look-up operation is not differentiable with respect to the sampling pattern, complicating auto-differentiation routines for backpropagation (stochastic gradient descent) for sampling pattern optimization. This paper describes an efficient and accurate approach for computing approximate gradients with respect to the sampling pattern for learning k-space sampling. Various numerical experiments validate the accuracy of the proposed approximation. We also showcase the trajectories optimized for different iterative reconstruction algorithms, including smooth convex regularized reconstruction and compressed sensing-based reconstruction.
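
The non-differentiability described in the abstract can be seen in a toy example. The sketch below tabulates a kernel finely and retrieves values by nearest-neighbor look-up, then compares a finite-difference derivative of the look-up with that of the underlying smooth kernel. It only illustrates the problem, not the paper's approximation; the Gaussian kernel is a stand-in for Kaiser-Bessel, and the table size, kernel width, and names are assumptions.

```python
import math

# Illustrative sketch only: why nearest-neighbor table look-up has no useful
# derivative. A Gaussian is used as a stand-in for the Kaiser-Bessel kernel.

TABLE_SIZE = 1000
HALF_WIDTH = 2.0


def kernel(u):
    """Smooth stand-in interpolation kernel (assumption, not Kaiser-Bessel)."""
    return math.exp(-u * u)


# Finely tabulated kernel values on [-HALF_WIDTH, HALF_WIDTH].
TABLE = [kernel(-HALF_WIDTH + 2 * HALF_WIDTH * i / (TABLE_SIZE - 1))
         for i in range(TABLE_SIZE)]


def lookup(u):
    """Retrieve the kernel value by nearest-neighbor look-up in TABLE."""
    idx = round((u + HALF_WIDTH) / (2 * HALF_WIDTH) * (TABLE_SIZE - 1))
    idx = min(max(idx, 0), TABLE_SIZE - 1)
    return TABLE[idx]


def finite_diff(f, u, h=1e-6):
    """Central finite-difference estimate of df/du."""
    return (f(u + h) - f(u - h)) / (2 * h)


u0 = 0.5003  # a sampling location that does not sit on a table boundary
d_lookup = finite_diff(lookup, u0)
d_true = finite_diff(kernel, u0)
print("derivative through the look-up:", d_lookup)
print("derivative of the smooth kernel:", d_true)
```

Because the look-up is piecewise constant in the sampling location, its derivative is zero almost everywhere, so automatic differentiation through it carries no information about how to move the sampling pattern.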

* 9 pages, 4 figures 

Blind Primed Supervised (BLIPS) Learning for MR Image Reconstruction

Apr 11, 2021
Anish Lahiri, Guanhua Wang, Saiprasad Ravishankar, Jeffrey A. Fessler

This paper examines a combined supervised-unsupervised framework involving dictionary-based blind learning and deep supervised learning for MR image reconstruction from under-sampled k-space data. A major focus of the work is to investigate the possible synergy of learned features in traditional shallow reconstruction using adaptive sparsity-based priors and deep prior-based reconstruction. Specifically, we propose a framework that uses an unrolled network to refine a blind dictionary learning-based reconstruction. We compare the proposed method with strictly supervised deep learning-based reconstruction approaches on several datasets of varying sizes and anatomies. We also compare the proposed method to alternative approaches for combining dictionary-based methods with supervised learning in MR image reconstruction. The improvements yielded by the proposed framework suggest that the blind dictionary-based approach preserves fine image details that the supervised approach can iteratively refine, suggesting that the features learned using the two methods are complementary.

B-spline Parameterized Joint Optimization of Reconstruction and K-space Trajectories (BJORK) for Accelerated 2D MRI

Jan 27, 2021
Guanhua Wang, Tianrui Luo, Jon-Fredrik Nielsen, Douglas C. Noll, Jeffrey A. Fessler

Optimizing k-space sampling trajectories is a challenging topic for fast magnetic resonance imaging (MRI). This work proposes to optimize a reconstruction algorithm and sampling trajectories jointly with respect to image reconstruction quality. We parameterize trajectories with quadratic B-spline kernels to reduce the number of parameters and enable multi-scale optimization, which may help to avoid sub-optimal local minima. The algorithm includes an efficient non-Cartesian unrolled neural network-based reconstruction and an accurate approximation for backpropagation through the non-uniform fast Fourier transform (NUFFT) operator, so that multi-coil non-Cartesian data can be reconstructed and back-propagated accurately. Penalties on slew rate and gradient amplitude enforce hardware constraints. Sampling and reconstruction are trained jointly using large public datasets. To correct the potential eddy-current effect introduced by the curved trajectory, we use a pencil-beam trajectory mapping technique. In both simulations and in-vivo experiments, the learned trajectory demonstrates significantly improved image quality compared to previous model-based and learning-based trajectory optimization methods at a 20x acceleration factor. Though trained with neural network-based reconstruction, the proposed trajectory also leads to improved image quality with compressed sensing-based reconstruction.
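
The parameter reduction from a B-spline parameterization can be sketched in a few lines. The example below evaluates a uniform quadratic B-spline from a handful of control coefficients, so that a densely sampled 1D waveform is described by far fewer numbers than it has samples. It is a minimal sketch under assumptions: the function name, the coefficient values, and the uniform-knot layout are invented for illustration and are not taken from the BJORK code.

```python
# Illustrative sketch only: a dense 1D waveform generated from a few
# quadratic B-spline coefficients. Names and values are assumptions.

def quadratic_bspline_curve(coeffs, samples_per_segment):
    """Evaluate a uniform quadratic B-spline defined by control coefficients."""
    curve = []
    for i in range(len(coeffs) - 2):
        c0, c1, c2 = coeffs[i], coeffs[i + 1], coeffs[i + 2]
        for k in range(samples_per_segment):
            t = k / samples_per_segment  # local parameter in [0, 1)
            # Uniform quadratic B-spline basis weights; they sum to 1.
            b0 = 0.5 * (1 - t) ** 2
            b1 = 0.5 * (-2 * t * t + 2 * t + 1)
            b2 = 0.5 * t * t
            curve.append(b0 * c0 + b1 * c1 + b2 * c2)
    return curve


coeffs = [0.0, 1.0, 0.5, -0.5, 0.0]            # 5 free parameters
fine = quadratic_bspline_curve(coeffs, 100)     # 300 waveform samples
flat = quadratic_bspline_curve([0.7] * 5, 10)   # constant coefficients

print("parameters:", len(coeffs), "samples:", len(fine))
print("first sample:", fine[0])
```

Optimizing the five coefficients instead of the 300 samples keeps the waveform smooth by construction, and changing the number of coefficients gives a natural coarse-to-fine schedule.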

* 15 pages, 13 figures 

Failout: Achieving Failure-Resilient Inference in Distributed Neural Networks

Feb 18, 2020
Ashkan Yousefpour, Brian Q. Nguyen, Siddartha Devic, Guanhua Wang, Aboudy Kreidieh, Hans Lobel, Alexandre M. Bayen, Jason P. Jue

When a neural network is partitioned and distributed across physical nodes, failure of physical nodes causes the failure of the neural units that are placed on those nodes, which results in a significant performance drop. Current approaches focus on resiliency of training in distributed neural networks. However, resiliency of inference in distributed neural networks is less explored. We introduce ResiliNet, a scheme for making inference in distributed neural networks resilient to physical node failures. ResiliNet combines two concepts to provide resiliency: skip connection in residual neural networks, and a novel technique called failout, which is introduced in this paper. Failout simulates physical node failure conditions during training using dropout, and is specifically designed to improve the resiliency of distributed neural networks. The results of the experiments and ablation studies using three datasets confirm the ability of ResiliNet to provide inference resiliency for distributed neural networks.
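
One way to picture failout as described above is dropout applied at the granularity of physical nodes: when a simulated node fails, every unit placed on it is silenced together. The sketch below is an assumption-laden illustration, not ResiliNet's implementation; the function name, the unit-to-node mapping, and the failure probability are hypothetical, and details such as any rescaling of surviving activations are not specified in the abstract and are omitted.

```python
import random

# Illustrative sketch only: node-level dropout in the spirit of failout.
# The mapping, probability, and function name are hypothetical.

def failout(activations, unit_to_node, fail_prob, rng):
    """Zero all units hosted on nodes that 'fail' in this training step."""
    nodes = sorted(set(unit_to_node))
    failed = {n for n in nodes if rng.random() < fail_prob}
    masked = [0.0 if unit_to_node[i] in failed else a
              for i, a in enumerate(activations)]
    return masked, failed


activations = [0.3, 1.2, -0.7, 0.9, 0.1, 2.0]
unit_to_node = [0, 0, 1, 1, 2, 2]  # two units placed on each of three nodes

rng = random.Random(0)
masked, failed = failout(activations, unit_to_node, fail_prob=0.3, rng=rng)
print("failed nodes:", sorted(failed))
print("masked activations:", masked)
```

Unlike per-unit dropout, units on the same node are never dropped independently, which matches the failure pattern the network would face at inference time.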

* 10 pages 

Blink: Fast and Generic Collectives for Distributed ML

Oct 11, 2019
Guanhua Wang, Shivaram Venkataraman, Amar Phanishayee, Jorgen Thelin, Nikhil Devanur, Ion Stoica

Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever increasing hardware heterogeneity. To address this, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for faster data transfers. Evaluations show that compared to the state-of-the-art (NCCL), Blink can achieve up to 8x faster model synchronization, and reduce end-to-end training time for image classification tasks by up to 40%.

Gathering Cyber Threat Intelligence from Twitter Using Novelty Classification

Jul 03, 2019
Ba Dung Le, Guanhua Wang, Mehwish Nasim, Ali Babar

Protecting organizations from cyber exploits requires timely intelligence about cyber vulnerabilities and attacks, referred to as threats. Cyber threat intelligence can be extracted from various sources, including social media platforms where users publish threat information in real time. Gathering cyber threat intelligence from social media sites is a time-consuming task for security analysts, which can delay the response to emerging cyber threats. We propose a framework for automatically gathering cyber threat intelligence from Twitter by using a novelty detection model. Our model learns the features of cyber threat intelligence from the threat descriptions published in public repositories such as Common Vulnerabilities and Exposures (CVE) and classifies a new, unseen tweet as either normal or anomalous with respect to cyber threat intelligence. We evaluate our framework using a purpose-built data set of tweets from 50 influential cyber security-related accounts over twelve months (in 2018). Our classifier achieves an F1-score of 0.643 for classifying cyber threat tweets and outperforms several baselines, including binary classification models. Our analysis of the classification results suggests that cyber threat-relevant tweets on Twitter often do not include the CVE identifier of the related threats. Hence, it would be valuable to collect these tweets and associate them with the related CVE identifier for cyber security applications.

* Accepted by the 2019 International Conference on Cyberworlds (CW2019)