Paul C. Lauterbur Research Center for Biomedical Imaging, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China




Abstract:In this paper, we primarily focus on understanding the data preprocessing pipeline for DNN Training in the public cloud. First, we run experiments to test the performance implications of the two major data preprocessing methods using either raw data or record files. The preliminary results show that data preprocessing is a clear bottleneck, even with the most efficient software and hardware configuration enabled by NVIDIA DALI, a high-optimized data preprocessing library. Second, we identify the potential causes, exercise a variety of optimization methods, and present their pros and cons. We hope this work will shed light on the new co-design of ``data storage, loading pipeline'' and ``training framework'' and flexible resource configurations between them so that the resources can be fully exploited and performance can be maximized.




Abstract:Deep learning-based methods have achieved encouraging performances in the field of magnetic resonance (MR) image reconstruction. Nevertheless, to properly learn a powerful and robust model, these methods generally require large quantities of data, the collection of which from multiple centers may cause ethical and data privacy violation issues. Lately, federated learning has served as a promising solution to exploit multi-center data while getting rid of the data transfer between institutions. However, high heterogeneity exists in the data from different centers, and existing federated learning methods tend to use average aggregation methods to combine the client's information, which limits the performance and generalization capability of the trained models. In this paper, we propose a Model-based Federated learning framework (ModFed). ModFed has three major contributions: 1) Different from the existing data-driven federated learning methods, model-driven neural networks are designed to relieve each client's dependency on large data; 2) An adaptive dynamic aggregation scheme is proposed to address the data heterogeneity issue and improve the generalization capability and robustness the trained neural network models; 3) A spatial Laplacian attention mechanism and a personalized client-side loss regularization are introduced to capture the detailed information for accurate image reconstruction. ModFed is evaluated on three in-vivo datasets. Experimental results show that ModFed has strong capability in improving image reconstruction quality and enforcing model generalization capability when compared to the other five state-of-the-art federated learning approaches. Codes will be made available at https://github.com/ternencewu123/ModFed.
Abstract:The ability to incrementally learn new classes from limited samples is crucial to the development of artificial intelligence systems for real clinical application. Although existing incremental learning techniques have attempted to address this issue, they still struggle with only few labeled data, particularly when the samples are from varied domains. In this paper, we explore the cross-domain few-shot incremental learning (CDFSCIL) problem. CDFSCIL requires models to learn new classes from very few labeled samples incrementally, and the new classes may be vastly different from the target space. To counteract this difficulty, we propose a cross-domain enhancement constraint and cross-domain data augmentation method. Experiments on MedMNIST show that the classification performance of this method is better than other similar incremental learning methods.




Abstract:Post-training quantization (\ptq) had been recently shown as a compromising method to reduce memory consumption and/or compute cost for large language models. However, a comprehensive study about the effect of different quantization schemes, different model families, different \ptq methods, different quantization bit precision, etc, is still missing. In this work, we provide an extensive study of those components over tens of thousands of zero-shot experiments. Our results show that (1) Fine-grained quantization and \ptq methods (instead of naive round-to-nearest quantization) are necessary to achieve good accuracy and (2) Higher bits (e.g., 5 bits) with coarse-grained quantization is more powerful than lower bits (e.g., 4 bits) with very fine-grained quantization (whose effective bit precision is similar to 5 bits). We also present recommendations about how to utilize quantization for \llms with different sizes, and leave suggestions of future opportunities and system work that are not resolved in this work.
Abstract:Multi-modal representation methods have achieved advanced performance in medical applications by extracting more robust features from multi-domain data. However, existing methods usually need to train additional branches for downstream tasks, which may increase the model complexities in clinical applications as well as introduce additional human inductive bias. Besides, very few studies exploit the rich clinical knowledge embedded in clinical daily reports. To this end, we propose a novel medical generalist agent, MGA, that can address three kinds of common clinical tasks via clinical reports knowledge transformation. Unlike the existing methods, MGA can easily adapt to different tasks without specific downstream branches when their corresponding annotations are missing. More importantly, we are the first attempt to use medical professional language guidance as a transmission medium to guide the agent's behavior. The proposed method is implemented on four well-known X-ray open-source datasets, MIMIC-CXR, CheXpert, MIMIC-CXR-JPG, and MIMIC-CXR-MS. Promising results are obtained, which validate the effectiveness of our proposed MGA. Code is available at: https://github.com/SZUHvern/MGA




Abstract:Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost. While INT8 quantization has recently been shown to be effective in reducing both the memory cost and latency while preserving model accuracy, it remains unclear whether we can leverage INT4 (which doubles peak hardware throughput) to achieve further latency improvement. In this work, we fully investigate the feasibility of using INT4 quantization for language models, and show that using INT4 introduces no or negligible accuracy degradation for encoder-only and encoder-decoder models, but causes a significant accuracy drop for decoder-only models. To materialize the performance gain using INT4, we develop a highly-optimized end-to-end INT4 encoder inference pipeline supporting different quantization strategies. Our INT4 pipeline is $8.5\times$ faster for latency-oriented scenarios and up to $3\times$ for throughput-oriented scenarios compared to the inference of FP16, and improves the SOTA BERT INT8 performance from FasterTransformer by up to $1.7\times$. We also provide insights into the failure cases when applying INT4 to decoder-only models, and further explore the compatibility of INT4 quantization with other compression techniques, like pruning and layer reduction.




Abstract:With the growing model size, deep neural networks (DNN) are increasingly trained over massive GPU accelerators, which demands a proper parallelization plan that transforms a DNN model into fine-grained tasks and then schedules them to GPUs for execution. Due to the large search space, the contemporary parallelization plan generators often rely on empirical rules that couple transformation and scheduling, and fall short in exploring more flexible schedules that yield better memory usage and compute efficiency. This tension can be exacerbated by the emerging models with increasing complexity in their structure and model size. SuperScaler is a system that facilitates the design and generation of highly flexible parallelization plans. It formulates the plan design and generation into three sequential phases explicitly: model transformation, space-time scheduling, and data dependency preserving. Such a principled approach decouples multiple seemingly intertwined factors and enables the composition of highly flexible parallelization plans. As a result, SuperScaler can not only generate empirical parallelization plans, but also construct new plans that achieve up to 3.5X speedup compared to state-of-the-art solutions like DeepSpeed, Megatron and Alpa, for emerging DNN models like Swin-Transformer and AlphaFold2, as well as well-optimized models like GPT-3.
Abstract:Magnetic Resonance Imaging (MRI) has become an important technique in the clinic for the visualization, detection, and diagnosis of various diseases. However, one bottleneck limitation of MRI is the relatively slow data acquisition process. Fast MRI based on k-space undersampling and high-quality image reconstruction has been widely utilized, and many deep learning-based methods have been developed in recent years. Although promising results have been achieved, most existing methods require fully-sampled reference data for training the deep learning models. Unfortunately, fully-sampled MRI data are difficult if not impossible to obtain in real-world applications. To address this issue, we propose a data refinement framework for self-supervised MR image reconstruction. Specifically, we first analyze the reason of the performance gap between self-supervised and supervised methods and identify that the bias in the training datasets between the two is one major factor. Then, we design an effective self-supervised training data refinement method to reduce this data bias. With the data refinement, an enhanced self-supervised MR image reconstruction framework is developed to prompt accurate MR imaging. We evaluate our method on an in-vivo MRI dataset. Experimental results show that without utilizing any fully sampled MRI data, our self-supervised framework possesses strong capabilities in capturing image details and structures at high acceleration factors.




Abstract:Large-scale transformer models have become the de-facto architectures for various machine learning applications, e.g., CV and NLP. However, those large models also introduce prohibitive training costs. To mitigate this issue, we propose a novel random and layerwise token dropping method (random-LTD), which skips the computation of a subset of the input tokens at all middle layers. Particularly, random-LTD achieves considerable speedups and comparable accuracy as the standard training baseline. Compared to other token dropping methods, random-LTD does not require (1) any importance score-based metrics, (2) any special token treatment (e.g., [CLS]), and (3) many layers in full sequence length training except the first and the last layers. Besides, a new LayerToken learning rate schedule is proposed for pretraining problems that resolve the heavy tuning requirement for our proposed training mechanism. Finally, we demonstrate that random-LTD can be applied to broader applications, including GPT and BERT pretraining as well as ViT and GPT finetuning tasks. Our results show that random-LTD can save about 33.3% theoretical compute cost and 25.6% wall-clock training time while achieving similar zero-shot evaluations on GPT-31.3B as compared to baseline.



Abstract:Volumetric magnetic resonance (MR) image segmentation plays an important role in many clinical applications. Deep learning (DL) has recently achieved state-of-the-art or even human-level performance on various image segmentation tasks. Nevertheless, manually annotating volumetric MR images for DL model training is labor-exhaustive and time-consuming. In this work, we aim to train a semi-supervised and self-supervised collaborative learning framework for prostate 3D MR image segmentation while using extremely sparse annotations, for which the ground truth annotations are provided for just the central slice of each volumetric MR image. Specifically, semi-supervised learning and self-supervised learning methods are used to generate two independent sets of pseudo labels. These pseudo labels are then fused by Boolean operation to extract a more confident pseudo label set. The images with either manual or network self-generated labels are then employed to train a segmentation model for target volume extraction. Experimental results on a publicly available prostate MR image dataset demonstrate that, while requiring significantly less annotation effort, our framework generates very encouraging segmentation results. The proposed framework is very useful in clinical applications when training data with dense annotations are difficult to obtain.