Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jianfei Chen

COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

Oct 25, 2024

Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, Song Han

Figure 1 for COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

Figure 2 for COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

Figure 3 for COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

Figure 4 for COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training

Abstract:FP8 training has emerged as a promising method for improving training efficiency. Existing frameworks accelerate training by applying FP8 computation to linear layers while leaving optimizer states and activations in higher precision, which fails to fully optimize memory usage. This paper introduces COAT (Compressing Optimizer States and Activations for FP8 Training), a novel FP8 training framework designed to significantly reduce memory footprint when training large models. COAT addresses current limitations through two key innovations: (1) Dynamic Range Expansion, which aligns optimizer state distributions more closely with the FP8 representation range, thereby reducing quantization error, and (2) Mixed-Granularity Activation Quantization, which optimizes activation memory using a combination of per-tensor and per-group quantization strategies. Experiments demonstrate that COAT effectively reduces end-to-end training memory footprint by 1.54x compared to BF16 while achieving nearly lossless performance across various tasks, such as Large Language Model pretraining and fine-tuning and Vision Language Model training. COAT also achieves a 1.43x end-to-end training speedup compared to BF16, performing on par with or surpassing TransformerEngine's speedup. COAT enables efficient full-parameter training of large models on fewer GPUs, and facilitates doubling the batch size in distributed training settings, providing a practical solution for scaling large-scale model training. The code is available at https://github.com/NVlabs/COAT.

* 16 pages. 9 Figures. 8 Tables

Via

Access Paper or Ask Questions

Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs

Oct 21, 2024

Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen

Figure 1 for Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs

Figure 2 for Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs

Figure 3 for Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs

Figure 4 for Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs

Abstract:To date, 2:4 sparsity has stood as the only sparse pattern that can be accelerated using sparse tensor cores on GPUs. In practice, 2:4 sparsity often possesses low actual speedups ($\leq 1.3$) and requires fixed sparse ratios, meaning that other ratios, such as 4:8, 8:16, or those exceeding 50% sparsity, do not incur any speedups on GPUs. Recent studies suggest that V:N:M sparsity is promising in addressing these limitations of 2:4 sparsity. However, regarding accuracy, the effects of V:N:M sparsity on broader Transformer models, such as vision Transformers and large language models (LLMs), are largely unexamined. Moreover, Some specific issues related to V:N:M sparsity, such as how to select appropriate V and M values, remain unresolved. In this study, we thoroughly investigate the application of V:N:M sparsity in vision models and LLMs across multiple tasks, from pertaining to downstream tasks. We propose three key approaches to enhance the applicability and accuracy of V:N:M-sparse Transformers, including heuristic V and M selection, V:N:M-specific channel permutation, and three-staged LoRA training techniques. Experimental results show that, with our methods, the DeiT-small achieves lossless accuracy at 64:2:5 sparsity, while the DeiT-base maintains accuracy even at 64:2:8 sparsity. In addition, the fine-tuned LLama2-7B at 64:2:5 sparsity performs comparably or better than training-free 2:4 sparse alternatives on downstream tasks. More importantly, V:N:M-sparse Transformers offer a wider range of speedup-accuracy trade-offs compared to 2:4 sparsity. Overall, our exploration largely facilitates the V:N:M sparsity to act as a truly effective acceleration solution for Transformers in cost-sensitive inference scenarios.

Via

Access Paper or Ask Questions

FrameBridge: Improving Image-to-Video Generation with Bridge Models

Oct 20, 2024

Yuji Wang, Zehua Chen, Xiaoyu Chen, Jun Zhu, Jianfei Chen

Figure 1 for FrameBridge: Improving Image-to-Video Generation with Bridge Models

Figure 2 for FrameBridge: Improving Image-to-Video Generation with Bridge Models

Figure 3 for FrameBridge: Improving Image-to-Video Generation with Bridge Models

Figure 4 for FrameBridge: Improving Image-to-Video Generation with Bridge Models

Abstract:Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis. Recently, diffusion-based I2V models have achieved remarkable progress given their novel design on network architecture, cascaded framework, and motion representation. However, restricted by their noise-to-data generation process, diffusion-based methods inevitably suffer the difficulty to generate video samples with both appearance consistency and temporal coherence from an uninformative Gaussian noise, which may limit their synthesis quality. In this work, we present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them. By formulating I2V synthesis as a frames-to-frames generation task and modelling it with a data-to-data process, we fully exploit the information in input image and facilitate the generative model to learn the image animation process. In two popular settings of training I2V models, namely fine-tuning a pre-trained text-to-video (T2V) model or training from scratch, we further propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which improve the fine-tuning efficiency of diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models respectively. Experiments conducted on WebVid-2M and UCF-101 demonstrate that: (1) our FrameBridge achieves superior I2V quality in comparison with the diffusion counterpart (zero-shot FVD 83 vs. 176 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101); (2) our proposed SAF and neural prior effectively enhance the ability of bridge-based I2V models in the scenarios of fine-tuning and training from scratch. Demo samples can be visited at: https://framebridge-demo.github.io/.

Via

Access Paper or Ask Questions

On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent

Oct 07, 2024

Bingrui Li, Wei Huang, Andi Han, Zhanpeng Zhou, Taiji Suzuki, Jun Zhu, Jianfei Chen

Figure 1 for On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent

Figure 2 for On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent

Figure 3 for On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent

Figure 4 for On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent

Abstract:The Adam optimizer is widely used for transformer optimization in practice, which makes understanding the underlying optimization mechanisms an important problem. However, due to the Adam's complexity, theoretical analysis of how it optimizes transformers remains a challenging task. Fortunately, Sign Gradient Descent (SignGD) serves as an effective surrogate for Adam. Despite its simplicity, theoretical understanding of how SignGD optimizes transformers still lags behind. In this work, we study how SignGD optimizes a two-layer transformer -- consisting of a softmax attention layer with trainable query-key parameterization followed by a linear layer -- on a linearly separable noisy dataset. We identify four stages in the training dynamics, each exhibiting intriguing behaviors. Based on the training dynamics, we prove the fast convergence but poor generalization of the learned transformer on the noisy dataset. We also show that Adam behaves similarly to SignGD in terms of both optimization and generalization in this setting. Additionally, we find that the poor generalization of SignGD is not solely due to data noise, suggesting that both SignGD and Adam requires high-quality data for real-world tasks. Finally, experiments on synthetic and real-world datasets empirically support our theoretical results.

* preprint

Via

Access Paper or Ask Questions

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

Oct 03, 2024

Jintao Zhang, Jia wei, Pengle Zhang, Jun Zhu, Jianfei Chen

Figure 1 for SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

Figure 2 for SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

Figure 3 for SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

Figure 4 for SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

Abstract:The transformer architecture predominates across various models. As the heart of the transformer, attention has a computational complexity of O(N^2), compared to O(N) for linear transformations. When handling large sequence lengths, attention becomes the primary time-consuming component. Although quantization has proven to be an effective method for accelerating model inference, existing quantization methods primarily focus on optimizing the linear layer. In response, we first analyze the feasibility of quantization in attention detailedly. Following that, we propose SageAttention, a highly efficient and accurate quantization method for attention. The OPS (operations per second) of our approach outperforms FlashAttention2 and xformers by about 2.1 times and 2.7 times, respectively. SageAttention also achieves superior accuracy performance over FlashAttention3. Comprehensive experiments confirm that our approach incurs almost no end-to-end metrics loss across diverse models, including those for large language processing, image generation, and video generation.

Via

Access Paper or Ask Questions

S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

Sep 13, 2024

Yuezhou Hu, Jun Zhu, Jianfei Chen

Figure 1 for S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

Figure 2 for S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

Figure 3 for S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

Figure 4 for S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

Abstract:Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper GPUs can accelerate matrix multiplications twice as fast as a dense equivalent by implementing 2:4 sparsity. However, previous STE-based 2:4 pre-training methods (e.g. STE with hard-thresholding, SR-STE) suffer from optimization difficulties because of discontinuous pruning function. In this study, we comprehensively analyse the bottleneck of traditional N:M sparse training and recognize three drawbacks with discontinuity: incorrect descending direction, inability to predict the amount of descent and sparse mask oscillation. In the light of this statement, we propose S-STE, a simple yet powerful 2:4 training method that contains two parts: to continuously project weights to be 2:4 sparse, and to rescale sparse weights with a per-tensor fixed scaling factor. Besides, we adopt minimum-variance unbiased estimation for activation gradient and FP8 quantization for whole process. Results show that our method surpass previous 2:4 pre-training recipes and is comparable even with full parameter models.

Via

Access Paper or Ask Questions

1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit

Aug 26, 2024

Chang Gao, Jianfei Chen, Kang Zhao, Jiaqi Wang, Liping Jing

Figure 1 for 1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit

Figure 2 for 1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit

Figure 3 for 1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit

Figure 4 for 1-Bit FQT: Pushing the Limit of Fully Quantized Training to 1-bit

Abstract:Fully quantized training (FQT) accelerates the training of deep neural networks by quantizing the activations, weights, and gradients into lower precision. To explore the ultimate limit of FQT (the lowest achievable precision), we make a first attempt to 1-bit FQT. We provide a theoretical analysis of FQT based on Adam and SGD, revealing that the gradient variance influences the convergence of FQT. Building on these theoretical results, we introduce an Activation Gradient Pruning (AGP) strategy. The strategy leverages the heterogeneity of gradients by pruning less informative gradients and enhancing the numerical precision of remaining gradients to mitigate gradient variance. Additionally, we propose Sample Channel joint Quantization (SCQ), which utilizes different quantization strategies in the computation of weight gradients and activation gradients to ensure that the method is friendly to low-bitwidth hardware. Finally, we present a framework to deploy our algorithm. For fine-tuning VGGNet-16 and ResNet-18 on multiple datasets, our algorithm achieves an average accuracy improvement of approximately 6%, compared to per-sample quantization. Moreover, our training speedup can reach a maximum of 5.13x compared to full precision training.

Via

Access Paper or Ask Questions

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

Jul 30, 2024

Weiyu Huang, Guohao Jian, Yuezhou Hu, Jun Zhu, Jianfei Chen

Abstract:Transformer-based Large Language Models (LLMs) have demonstrated remarkable success across various challenging tasks. However, the deployment of LLMs is hindered by their substantial parameter count and memory consumption. Recently, numerous studies have attempted to compress LLMs by pruning them using training-free methods. However, these pruned models often experience significant performance degradation on complex tasks. To address this issue, we propose a novel training pipeline for semi-structured sparse models, named Adaptive Sparse Trainer (AST). By distilling the knowledge stored in its dense counterpart, we prevent the sparse model from overfitting and ensure a stable training process. Moreover, AST allows the model to adaptively select better lottery tickets (e.g., masks) during training. Additionally, we discovered that adding extra well-initialized parameters can further enhance model performance with only a small increase in memory footprint. Our method significantly narrows the performance gap between dense and sparse models while maintaining limited computational cost. Furthermore, when combined with existing quantization methods, AST can compress language models by up to 16x compared to dense FP32 precision models with minimal performance loss. AST outperforms previous state-of-the-art methods by reducing the zero-shot accuracy gap between dense and semi-structured sparse models to 1.12% across multiple zero-shot tasks on Llama2-7B, using less than 0.4% of the pretraining tokens.

Via

Access Paper or Ask Questions

Diffusion Bridge Implicit Models

May 24, 2024

Kaiwen Zheng, Guande He, Jianfei Chen, Fan Bao, Jun Zhu

Figure 1 for Diffusion Bridge Implicit Models

Figure 2 for Diffusion Bridge Implicit Models

Figure 3 for Diffusion Bridge Implicit Models

Figure 4 for Diffusion Bridge Implicit Models

Abstract:Denoising diffusion bridge models (DDBMs) are a powerful variant of diffusion models for interpolating between two arbitrary paired distributions given as endpoints. Despite their promising performance in tasks like image translation, DDBMs require a computationally intensive sampling process that involves the simulation of a (stochastic) differential equation through hundreds of network evaluations. In this work, we present diffusion bridge implicit models (DBIMs) for accelerated sampling of diffusion bridges without extra training. We generalize DDBMs via a class of non-Markovian diffusion bridges defined on the discretized timesteps concerning sampling, which share the same training objective as DDBMs. These generalized diffusion bridges give rise to generative processes ranging from stochastic to deterministic (i.e., an implicit probabilistic model) while being up to 25$\times$ faster than the vanilla sampler of DDBMs. Moreover, the deterministic sampling procedure yielded by DBIMs enables faithful encoding and reconstruction by a booting noise used in the initial sampling step, and allows us to perform semantically meaningful interpolation in image translation tasks by regarding the booting noise as the latent variable.

Via

Access Paper or Ask Questions

SparseDM: Toward Sparse Efficient Diffusion Models

Apr 16, 2024

Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, Jun Zhu

Abstract:Diffusion models have been extensively used in data generation tasks and are recognized as one of the best generative models. However, their time-consuming deployment, long inference time, and requirements on large memory limit their application on mobile devices. In this paper, we propose a method based on the improved Straight-Through Estimator to improve the deployment efficiency of diffusion models. Specifically, we add sparse masks to the Convolution and Linear layers in a pre-trained diffusion model, then use design progressive sparsity for model training in the fine-tuning stage, and switch the inference mask on and off, which supports a flexible choice of sparsity during inference according to the FID and MACs requirements. Experiments on four datasets conducted on a state-of-the-art Transformer-based diffusion model demonstrate that our method reduces MACs by $50\%$ while increasing FID by only 1.5 on average. Under other MACs conditions, the FID is also lower than 1$\sim$137 compared to other methods.

Via

Access Paper or Ask Questions