Alert button
Picture for Liang Luo

Liang Luo

Alert button

Amy

Pre-train and Search: Efficient Embedding Table Sharding with Pre-trained Neural Cost Models

May 03, 2023
Daochen Zha, Louis Feng, Liang Luo, Bhargav Bhushanam, Zirui Liu, Yusuo Hu, Jade Nie, Yuzhen Huang, Yuandong Tian, Arun Kejariwal, Xia Hu

Figure 1 for Pre-train and Search: Efficient Embedding Table Sharding with Pre-trained Neural Cost Models
Figure 2 for Pre-train and Search: Efficient Embedding Table Sharding with Pre-trained Neural Cost Models
Figure 3 for Pre-train and Search: Efficient Embedding Table Sharding with Pre-trained Neural Cost Models
Figure 4 for Pre-train and Search: Efficient Embedding Table Sharding with Pre-trained Neural Cost Models

Sharding a large machine learning model across multiple devices to balance the costs is important in distributed training. This is challenging because partitioning is NP-hard, and estimating the costs accurately and efficiently is difficult. In this work, we explore a "pre-train, and search" paradigm for efficient sharding. The idea is to pre-train a universal and once-for-all neural network to predict the costs of all the possible shards, which serves as an efficient sharding simulator. Built upon this pre-trained cost model, we then perform an online search to identify the best sharding plans given any specific sharding task. We instantiate this idea in deep learning recommendation models (DLRMs) and propose NeuroShard for embedding table sharding. NeuroShard pre-trains neural cost models on augmented tables to cover various sharding scenarios. Then it identifies the best column-wise and table-wise sharding plans with beam search and greedy grid search, respectively. Experiments show that NeuroShard significantly and consistently outperforms the state-of-the-art on the benchmark sharding dataset, achieving up to 23.8% improvement. When deployed in an ultra-large production DLRM with multi-terabyte embedding tables, NeuroShard achieves 11.6% improvement in embedding costs over the state-of-the-art, which translates to 6.6% end-to-end training throughput improvement. To facilitate future research of the "pre-train, and search" paradigm in ML for Systems, we open-source our code at https://github.com/daochenzha/neuroshard

* Accepted by MLSys 2023. Code available at https://github.com/daochenzha/neuroshard 
Viaarxiv icon

Self-discipline on multiple channels

Apr 27, 2023
Jiutian Zhao, Liang Luo, Hao Wang

Figure 1 for Self-discipline on multiple channels
Figure 2 for Self-discipline on multiple channels
Figure 3 for Self-discipline on multiple channels
Figure 4 for Self-discipline on multiple channels

Self-distillation relies on its own information to improve the generalization ability of the model and has a bright future. Existing self-distillation methods either require additional models, model modification, or batch size expansion for training, which increases the difficulty of use, memory consumption, and computational cost. This paper developed Self-discipline on multiple channels(SMC), which combines consistency regularization with self-distillation using the concept of multiple channels. Conceptually, SMC consists of two steps: 1) each channel data is simultaneously passed through the model to obtain its corresponding soft label, and 2) the soft label saved in the previous step is read together with the soft label obtained from the current channel data through the model to calculate the loss function. SMC uses consistent regularization and self-distillation to improve the generalization ability of the model and the robustness of the model to noisy labels. We named the SMC containing only two channels as SMC-2. Comparative experimental results on both datasets show that SMC-2 outperforms Label Smoothing Regularizaion and Self-distillation From The Last Mini-batch on all models, and outperforms the state-of-the-art Sharpness-Aware Minimization method on 83% of the models.Compatibility of SMC-2 and data augmentation experimental results show that using both SMC-2 and data augmentation improves the generalization ability of the model between 0.28% and 1.80% compared to using only data augmentation. Ultimately, the results of the label noise interference experiments show that SMC-2 curbs the tendency that the model's generalization ability decreases in the late training period due to the interference of label noise. The code is available at https://github.com/JiuTiannn/SMC-Self-discipline-on-multiple-channels.

Viaarxiv icon

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

Apr 21, 2023
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Shen Li

Figure 1 for PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Figure 2 for PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Figure 3 for PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Figure 4 for PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.

Viaarxiv icon

DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction

Mar 11, 2022
Buyun Zhang, Liang Luo, Xi Liu, Jay Li, Zeliang Chen, Weilin Zhang, Xiaohan Wei, Yuchen Hao, Michael Tsang, Wenjun Wang, Yang Liu, Huayu Li, Yasmine Badr, Jongsoo Park, Jiyan Yang, Dheevatsa Mudigere, Ellie Wen

Figure 1 for DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction
Figure 2 for DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction
Figure 3 for DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction
Figure 4 for DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction

Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, we observe that the practical performance of those designs can vary from dataset to dataset, even when the order of interactions claimed to be captured is the same. That indicates different designs may have different advantages and the interactions captured by them have non-overlapping information. Motivated by this observation, we propose DHEN - a deep and hierarchical ensemble architecture that can leverage strengths of heterogeneous interaction modules and learn a hierarchy of the interactions under different orders. To overcome the challenge brought by DHEN's deeper and multi-layer structure in training, we propose a novel co-designed training system that can further improve the training efficiency of DHEN. Experiments of DHEN on large-scale dataset from CTR prediction tasks attained 0.27\% improvement on the Normalized Entropy (NE) of prediction and 1.2x better training throughput than state-of-the-art baseline, demonstrating their effectiveness in practice.

Viaarxiv icon

Characterizing and Taming Resolution in Convolutional Neural Networks

Oct 28, 2021
Eddie Yan, Liang Luo, Luis Ceze

Figure 1 for Characterizing and Taming Resolution in Convolutional Neural Networks
Figure 2 for Characterizing and Taming Resolution in Convolutional Neural Networks
Figure 3 for Characterizing and Taming Resolution in Convolutional Neural Networks
Figure 4 for Characterizing and Taming Resolution in Convolutional Neural Networks

Image resolution has a significant effect on the accuracy and computational, storage, and bandwidth costs of computer vision model inference. These costs are exacerbated when scaling out models to large inference serving systems and make image resolution an attractive target for optimization. However, the choice of resolution inherently introduces additional tightly coupled choices, such as image crop size, image detail, and compute kernel implementation that impact computational, storage, and bandwidth costs. Further complicating this setting, the optimal choices from the perspective of these metrics are highly dependent on the dataset and problem scenario. We characterize this tradeoff space, quantitatively studying the accuracy and efficiency tradeoff via systematic and automated tuning of image resolution, image quality and convolutional neural network operators. With the insights from this study, we propose a dynamic resolution mechanism that removes the need to statically choose a resolution ahead of time.

Viaarxiv icon

Cloud Collectives: Towards Cloud-aware Collectives forML Workloads with Rank Reordering

May 28, 2021
Liang Luo, Jacob Nelson, Arvind Krishnamurthy, Luis Ceze

Figure 1 for Cloud Collectives: Towards Cloud-aware Collectives forML Workloads with Rank Reordering
Figure 2 for Cloud Collectives: Towards Cloud-aware Collectives forML Workloads with Rank Reordering
Figure 3 for Cloud Collectives: Towards Cloud-aware Collectives forML Workloads with Rank Reordering
Figure 4 for Cloud Collectives: Towards Cloud-aware Collectives forML Workloads with Rank Reordering

ML workloads are becoming increasingly popular in the cloud. Good cloud training performance is contingent on efficient parameter exchange among VMs. We find that Collectives, the widely used distributed communication algorithms, cannot perform optimally out of the box due to the hierarchical topology of datacenter networks and multi-tenancy nature of the cloudenvironment.In this paper, we present Cloud Collectives , a prototype that accelerates collectives by reordering theranks of participating VMs such that the communication pattern dictated by the selected collectives operation best exploits the locality in the network.Collectives is non-intrusive, requires no code changes nor rebuild of an existing application, and runs without support from cloud providers. Our preliminary application of Cloud Collectives on allreduce operations in public clouds results in a speedup of up to 3.7x in multiple microbenchmarks and 1.3x in real-world workloads of distributed training of deep neural networks and gradient boosted decision trees using state-of-the-art frameworks.

Viaarxiv icon

Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks

Apr 23, 2021
Chien-Yu Lin, Liang Luo, Luis Ceze

Figure 1 for Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks
Figure 2 for Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks
Figure 3 for Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks
Figure 4 for Accelerating SpMM Kernel with Cache-First Edge Sampling for Graph Neural Networks

Graph neural networks (GNNs), an emerging deep learning model class, can extract meaningful representations from highly expressive graph-structured data and are therefore gaining popularity for wider ranges of applications. However, current GNNs suffer from the poor performance of their sparse-dense matrix multiplication (SpMM) operator, even when using powerful GPUs. Our analysis shows that 95% of the inference time could be spent on SpMM when running popular GNN models on NVIDIA's advanced V100 GPU. Such SpMM performance bottleneck hinders GNNs' applicability to large-scale problems or the development of more sophisticated GNN models. To address this inference time bottleneck, we introduce ES-SpMM, a cache-first edge sampling mechanism and codesigned SpMM kernel. ES-SpMM uses edge sampling to downsize the graph to fit into GPU's shared memory. It thus reduces the computation cost and improves SpMM's cache locality. To evaluate ES-SpMM's performance, we integrated it with a popular GNN framework, DGL, and tested it using representative GNN models and datasets. Our results show that ES-SpMM outperforms the highly optimized cuSPARSE SpMM kernel by up to 4.35x with no accuracy loss and by 45.3x with less than a 1% accuracy loss.

Viaarxiv icon

High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models

Apr 15, 2021
Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, Jie Amy Yang, Leon Gao, Dmytro Ivchenko, Aarti Basant, Yuxi Hu, Jiyan Yang, Ehsan K. Ardestani, Xiaodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu, Serhat Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng, Yinbin Ma, Junjie Yang, Ellie Wen, Hong Li, Lin Yang, Chonglin Sun, Whitney Zhao, Dimitry Melts, Krishna Dhulipala, KR Kishore, Tyler Graf, Assaf Eisenman, Kiran Kumar Matam, Adi Gangidi, Guoqiang Jerry Chen, Manoj Krishnan, Avinash Nayak, Krishnakumar Nair, Bharath Muthiah, Mahmoud khorashadi, Pallab Bhattacharya, Petr Lapukhov, Maxim Naumov, Lin Qiao, Mikhail Smelyanskiy, Bill Jia, Vijay Rao

Figure 1 for High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models
Figure 2 for High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models
Figure 3 for High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models
Figure 4 for High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models

Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 Trillion parameters and show that we can attain 40X speedup in terms of time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism (iii) developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row, column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates (v) leveraging reduced precision communications, multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for the robust and efficient end-to-end training in production environments.

Viaarxiv icon