Carole-Jean Wu

MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

Oct 18, 2023
Samuel Hsia, Alicia Golden, Bilge Acun, Newsha Ardalani, Zachary DeVito, Gu-Yeon Wei, David Brooks, Carole-Jean Wu

Training and deploying large machine learning (ML) models is time-consuming and requires significant distributed computing infrastructure. Based on real-world large-model training on datacenter-scale infrastructure, we show that 14-32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency, we develop an agile performance modeling framework to guide parallelization and hardware-software co-design strategies. Using a suite of real-world large ML models on state-of-the-art GPU training hardware, we demonstrate 2.24x and 5.27x throughput improvement potential for pre-training and inference scenarios, respectively.
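
To make the notion of "communication with no overlapping computation" concrete, here is a minimal analytical sketch of exposed communication time per training iteration. The timings and overlap fraction are hypothetical placeholders; this is not the paper's modeling framework.

```python
# Minimal sketch of an analytical step-time model for exposed (non-overlapped)
# communication. The timings and overlap fraction below are hypothetical
# placeholders, not numbers from the paper.

def exposed_comm_fraction(compute_ms: float, comm_ms: float, overlap: float) -> float:
    """Fraction of an iteration spent on communication with no overlapping compute.

    overlap: share of communication that can be hidden behind computation (0..1).
    """
    hidden = min(comm_ms * overlap, compute_ms)   # comm hidden under compute
    exposed = comm_ms - hidden                    # comm left on the critical path
    iteration_ms = compute_ms + exposed
    return exposed / iteration_ms

if __name__ == "__main__":
    # Example: 70 ms compute, 40 ms of collectives, 60% of which is overlappable.
    print(f"exposed comm share: {exposed_comm_fraction(70.0, 40.0, 0.6):.1%}")
```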

GEVO-ML: Optimizing Machine Learning Code with Evolutionary Computation

Oct 16, 2023
Jhe-Yu Liou, Stephanie Forrest, Carole-Jean Wu

Parallel accelerators, such as GPUs, are key enablers for large-scale Machine Learning (ML) applications. However, ML model developers often lack detailed knowledge of the underlying system architectures, while system programmers usually do not have a high-level understanding of the ML model that runs on the specific system. To mitigate this gap between two relevant aspects of domain knowledge, this paper proposes GEVO-ML, a tool for automatically discovering optimization opportunities and tuning the performance of ML kernels, where the model and training/prediction processes are uniformly represented in a single intermediate language, the Multi-Level Intermediate Representation (MLIR). GEVO-ML uses multi-objective evolutionary search to find edits (mutations) to MLIR code that ultimately runs on GPUs, improving performance on desired criteria while retaining required functionality. We demonstrate GEVO-ML on two different ML workloads for both model training and prediction. GEVO-ML finds significant Pareto improvements for these models, achieving a 90.43% performance improvement when model accuracy is relaxed by 2%, from 91.2% to 89.3%. For the training workloads, GEVO-ML finds a 4.88% improvement in model accuracy, from 91% to 96%, without sacrificing training or testing speed. Our analysis of key GEVO-ML mutations reveals diverse code modifications that, while often foreign to human developers, achieve effects similar to those of changes human developers make to improve model design, for example, adjusting learning rates or pruning non-essential layer parameters.
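
The core search loop can be illustrated with a toy multi-objective evolutionary sketch: mutate candidate edit lists, evaluate two objectives, and keep the Pareto front. The fitness function and edit pool below are synthetic stand-ins, not MLIR or GEVO-ML's actual operators.

```python
# Toy sketch of a multi-objective evolutionary search over code "edits":
# mutate candidate edit lists, evaluate (runtime, accuracy), keep the Pareto
# front. The fitness model is a synthetic placeholder, not MLIR evaluation.
import random

def evaluate(edits):
    # Hypothetical objectives: lower runtime is better, higher accuracy is better.
    runtime = 100.0 - 2.0 * len(edits) + random.uniform(-1, 1)
    accuracy = 0.91 - 0.002 * len(edits) + random.uniform(-0.005, 0.005)
    return runtime, accuracy

def dominates(a, b):
    # a dominates b if it is no worse in both objectives and strictly better in one.
    return (a[0] <= b[0] and a[1] >= b[1]) and (a[0] < b[0] or a[1] > b[1])

def pareto_front(population):
    scored = [(ind, evaluate(ind)) for ind in population]
    return [(ind, f) for ind, f in scored
            if not any(dominates(g, f) for _, g in scored if g is not f)]

def mutate(edits, edit_pool):
    child = list(edits)
    if child and random.random() < 0.5:
        child.pop(random.randrange(len(child)))   # drop an edit
    else:
        child.append(random.choice(edit_pool))    # add a random edit
    return child

random.seed(0)
edit_pool = [f"edit_{i}" for i in range(50)]      # placeholder mutation operators
population = [[random.choice(edit_pool)] for _ in range(20)]
for _ in range(30):
    parents = [ind for ind, _ in pareto_front(population)]
    population = parents + [mutate(random.choice(parents), edit_pool)
                            for _ in range(20 - len(parents))]
print(f"Pareto front size: {len(pareto_front(population))}")
```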

READ: Recurrent Adaptation of Large Transformers

May 24, 2023
Sid Wang, John Nguyen, Ke Li, Carole-Jean Wu

Fine-tuning large-scale Transformers has led to the explosion of many AI applications across Natural Language Processing and Computer Vision tasks. However, fine-tuning all pre-trained model parameters becomes impractical as the model size and number of tasks increase. Parameter-efficient transfer learning (PETL) methods aim to address these challenges. While effective in reducing the number of trainable parameters, PETL methods still require significant energy and computational resources to fine-tune. In this paper, we introduce \textbf{RE}current \textbf{AD}aptation (READ) -- a lightweight and memory-efficient fine-tuning method -- to overcome the limitations of the current PETL approaches. Specifically, READ inserts a small RNN network alongside the backbone model so that the model does not have to back-propagate through the large backbone network. Through comprehensive empirical evaluation on the GLUE benchmark, we demonstrate that READ can achieve a $56\%$ reduction in training memory consumption and an $84\%$ reduction in GPU energy usage while retaining high model quality compared to full-tuning. Additionally, the model size of READ does not grow with the backbone model size, making it a highly scalable solution for fine-tuning large Transformers.
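
A minimal PyTorch sketch of the side-network idea described above: a small RNN reads the frozen backbone's intermediate states (detached, so gradients never flow through the backbone) and produces a correction added to the backbone output. The toy backbone, layer sizes, and training step are illustrative assumptions, not the paper's architecture.

```python
# Sketch: a small GRU side network adapts a frozen backbone without
# back-propagating through it. All module shapes here are placeholders.
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    def __init__(self, dim=64, layers=4):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(layers)])
        for p in self.parameters():
            p.requires_grad_(False)          # backbone stays frozen

    def forward(self, x):
        states = []
        for block in self.blocks:
            x = torch.relu(block(x))
            states.append(x)                 # intermediate states per layer
        return x, states

class ReadStyleSideNetwork(nn.Module):
    def __init__(self, dim=64, hidden=32):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, dim)

    def forward(self, states):
        # Treat per-layer states as a sequence over depth: (batch, layers, dim).
        seq = torch.stack([s.detach() for s in states], dim=1)   # cut backbone grads
        _, h = self.rnn(seq)
        return self.out(h[-1])               # correction added to backbone output

backbone, side = FrozenBackbone(), ReadStyleSideNetwork()
opt = torch.optim.Adam(side.parameters(), lr=1e-3)   # only the side network trains
x, target = torch.randn(8, 64), torch.randn(8, 64)
y, states = backbone(x)
loss = nn.functional.mse_loss(y + side(states), target)
loss.backward()                              # gradients touch only the side network
opt.step()
```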

Green Federated Learning

Mar 26, 2023
Ashkan Yousefpour, Shen Guo, Ashish Shenoy, Sayan Ghosh, Pierre Stock, Kiwan Maeng, Schalk-Willem Krüger, Michael Rabbat, Carole-Jean Wu, Ilya Mironov

The rapid progress of AI is fueled by increasingly large and computationally intensive machine learning models and datasets. As a consequence, the amount of compute used in training state-of-the-art models is exponentially increasing (doubling every 10 months between 2015 and 2022), resulting in a large carbon footprint. Federated Learning (FL) - a collaborative machine learning technique for training a centralized model using data of decentralized entities - can also be resource-intensive and have a significant carbon footprint, particularly when deployed at scale. Unlike centralized AI that can reliably tap into renewables at strategically placed data centers, cross-device FL may leverage as many as hundreds of millions of globally distributed end-user devices with diverse energy sources. Green AI is a novel and important research area where carbon footprint is regarded as an evaluation criterion for AI, alongside accuracy, convergence speed, and other metrics. In this paper, we propose the concept of Green FL, which involves optimizing FL parameters and making design choices to minimize carbon emissions consistent with competitive performance and training time. The contributions of this work are two-fold. First, we adopt a data-driven approach to quantify the carbon emissions of FL by directly measuring real-world at-scale FL tasks running on millions of phones. Second, we present challenges, guidelines, and lessons learned from studying the trade-off between energy efficiency, performance, and time-to-train in a production FL system. Our findings offer valuable insights into how FL can reduce its carbon footprint, and they provide a foundation for future research in the area of Green AI.
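
The basic accounting behind such a quantification can be sketched as device energy multiplied by regional grid carbon intensity, summed over regions. All numbers below are made-up placeholders, not measurements from the paper.

```python
# Back-of-the-envelope sketch: aggregate cross-device FL emissions as
# per-region device energy times that region's grid carbon intensity.
# Every figure here is a hypothetical placeholder.

# (device_hours, avg_device_power_watts, grid_gCO2e_per_kWh) per region
regions = {
    "region_a": (1_000_000, 3.0, 400),
    "region_b": (  250_000, 2.5, 150),
}

total_kg = 0.0
for name, (hours, watts, intensity) in regions.items():
    kwh = hours * watts / 1000.0          # device energy in kWh
    kg = kwh * intensity / 1000.0         # gCO2e/kWh -> kgCO2e
    total_kg += kg
    print(f"{name}: {kg:,.0f} kgCO2e")
print(f"total: {total_kg:,.0f} kgCO2e")
```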

Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference

Mar 10, 2023
Haiyang Huang, Newsha Ardalani, Anna Sun, Liu Ke, Hsien-Hsin S. Lee, Anjali Sridhar, Shruti Bhosale, Carole-Jean Wu, Benjamin Lee

Mixture-of-Experts (MoE) models have recently gained traction, achieving state-of-the-art performance on a wide range of tasks in computer vision and natural language processing. They effectively expand the model capacity while incurring a minimal increase in computation cost during training. However, deploying such models for inference is difficult due to their large model size and complex communication pattern. In this work, we provide a characterization of two MoE workloads, namely Language Modeling (LM) and Machine Translation (MT), and identify their sources of inefficiencies at deployment. We propose three optimization techniques to mitigate these inefficiencies, namely (1) dynamic gating, (2) expert buffering, and (3) expert load balancing. We show that dynamic gating improves execution time by 1.25-4$\times$ for LM, 2-5$\times$ for MT Encoder and 1.09-1.5$\times$ for MT Decoder. It also reduces memory usage by up to 1.36$\times$ for LM and up to 1.1$\times$ for MT. We further propose Expert Buffering, a new caching mechanism that only keeps hot, active experts in GPU memory while buffering the rest in CPU memory. This reduces static memory allocation by 1.47$\times$. We finally propose a load balancing methodology that provides additional robustness to the workload. The code will be open-sourced upon acceptance.
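
The expert-buffering idea can be illustrated with a small cache sketch: a fixed number of expert weights stay resident on the accelerator, the rest live in host memory and are swapped in on demand. The capacity, fake weights, and LRU policy are illustrative assumptions, not the paper's implementation.

```python
# Sketch of an "expert buffering" cache: hot experts resident "on GPU",
# cold experts kept "in CPU memory" and fetched on demand (LRU eviction).
from collections import OrderedDict

class ExpertBuffer:
    def __init__(self, num_experts: int, gpu_capacity: int):
        self.cpu_store = {e: f"weights_of_expert_{e}" for e in range(num_experts)}
        self.gpu_cache = OrderedDict()      # expert_id -> weights "on GPU"
        self.capacity = gpu_capacity

    def fetch(self, expert_id: int):
        if expert_id in self.gpu_cache:     # hit: mark as most recently used
            self.gpu_cache.move_to_end(expert_id)
            return self.gpu_cache[expert_id]
        if len(self.gpu_cache) >= self.capacity:
            self.gpu_cache.popitem(last=False)   # evict the coldest expert
        weights = self.cpu_store[expert_id]      # "copy" host -> device
        self.gpu_cache[expert_id] = weights
        return weights

buf = ExpertBuffer(num_experts=64, gpu_capacity=8)
for expert in [3, 7, 3, 12, 3, 7]:          # routing decisions from the gate
    buf.fetch(expert)
print(list(buf.gpu_cache))                   # hot experts currently resident
```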

MP-Rec: Hardware-Software Co-Design to Enable Multi-Path Recommendation

Feb 21, 2023
Samuel Hsia, Udit Gupta, Bilge Acun, Newsha Ardalani, Pan Zhong, Gu-Yeon Wei, David Brooks, Carole-Jean Wu

Deep learning recommendation systems serve personalized content under diverse tail-latency targets and input-query loads. In order to do so, state-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences over large bodies of content. Relying on a fixed embedding representation for these tables not only imposes significant memory capacity and bandwidth requirements but also limits the scope of compatible system solutions. This paper challenges the assumption of fixed embedding representations by showing how synergies between embedding representations and hardware platforms can lead to improvements in both algorithmic and system performance. Based on our characterization of various embedding representations, we propose a hybrid embedding representation that achieves higher quality embeddings at the cost of increased memory and compute requirements. To address the system performance challenges of the hybrid representation, we propose MP-Rec -- a co-design technique that exploits heterogeneity and dynamic selection of embedding representations and underlying hardware platforms. On real system hardware, we demonstrate how matching custom accelerators, i.e., GPUs, TPUs, and IPUs, with compatible embedding representations can lead to a 16.65x performance speedup. Additionally, in query-serving scenarios, MP-Rec achieves 2.49x and 3.76x higher correct prediction throughput and 0.19% and 0.22% better model quality on a CPU-GPU system for the Kaggle and Terabyte datasets, respectively.
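
The kind of dynamic path selection described above can be sketched as follows: pick an (embedding representation, hardware) pair that meets the latency target at the current query load, preferring higher-quality paths. The latency model and candidate paths are hypothetical placeholders, not MP-Rec's actual policy.

```python
# Toy sketch of dynamic (embedding representation, hardware) path selection
# under a latency target. All paths and cost numbers are placeholders.

# (name, quality_score, base_latency_ms, per_query_ms)
PATHS = [
    ("hybrid_embedding_on_accelerator", 0.80, 2.0, 0.020),
    ("hybrid_embedding_on_gpu",         0.78, 1.5, 0.035),
    ("table_embedding_on_cpu",          0.70, 0.5, 0.010),
]

def select_path(queries_per_batch: int, latency_target_ms: float) -> str:
    feasible = [
        (quality, name) for name, quality, base, per_q in PATHS
        if base + per_q * queries_per_batch <= latency_target_ms
    ]
    if not feasible:
        return "table_embedding_on_cpu"      # fall back to the cheapest path
    return max(feasible)[1]                  # best quality among feasible paths

print(select_path(queries_per_batch=50, latency_target_ms=5.0))   # high-quality path
print(select_path(queries_per_batch=400, latency_target_ms=5.0))  # cheap path under load
```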

FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models

Jan 08, 2023
Geet Sethi, Pallab Bhattacharya, Dhruv Choudhary, Carole-Jean Wu, Christos Kozyrakis

Sequence-based deep learning recommendation models (DLRMs) are an emerging class of DLRMs showing great improvements over their prior sum-pooling based counterparts at capturing users' long-term interests. These improvements come at immense system cost, however, with sequence-based DLRMs requiring substantial amounts of data to be dynamically materialized and communicated by each accelerator during a single iteration. To address this rapidly growing bottleneck, we present FlexShard, a new tiered sequence embedding table sharding algorithm which operates at a per-row granularity by exploiting the insight that not every row is equal. Through precise replication of embedding rows based on their underlying probability distribution, along with the introduction of a new sharding strategy adapted to the heterogeneous, skewed performance of real-world cluster network topologies, FlexShard is able to significantly reduce communication demand while using no additional memory compared to the prior state-of-the-art. When evaluated on production-scale sequence DLRMs, FlexShard was able to reduce overall global all-to-all communication traffic by over 85%, resulting in end-to-end training communication latency improvements of almost 6x over the prior state-of-the-art approach.
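
A simplified sketch of per-row placement in this spirit: rows whose access probability exceeds a threshold are replicated on every device so reads stay local, while the remaining cold rows are partitioned across devices. The threshold, probabilities, and device count are illustrative assumptions, not FlexShard's actual tiered algorithm.

```python
# Sketch: replicate hot embedding rows, shard cold ones round-robin.
# All probabilities and the threshold are placeholders.

def place_rows(row_access_prob, num_devices, replicate_threshold=0.05):
    placement = {}                                   # row_id -> list of devices
    cold_rows = []
    for row_id, prob in enumerate(row_access_prob):
        if prob >= replicate_threshold:
            placement[row_id] = list(range(num_devices))   # replicate hot row
        else:
            cold_rows.append(row_id)
    for i, row_id in enumerate(cold_rows):                 # shard cold rows
        placement[row_id] = [i % num_devices]
    return placement

probs = [0.30, 0.20, 0.08, 0.01, 0.01, 0.005, 0.004, 0.001]
for row, devices in place_rows(probs, num_devices=4).items():
    print(f"row {row}: devices {devices}")
```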

FedGPO: Heterogeneity-Aware Global Parameter Optimization for Efficient Federated Learning

Nov 30, 2022
Young Geun Kim, Carole-Jean Wu

Federated learning (FL) has emerged as a solution to deal with the risk of privacy leaks in machine learning training. This approach allows a variety of mobile devices to collaboratively train a machine learning model without sharing the raw on-device training data with the cloud. However, efficient edge deployment of FL is challenging because of system/data heterogeneity and runtime variance. This paper optimizes the energy efficiency of FL use cases while guaranteeing model convergence, by accounting for the aforementioned challenges. We propose FedGPO, a reinforcement learning-based approach that learns to identify optimal global parameters (B, E, K) for each FL aggregation round while adapting to system/data heterogeneity and stochastic runtime variance. In our experiments, FedGPO improves model convergence time by 2.4x and achieves 3.6x higher energy efficiency over the baseline settings.
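
A minimal sketch of choosing (B, E, K) per round with a simple epsilon-greedy learner is given below. The reward model is a synthetic energy/progress proxy and the action grid is a placeholder; it is not FedGPO's actual reinforcement learning formulation.

```python
# Epsilon-greedy sketch: pick global FL parameters (B = local batch size,
# E = local epochs, K = participating clients) each aggregation round.
# The reward is a made-up proxy trading off progress against energy.
import itertools
import random

ACTIONS = list(itertools.product([16, 32, 64],      # B
                                 [1, 2, 4],          # E
                                 [10, 50, 100]))     # K
q_values = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}

def simulated_reward(b, e, k):
    # Hypothetical proxy: more local work/clients aids convergence but costs energy.
    progress = 0.1 * e + 0.02 * k + 0.001 * b
    energy = 0.05 * e * k + 0.0005 * b * k
    return progress - energy + random.gauss(0, 0.1)

random.seed(0)
for _ in range(200):                                 # aggregation rounds
    if random.random() < 0.1:                        # explore
        action = random.choice(ACTIONS)
    else:                                            # exploit best estimate
        action = max(ACTIONS, key=q_values.get)
    reward = simulated_reward(*action)
    counts[action] += 1
    q_values[action] += (reward - q_values[action]) / counts[action]

print(f"selected (B, E, K) = {max(ACTIONS, key=q_values.get)}")
```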

* 12 pages, 12 figures, IEEE International Symposium on Workload Characterization (IISWC) 

RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure

Nov 14, 2022
Mark Zhao, Dhruv Choudhary, Devashish Tyagi, Ajay Somani, Max Kaplan, Sung-Han Lin, Sarunya Pumma, Jongsoo Park, Aarti Basant, Niket Agarwal, Carole-Jean Wu, Christos Kozyrakis

We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactions. While each user session can generate multiple training samples, many features' values do not change across these samples. We demonstrate how RecD exploits this property, end-to-end, across a deployed training pipeline. RecD optimizes data generation pipelines to decrease dataset storage and preprocessing resource demands and to maximize duplication within a training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors (IKJTs), to deduplicate feature values in each batch. We show how DLRM model architectures can leverage IKJTs to drastically increase training throughput. RecD improves the training and preprocessing throughput and storage efficiency by up to 2.49x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM training system.
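
The batch-level deduplication idea can be sketched as storing each distinct per-sample feature value list once and keeping only an inverse index per sample. The function and variable names below are illustrative, not the actual IKJT layout or API.

```python
# Sketch of batch-level feature deduplication: identical per-sample feature
# value lists are stored once, with an inverse index per sample.

def dedup_batch(feature_lists):
    """feature_lists: one list of feature values per sample in the batch."""
    unique_values = []            # each distinct value list stored once
    key_to_index = {}
    inverse = []                  # per-sample index into unique_values
    for values in feature_lists:
        key = tuple(values)
        if key not in key_to_index:
            key_to_index[key] = len(unique_values)
            unique_values.append(values)
        inverse.append(key_to_index[key])
    return unique_values, inverse

# Several samples from the same user session share identical feature values.
batch = [[101, 205, 33], [101, 205, 33], [7, 42], [101, 205, 33]]
uniques, inverse = dedup_batch(batch)
print(uniques)    # [[101, 205, 33], [7, 42]]
print(inverse)    # [0, 0, 1, 0]
```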
