Tal Ben-Nun

A Data-Centric Optimization Framework for Machine Learning

Oct 20, 2021

Oliver Rausch, Tal Ben-Nun, Nikoli Dryden, Andrei Ivanov, Shigang Li, Torsten Hoefler

Abstract: Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute. However, as frameworks specialize optimization to patterns in popular networks, they implicitly constrain novel and diverse models that drive progress in research. We empower deep learning researchers by defining a flexible and user-customizable pipeline for optimizing training of arbitrary deep neural networks, based on data movement minimization. The pipeline begins with standard networks in PyTorch or ONNX and transforms computation through progressive lowering. We define four levels of general-purpose transformations, from local intra-operator optimizations to global data movement reduction. These operate on a data-centric graph intermediate representation that expresses computation and data movement at all levels of abstraction, including expanding basic operators such as convolutions to their underlying computations. Central to the design is the interactive and introspectable nature of the pipeline. Every part is extensible through a Python API, and can be tuned interactively using a GUI. We demonstrate competitive performance or speedups on ten different networks, with interactive optimizations discovering new opportunities in EfficientNet.

Learning Combinatorial Node Labeling Algorithms

Jun 15, 2021

Lukas Gianinazzi, Maximilian Fries, Nikoli Dryden, Tal Ben-Nun, Maciej Besta, Torsten Hoefler

Abstract: We present a graph neural network to learn graph coloring heuristics using reinforcement learning. Our learned deterministic heuristics give better solutions than classical degree-based greedy heuristics and only take seconds to evaluate on graphs with tens of thousands of vertices. As our approach is based on policy gradients, it also learns a probabilistic policy. These probabilistic policies outperform all greedy coloring baselines and a machine learning baseline. Our approach generalizes several previous machine-learning frameworks, which were applied to problems like minimum vertex cover. We also demonstrate that our approach outperforms two greedy heuristics on minimum vertex cover.
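For readers unfamiliar with the baseline mentioned in this abstract, the following is a minimal sketch of one classical degree-based greedy coloring heuristic (largest degree first, smallest free color). It is an illustration only, not the learned method from the paper; the function names and the example graph are assumptions introduced here.

```python
def greedy_coloring(adjacency):
    """Color vertices in order of decreasing degree with the smallest free color.

    `adjacency` maps each vertex to a list of its neighbors (undirected graph).
    This is a generic textbook heuristic, not the paper's learned policy.
    """
    # Visit high-degree vertices first; break ties by vertex id for determinism.
    order = sorted(adjacency, key=lambda v: (-len(adjacency[v]), v))
    colors = {}
    for v in order:
        # Colors already taken by neighbors that have been colored so far.
        used = {colors[u] for u in adjacency[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def is_proper(adjacency, colors):
    """Check that no edge connects two vertices of the same color."""
    return all(colors[u] != colors[v] for u in adjacency for v in adjacency[u])


# Hypothetical example: a cycle on five vertices, which needs three colors.
graph = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
coloring = greedy_coloring(graph)
num_colors = max(coloring.values()) + 1
```

A learned heuristic in the spirit of the abstract would replace the fixed vertex ordering with a policy that chooses which vertex to label next.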

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks

Jan 31, 2021

Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste

Abstract: The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well as, if not better than, the original dense networks. Sparsity can reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever-growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation, the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned parameter efficiency that could serve as a baseline for comparison of different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field.

* 90 pages, 26 figures
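As a concrete illustration of "removing elements" from a network, here is a minimal sketch of magnitude pruning, one widely used sparsification scheme. The abstract above does not single out this scheme; the function name, the flat weight list, and the sparsity value are assumptions made for the example.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    `weights` is a flat list of floats standing in for one layer's parameters
    (an assumption for this sketch; real layers are tensors).
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    num_to_drop = int(len(weights) * sparsity)
    # Indices of the weights with the smallest absolute values.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:num_to_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


# Hypothetical layer with six weights, pruned to 50% sparsity.
weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4]
pruned = magnitude_prune(weights, 0.5)
# pruned == [0.8, 0.0, 0.0, -0.9, 0.0, 0.4]
```

Note that zeroing weights alone saves neither memory nor time; as the abstract points out, separate mechanisms are needed to exploit the resulting sparsity on real hardware.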

Clairvoyant Prefetching for Distributed Machine Learning I/O

Jan 21, 2021

Roman Böhringer, Nikoli Dryden, Tal Ben-Nun, Torsten Hoefler

Abstract: I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments such as clouds and supercomputers. Optimal data ingestion pipelines differ between systems, and increasing efficiency requires a delicate balance between access to local storage, external filesystems, and remote workers; yet existing frameworks fail to efficiently utilize such resources. We observe that, given the seed generating the random access pattern for training with SGD, we have clairvoyance and can exactly predict when a given sample will be accessed. We combine this with a theoretical analysis of access patterns in training and performance modeling to produce a novel machine learning I/O middleware, HDMLP, to tackle the I/O bottleneck. HDMLP provides an easy-to-use, flexible, and scalable solution that delivers better performance than state-of-the-art approaches while requiring very few changes to existing codebases and supporting a broad range of environments.

* 15 pages, 11 figures
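The "clairvoyance" observation in the abstract above can be sketched in a few lines: if the shuffle for each epoch is driven by a known seed, any process can recompute the full access order ahead of time. This is not HDMLP's API; the per-epoch seeding scheme, the round-robin assignment of positions to workers, and all names below are assumptions chosen for illustration.

```python
import random


def epoch_order(seed, epoch, num_samples):
    """Recompute the shuffled sample order for one epoch from the seed."""
    # Assumed seeding scheme: one independent generator per (seed, epoch) pair.
    rng = random.Random(f"{seed}-{epoch}")
    order = list(range(num_samples))
    rng.shuffle(order)
    return order


def access_schedule(seed, num_epochs, num_samples, num_workers, rank):
    """List (global_step, sample_id) pairs for every access made by `rank`.

    Positions in each epoch's order are dealt round-robin to workers,
    which is an assumption of this sketch.
    """
    schedule = []
    for epoch in range(num_epochs):
        order = epoch_order(seed, epoch, num_samples)
        for pos in range(rank, num_samples, num_workers):
            schedule.append((epoch * num_samples + pos, order[pos]))
    return schedule


# Hypothetical setup: 10 samples, 3 workers, 2 epochs, seed 42.
plan = access_schedule(seed=42, num_epochs=2, num_samples=10,
                       num_workers=3, rank=1)
```

Because the schedule is a pure function of the seed, a prefetcher could use it to decide which samples to stage in local storage before they are requested.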

Deep Data Flow Analysis

Nov 21, 2020

Chris Cummins, Hugh Leather, Zacharias Fisches, Tal Ben-Nun, Torsten Hoefler, Michael O'Boyle

Abstract: Compiler architects increasingly look to machine learning when building heuristics for compiler optimization. The promise of automatic heuristic design, freeing the compiler engineer from the complex interactions of program, architecture, and other optimizations, is alluring. However, most machine learning methods cannot replicate even the simplest of the abstract interpretations of data flow analysis that are critical to making good optimization decisions. This must change for machine learning to become the dominant technology in compiler heuristics. To this end, we propose ProGraML - Program Graphs for Machine Learning - a language-independent, portable representation of whole-program semantics for deep learning. To benchmark current and future learning techniques for compiler analyses we introduce an open dataset of 461k Intermediate Representation (IR) files for LLVM, covering five source programming languages, and 15.4M corresponding data flow results. We formulate data flow analysis as an MPNN and show that, using ProGraML, standard analyses can be learned, yielding improved performance on downstream compiler optimization tasks.

* 9 pages, plus appendices. arXiv admin note: text overlap with arXiv:2003.10536
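To make the connection between data flow analysis and message passing more tangible, here is a minimal sketch of a classical iterative analysis (forward reachability on a control flow graph), in which facts propagate along edges until nothing changes. It is the kind of hand-written analysis a learned model would be asked to reproduce, not the paper's MPNN; the graph encoding and names are assumptions.

```python
def reachable_from(successors, root):
    """Compute the set of nodes reachable from `root` by fixed-point iteration.

    `successors` maps a node to the list of nodes it can branch to.
    Each sweep passes the "reachable" fact one edge further, loosely
    analogous to one round of message passing.
    """
    reached = {root}
    changed = True
    while changed:
        changed = False
        for node in list(reached):
            for succ in successors.get(node, ()):
                if succ not in reached:
                    reached.add(succ)
                    changed = True
    return reached


# Hypothetical control flow graph with one unreachable block ("dead").
cfg = {
    "entry": ["a", "b"],
    "a": ["exit"],
    "b": ["exit"],
    "dead": ["exit"],
}
live_blocks = reachable_from(cfg, "entry")
```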

Data Movement Is All You Need: A Case Study on Optimizing Transformers

Jul 02, 2020

Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, Torsten Hoefler

Abstract: Transformers have become widely used for language modeling and sequence learning tasks, and are one of the most important machine learning workloads today. Training one is a very compute-intensive task, often taking days or weeks, and significant attention has been given to optimizing transformers. Despite this, existing implementations do not efficiently utilize GPUs. We find that data movement is the key bottleneck when training. Due to Amdahl's Law and massive improvements in compute performance, training has now become memory-bound. Further, existing frameworks use suboptimal data layouts. Using these insights, we present a recipe for globally optimizing data movement in transformers. We reduce data movement by up to 22.91% and overall achieve a 1.30x performance improvement over state-of-the-art frameworks when training BERT. Our approach is applicable more broadly to optimizing deep neural networks, and offers insight into how to tackle emerging performance bottlenecks.

* 15 pages, 6 figures; minor clarifications and style updates

Deep Learning for Post-Processing Ensemble Weather Forecasts

May 18, 2020

Peter Grönquist, Chengyuan Yao, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Shigang Li, Torsten Hoefler

Abstract: Quantifying uncertainty in weather forecasts typically employs ensemble prediction systems, which consist of many perturbed trajectories run in parallel. These systems are associated with a high computational cost and often include statistical post-processing steps to inexpensively improve their raw prediction qualities. We propose a mixed prediction and post-processing model based on a subset of the original trajectories. In the model, we implement methods from deep learning to account for non-linear relationships that are not captured by current numerical models or other post-processing methods. Applied to global data, our mixed models achieve a relative improvement of the ensemble forecast skill of over 13%. We demonstrate that this is especially the case for extreme weather events on selected case studies, where we see an improvement in predictions by up to 26%. In addition, by using only half the trajectories, the computational costs of ensemble prediction systems can potentially be reduced, allowing weather forecasting pipelines to run higher resolution trajectories, and resulting in even more accurate raw ensemble forecasts.

Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

Apr 30, 2020

Shigang Li, Tal Ben-Nun, Dan Alistarh, Salvatore Di Girolamo, Nikoli Dryden, Torsten Hoefler

Abstract: Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates equivalent to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD, WAGMA-SGD significantly improves training throughput (by 2.1x on 1,024 GPUs) and achieves the fastest time-to-solution.
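The idea of exchanging weights within subgroups rather than globally can be sketched with a toy simulation. This is not the WAGMA-SGD algorithm: the wait-avoiding behavior is not modeled, and the rotating grouping rule, the function name, and the one-dimensional "models" are all assumptions made to show how repeated group averaging spreads information across workers.

```python
def group_average(models, group_size, step):
    """Average weight vectors within groups of workers.

    `models[i]` is the weight vector (list of floats) held by worker i.
    Group membership is rotated by `step` so that different workers meet
    in different iterations (an assumed rule, chosen for illustration).
    """
    n = len(models)
    if n % group_size != 0:
        raise ValueError("number of workers must be divisible by group_size")
    order = [(i + step) % n for i in range(n)]
    averaged = [None] * n
    for start in range(0, n, group_size):
        members = order[start:start + group_size]
        dim = len(models[members[0]])
        mean = [sum(models[m][d] for m in members) / group_size
                for d in range(dim)]
        for m in members:
            averaged[m] = list(mean)
    return averaged


# Hypothetical run: four workers, groups of two, two averaging rounds.
models = [[0.0], [2.0], [4.0], [6.0]]
after_round_0 = group_average(models, group_size=2, step=0)
after_round_1 = group_average(after_round_0, group_size=2, step=1)
```

In this toy run every worker reaches the global mean after two group exchanges, without any single operation involving all four workers.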

ProGraML: Graph-based Deep Learning for Program Optimization and Analysis

Mar 23, 2020

Chris Cummins, Zacharias V. Fisches, Tal Ben-Nun, Torsten Hoefler, Hugh Leather

Abstract: The increasing complexity of computing systems places a tremendous burden on optimizing compilers, requiring ever more accurate and aggressive optimizations. Machine learning offers significant benefits for constructing optimization heuristics but there remains a gap between what state-of-the-art methods achieve and the performance of an optimal heuristic. Closing this gap requires improvements in two key areas: a representation that accurately captures the semantics of programs, and a model architecture with sufficient expressiveness to reason about this representation. We introduce ProGraML - Program Graphs for Machine Learning - a novel graph-based program representation using a low-level, language-agnostic, and portable format; and machine learning models capable of performing complex downstream tasks over these graphs. The ProGraML representation is a directed attributed multigraph that captures control, data, and call relations, and summarizes instruction and operand types and ordering. Message Passing Neural Networks propagate information through this structured representation, enabling whole-program or per-vertex classification tasks. ProGraML provides a general-purpose program representation that equips learnable models to perform the types of program analysis that are fundamental to optimization. To this end, we evaluate the performance of our approach first on a suite of traditional compiler analysis tasks: control flow reachability, dominator trees, data dependencies, variable liveness, and common subexpression detection. On a benchmark dataset of 250k LLVM-IR files covering six source programming languages, ProGraML achieves an average 94.0 F1 score, significantly outperforming the state-of-the-art approaches. We then apply our approach to two high-level tasks - heterogeneous device mapping and program classification - setting new state-of-the-art performance in both.

* 20 pages, author preprint

Predicting Weather Uncertainty with Deep Convnets

Dec 04, 2019

Peter Grönquist, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Luca Lavarini, Shigang Li, Torsten Hoefler

Abstract: Modern weather forecast models perform uncertainty quantification using ensemble prediction systems, which collect nonparametric statistics based on multiple perturbed simulations. To provide accurate estimation, dozens of such computationally intensive simulations must be run. We show that deep neural networks can be used on a small set of numerical weather simulations to estimate the spread of a weather forecast, significantly reducing computational cost. To train the system, we both modify the 3D U-Net architecture and explore models that incorporate temporal data. Our models serve as a starting point to improve uncertainty quantification in current real-time weather forecasting systems, which is vital for predicting extreme events.

* Poster presentation at NeurIPS 2019 "Machine Learning and the Physical Sciences" Workshop
