Tal Ben-Nun

Predicting Weather Uncertainty with Deep Convnets

Dec 04, 2019

Peter Grönquist, Tal Ben-Nun, Nikoli Dryden, Peter Dueben, Luca Lavarini, Shigang Li, Torsten Hoefler

Abstract: Modern weather forecast models perform uncertainty quantification using ensemble prediction systems, which collect nonparametric statistics based on multiple perturbed simulations. To provide accurate estimation, dozens of such computationally intensive simulations must be run. We show that deep neural networks can be used on a small set of numerical weather simulations to estimate the spread of a weather forecast, significantly reducing computational cost. To train the system, we both modify the 3D U-Net architecture and explore models that incorporate temporal data. Our models serve as a starting point to improve uncertainty quantification in current real-time weather forecasting systems, which is vital for predicting extreme events.

* Poster presentation at NeurIPS 2019 "Machine Learning and the Physical Sciences" Workshop
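
To make the quantity in the abstract above concrete, here is a minimal sketch of a forecast spread computed from an ensemble. It assumes that "spread" means the per-grid-point standard deviation across ensemble members, and the toy temperature values are hypothetical. It does not reproduce the paper's neural network; it only shows the statistic such a model would be trained to estimate from a few members.

```python
import statistics

def ensemble_spread(members):
    """Per-grid-point spread: population standard deviation across members.

    `members` is a list of equally sized lists, one per perturbed simulation.
    """
    if len(members) < 2:
        raise ValueError("need at least two ensemble members")
    # zip(*members) groups the values that share a grid point.
    return [statistics.pstdev(point) for point in zip(*members)]

# Hypothetical toy data: 4 members, 3 grid points (e.g. temperature in K).
full = [
    [280.0, 285.0, 290.0],
    [282.0, 285.0, 288.0],
    [278.0, 285.0, 292.0],
    [280.0, 285.0, 290.0],
]
spread_full = ensemble_spread(full)       # reference from all members
spread_small = ensemble_spread(full[:2])  # cheaper estimate from 2 members
```

The gap between `spread_small` and `spread_full` is the error that a learned model operating on the small set would try to close.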

Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

Aug 13, 2019

Shigang Li, Tal Ben-Nun, Salvatore Di Girolamo, Dan Alistarh, Torsten Hoefler

Abstract: Load imbalance is pervasive in distributed deep learning training systems, caused either by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves a 1.27x speedup over the state-of-the-art synchronous SGD, without losing accuracy.
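
The difference between the solo and majority collectives can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, arrival times stand in for process speed, and processes that miss the collective are assumed to contribute zeros to that step.

```python
def partial_allreduce(grads, arrival, min_contributors):
    """Simulate one partial allreduce over per-process gradients.

    grads[i] is process i's gradient (list of floats); arrival[i] is the
    time at which it becomes ready. The collective fires as soon as
    `min_contributors` processes are ready; later ones contribute zeros
    (an assumption made for this illustration).
    """
    n = len(grads)
    order = sorted(range(n), key=lambda i: arrival[i])
    fire_time = arrival[order[min_contributors - 1]]
    on_time = [i for i in range(n) if arrival[i] <= fire_time]
    dim = len(grads[0])
    total = [sum(grads[i][d] for i in on_time) for d in range(dim)]
    # Average over all processes, as a synchronous allreduce would.
    return [t / n for t in total], fire_time, on_time

def solo_allreduce(grads, arrival):
    # Fires when the first process is ready.
    return partial_allreduce(grads, arrival, 1)

def majority_allreduce(grads, arrival):
    # Fires when at least half of the processes are ready: ceil(n / 2).
    return partial_allreduce(grads, arrival, (len(grads) + 1) // 2)

# Hypothetical toy step: 4 processes, 1-D gradients, uneven arrival times.
grads = [[4.0], [8.0], [12.0], [16.0]]
arrival = [1.0, 5.0, 2.0, 9.0]
solo = solo_allreduce(grads, arrival)
majority = majority_allreduce(grads, arrival)
synchronous = partial_allreduce(grads, arrival, len(grads))
```

In this toy step the synchronous collective waits until time 9.0, majority fires at 2.0 with two contributors, and solo fires at 1.0 with one, which is the trade between waiting time and gradient completeness that the abstract describes.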

Mix & Match: training convnets with mixed image sizes for improved accuracy, speed and scale resiliency

Aug 12, 2019

Elad Hoffer, Berry Weinstein, Itay Hubara, Tal Ben-Nun, Torsten Hoefler, Daniel Soudry

Abstract: Convolutional neural networks (CNNs) are commonly trained using a fixed spatial image size predetermined for a given model. Although trained on images of a specific size, it is well established that CNNs can be used to evaluate a wide range of image sizes at test time, by adjusting the size of intermediate feature maps. In this work, we describe and evaluate a novel mixed-size training regime that mixes several image sizes at training time. We demonstrate that models trained using our method are more resilient to image size changes and generalize well even on small images. This allows faster inference by using smaller images at test time. For instance, we obtain a 76.43% top-1 accuracy using ResNet50 with an image size of 160, which matches the accuracy of the baseline model with 2x fewer computations. Furthermore, for a given image size used at test time, we show this method can be exploited either to accelerate training or to improve the final test accuracy. For example, we are able to reach a 79.27% accuracy with a model evaluated at a 288 spatial size for a relative improvement of 14% over the baseline.
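
A mixed-size regime of the kind described above could be scheduled roughly as follows. This is a hedged sketch: the size set, the sampling weights, and the rule that scales the batch size to keep the pixels per step roughly constant are all assumptions for illustration, not values taken from the paper.

```python
import random

def make_size_sampler(sizes, weights, seed=0):
    """Return a function that draws one image size per training step."""
    rng = random.Random(seed)

    def sample():
        return rng.choices(sizes, weights=weights, k=1)[0]

    return sample

def batch_size_for(size, base_size=224, base_batch=256):
    """Scale the batch so pixels per step stay roughly constant (assumption)."""
    return max(1, round(base_batch * (base_size / size) ** 2))

# Hypothetical size set and weights.
sizes = [128, 160, 224, 288]
sample_size = make_size_sampler(sizes, [0.2, 0.3, 0.4, 0.1])

# One (image size, batch size) pair per step; a real loop would resize the
# input batch to `size` before the forward pass.
schedule = [(s, batch_size_for(s)) for s in (sample_size() for _ in range(5))]
```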

A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning

Jan 29, 2019

Tal Ben-Nun, Maciej Besta, Simon Huber, Alexandros Nikolaos Ziogas, Daniel Peter, Torsten Hoefler

Abstract: We introduce Deep500: the first customizable benchmarking infrastructure that enables fair comparison of the plethora of deep learning frameworks, algorithms, libraries, and techniques. The key idea behind Deep500 is its modular design, where deep learning is factorized into four distinct levels: operators, network processing, training, and distributed training. Our evaluation illustrates that Deep500 is customizable (enables combining and benchmarking different deep learning codes) and fair (uses carefully selected metrics). Moreover, Deep500 is fast (incurs negligible overheads), verifiable (offers infrastructure to analyze correctness), and reproducible. Finally, as the first distributed and reproducible benchmarking system for deep learning, Deep500 provides software infrastructure to utilize the most powerful supercomputers for extreme-scale workloads.

* Accepted to IPDPS 2019

Augment your batch: better training with larger batches

Jan 27, 2019

Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, Daniel Soudry

Abstract: Large-batch SGD is important for scaling training of deep neural networks. However, without fine-tuning hyperparameter schedules, the generalization of the model may be hampered. We propose to use batch augmentation: replicating instances of samples within the same batch with different data augmentations. Batch augmentation acts as a regularizer and an accelerator, increasing both generalization and performance scaling. We analyze the effect of batch augmentation on gradient variance and show that it empirically improves convergence for a wide variety of deep neural networks and datasets. Our results show that batch augmentation reduces the number of necessary SGD updates to achieve the same accuracy as the state-of-the-art. Overall, this simple yet effective method enables faster training and better generalization by allowing more computational resources to be used concurrently.
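
Batch augmentation as described above, replicating each sample within the same batch under different augmentations, can be sketched in a few lines. The helper names and the toy 1-D "images" are hypothetical, and the flip-and-shift transform merely stands in for whatever augmentation a real pipeline uses.

```python
import random

def random_flip_and_shift(sample, rng):
    """Toy augmentation on a 1-D 'image': optional flip plus a cyclic shift."""
    out = list(reversed(sample)) if rng.random() < 0.5 else list(sample)
    k = rng.randrange(len(out))
    return out[k:] + out[:k]

def augment_batch(batch, labels, copies, augment, seed=0):
    """Replace each sample with `copies` independently augmented instances.

    The resulting batch is `copies` times larger; every replica keeps the
    label of the sample it was derived from.
    """
    rng = random.Random(seed)
    out_x, out_y = [], []
    for x, y in zip(batch, labels):
        for _ in range(copies):
            out_x.append(augment(x, rng))
            out_y.append(y)
    return out_x, out_y

batch = [[1, 2, 3, 4], [5, 6, 7, 8]]
labels = [0, 1]
big_x, big_y = augment_batch(batch, labels, copies=3,
                             augment=random_flip_and_shift)
```

The number of SGD updates per epoch is unchanged, while each update processes `copies` times more data, which is where the additional concurrency comes from.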

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

Sep 15, 2018

Tal Ben-Nun, Torsten Hoefler

Abstract: Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.

Neural Code Comprehension: A Learnable Representation of Code Semantics

Jul 31, 2018

Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler

Abstract: With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program semantics robustly, due to structural features such as function calls, branching, and interchangeable order of statements. In this paper, we propose a novel processing technique to learn code semantics, and apply it to a variety of program analysis tasks. In particular, we stipulate that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, we define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language. We provide a novel definition of contextual flow for this IR, leveraging both the underlying data- and control-flow of the program. We then analyze the embeddings qualitatively using analogies and clustering, and evaluate the learned representation on three different high-level tasks. We show that with a single RNN architecture and pre-trained fixed embeddings, inst2vec outperforms specialized approaches for performance prediction (compute device mapping, optimal thread coarsening); and algorithm classification from raw code (104 classes), where we set a new state-of-the-art.
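
The idea of taking context from program flow rather than from textual neighbors can be sketched as follows. This is an illustrative assumption about how training pairs for a skip-gram style embedding might be drawn from a flow graph; the function name, the undirected treatment of edges, and the simplified IR-like statements are all hypothetical and not the paper's exact definition of contextual flow.

```python
from collections import defaultdict

def context_pairs(edges, context_size=1):
    """Return (statement, context) pairs within `context_size` hops.

    `edges` lists data- or control-flow dependencies between statements.
    Edge direction is ignored here (an assumption for this sketch).
    """
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    pairs = set()
    for start in list(neighbors):
        frontier, seen = {start}, {start}
        # Breadth-first expansion, one hop per iteration.
        for _ in range(context_size):
            frontier = {n for f in frontier for n in neighbors[f]} - seen
            seen |= frontier
        for other in seen - {start}:
            pairs.add((start, other))
    return sorted(pairs)

# Hypothetical, simplified IR statements with their data-flow edges.
edges = [
    ("%a = load", "%c = add %a, %b"),
    ("%b = load", "%c = add %a, %b"),
    ("%c = add %a, %b", "store %c"),
]
pairs_1 = context_pairs(edges, context_size=1)
pairs_2 = context_pairs(edges, context_size=2)
```

With a context size of 1, the two loads are each paired with the add that consumes them but not with each other, regardless of where they appear in the source text.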

μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching

Apr 13, 2018

Yosuke Oyama, Tal Ben-Nun, Torsten Hoefler, Satoshi Matsuoka

Abstract: NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used in deep learning. Specifically, cuDNN implements several equivalent convolution algorithms, whose performance and memory footprint may vary considerably, depending on the layer dimensions. When an algorithm is automatically selected by cuDNN, the decision is made on a per-layer basis, and thus it often resorts to slower algorithms that fit the workspace size constraints. We present μ-cuDNN, a transparent wrapper library for cuDNN, which divides layers' mini-batch computation into several micro-batches. Based on Dynamic Programming and Integer Linear Programming, μ-cuDNN enables faster algorithms by decreasing the workspace requirements. At the same time, μ-cuDNN keeps the computational semantics unchanged, so that it decouples statistical efficiency from the hardware efficiency safely. We demonstrate the effectiveness of μ-cuDNN over two frameworks, Caffe and TensorFlow, achieving speedups of 1.63x for AlexNet and 1.21x for ResNet-18 on a P100-SXM2 GPU. These results indicate that using micro-batches can seamlessly increase the performance of deep learning, while maintaining the same memory footprint.

* 11 pages, 14 figures. Part of the content has been published in IPSJ SIG Technical Report, Vol. 2017-HPC-162, No. 22, pp. 1-9, 2017. (DOI: http://id.nii.ac.jp/1001/00184814)
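
The dynamic-programming side of the micro-batching idea can be sketched for a single layer. This is a simplified illustration, not the library's code: `best_split` is a hypothetical name, the benchmark table holds made-up timings and workspace sizes instead of real cuDNN measurements, and the Integer Linear Programming variant mentioned in the abstract is not shown.

```python
def best_split(batch, benchmarks, workspace_limit):
    """Split a mini-batch into micro-batches that minimize total time.

    benchmarks[b] is a list of (algo_name, time_ms, workspace) options for
    micro-batch size b. Returns (total_time, [(size, algo_name), ...]).
    """
    INF = float("inf")
    # Fastest algorithm per micro-batch size that fits the workspace limit.
    fastest = {}
    for b, options in benchmarks.items():
        fitting = [(t, name) for name, t, ws in options if ws <= workspace_limit]
        if fitting:
            fastest[b] = min(fitting)
    # cost[n]: minimal time to process n samples; choice[n]: last micro-batch.
    cost = [0.0] + [INF] * batch
    choice = [None] * (batch + 1)
    for n in range(1, batch + 1):
        for b, (t, name) in fastest.items():
            if b <= n and cost[n - b] + t < cost[n]:
                cost[n] = cost[n - b] + t
                choice[n] = (b, name)
    if cost[batch] == INF:
        raise ValueError("no feasible split for this batch size")
    plan, n = [], batch
    while n > 0:
        b, name = choice[n]
        plan.append((b, name))
        n -= b
    return cost[batch], plan

# Hypothetical measurements: the FFT-like algorithm is faster but needs
# workspace that grows with the micro-batch size.
benchmarks = {
    8: [("gemm", 4.0, 0), ("fft", 2.0, 64)],
    16: [("gemm", 8.0, 0), ("fft", 3.5, 128)],
    32: [("gemm", 16.0, 0), ("fft", 6.0, 256)],
}
tight = best_split(32, benchmarks, workspace_limit=128)
roomy = best_split(32, benchmarks, workspace_limit=256)
```

With the tight limit, the undivided batch of 32 would have to fall back to the slower algorithm (16.0 ms in this toy table), whereas two micro-batches of 16 can each use the faster one within the same workspace.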
