Natalia Gimelshein

NVIDIA

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Dec 03, 2019
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala

Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it provides an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs. In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance. We demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several common benchmarks.

* 12 pages, 3 figures, NeurIPS 2019 
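
The imperative, define-by-run style the abstract describes is easy to see in a short snippet. A minimal sketch (the model and tensor names are illustrative, not taken from the paper):

    import torch

    # A model is just a Python function over tensors; because execution is
    # eager, printing, data-dependent control flow, and pdb all work as usual.
    def model(x, w):
        h = x @ w                # ordinary Python operators dispatch to fast kernels
        if h.norm() > 1.0:       # data-dependent branch, no graph compiler involved
            h = h / h.norm()
        return h.relu()

    x = torch.randn(4, 8)
    w = torch.randn(8, 2, requires_grad=True)
    loss = model(x, w).sum()
    loss.backward()              # autograd differentiates the ops that actually ran
    print(w.grad.shape)          # torch.Size([8, 2])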

Online normalizer calculation for softmax

Jul 28, 2018
Maxim Milakov, Natalia Gimelshein

The Softmax function is ubiquitous in machine learning, and multiple previous works have suggested faster alternatives for it. In this paper we propose a way to compute the classical Softmax with fewer memory accesses, and hypothesize that this reduction in memory accesses should improve Softmax performance on actual hardware. The benchmarks confirm this hypothesis: Softmax accelerates by up to 1.3x, and a combined, fused Softmax+TopK by up to 5x.

* 1) Added link to the benchmark code, 2) Benchmarked Safe Softmax + Top-K fused and attributed part of 5x explicitly to fusion in sections 5.2 and 6, 3) Stylistic changes, 4) Minor clarifications 
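
The single-pass recurrence behind this result can be written down compactly: a running maximum m and a running normalizer d are updated together, with d rescaled whenever m grows. A minimal NumPy rendition of the online normalizer calculation (the classical safe Softmax would instead spend one pass on the maximum and a second on the normalizer):

    import numpy as np

    def online_softmax(x):
        # Single pass: maintain the running max m and the running
        # normalizer d = sum_j exp(x_j - m), rescaling d when m grows.
        m, d = -np.inf, 0.0
        for v in x:
            m_new = max(m, v)
            d = d * np.exp(m - m_new) + np.exp(v - m_new)
            m = m_new
        # One more pass turns the normalizer into probabilities.
        return np.exp(x - m) / d

    x = np.random.randn(16)
    reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
    assert np.allclose(online_softmax(x), reference)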

vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design

Jul 28, 2016
Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, Stephen W. Keckler

The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously be utilized for training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a significant reduction in the memory requirements of DNNs. Similar experiments on VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the memory efficiency of our proposal. vDNN enables VGG-16 with batch size 256 (requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card containing 12 GB of memory, with an 18% performance loss compared to a hypothetical, oracular GPU with enough memory to hold the entire DNN.

* Published as a conference paper at the 49th IEEE/ACM International Symposium on Microarchitecture (MICRO-49), 2016 
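
The core mechanism, offloading a layer's activations to host memory after its forward pass and fetching them back for its backward pass, can be illustrated with modern PyTorch's saved-tensors hooks. This is a hypothetical sketch of the idea (it assumes a CUDA-capable GPU), not the paper's cuDNN-era implementation, which additionally overlaps the transfers with computation on separate CUDA streams:

    import torch
    import torch.nn as nn

    # Park every tensor saved for backward in CPU memory, and copy it
    # back to the GPU only when the backward pass actually needs it.
    def pack_to_cpu(t):
        return t.to("cpu")

    def unpack_to_gpu(t):
        return t.to("cuda")

    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
    x = torch.randn(64, 512, device="cuda")

    with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_gpu):
        loss = model(x).sum()    # activations are offloaded as they are saved
    loss.backward()              # each activation is fetched back on demand

The trade-off is exactly the one the abstract quantifies: the GPU memory footprint drops sharply, at the cost of extra traffic over the CPU-GPU interconnect.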