Gu-Yeon Wei

A binary-activation, multi-level weight RNN and training algorithm for processing-in-memory inference with eNVM

Dec 03, 2019
Siming Ma, David Brooks, Gu-Yeon Wei

We present a new algorithm for training neural networks with binary activations and multi-level weights, which enables efficient processing-in-memory (PIM) inference circuits built with embedded non-volatile memory (eNVM). Binary activations obviate costly DACs and ADCs, and multi-level weights leverage multi-level eNVM cells. Compared with previous quantization algorithms, our method not only works for feed-forward networks, including fully-connected and convolutional architectures, but also achieves higher accuracy and noise resilience for recurrent networks. In particular, we present an RNN trigger-word detection PIM accelerator whose modeling results demonstrate high performance when trained with our new algorithm.
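
The abstract above does not spell out the training recipe, so the following is only a minimal sketch of the general pattern it describes: quantization-aware training with binary activations and multi-level weights. The sign-based binarization, the 2-bit uniform weight quantizer, and the straight-through estimator below are assumptions for illustration, not the paper's exact algorithm.

```python
# Illustrative sketch of quantization-aware training with binary activations
# and multi-level (here, 2-bit) weights. The straight-through estimator (STE)
# and the specific quantizers are assumptions, not the paper's exact method.
import torch
import torch.nn as nn


class BinarizeSTE(torch.autograd.Function):
    """Forward: sign(x) in {-1, +1}. Backward: pass gradients through
    where |x| <= 1 (straight-through estimator)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()


def quantize_weights(w, levels=4):
    """Uniformly quantize weights to `levels` values in [-1, 1], mirroring
    multi-level eNVM cells (2-bit cells assumed here)."""
    w_clipped = w.clamp(-1, 1)
    step = 2.0 / (levels - 1)
    w_q = torch.round((w_clipped + 1) / step) * step - 1
    # STE for the rounding: gradients flow to the full-precision weights.
    return w_clipped + (w_q - w_clipped).detach()


class BinActLinear(nn.Module):
    """Fully-connected layer with binary input activations and
    multi-level quantized weights."""

    def __init__(self, in_features, out_features, levels=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.levels = levels

    def forward(self, x):
        x_bin = BinarizeSTE.apply(x)                      # binary activations
        w_q = quantize_weights(self.weight, self.levels)  # multi-level weights
        return x_bin @ w_q.t()


if __name__ == "__main__":
    layer = BinActLinear(16, 8)
    out = layer(torch.randn(4, 16))
    out.sum().backward()   # gradients reach the full-precision shadow weights
    print(out.shape, layer.weight.grad.shape)
```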

MLPerf Training Benchmark

Oct 30, 2019
Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debojyoti Dutta, Udit Gupta, Kim Hazelwood, Andrew Hock, Xinyuan Huang, Bill Jia, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Guokai Ma, Deepak Narayanan, Tayo Oguntebi, Gennady Pekhimenko, Lillian Pentecost, Vijay Janapa Reddi, Taylor Robie, Tom St. John, Carole-Jean Wu, Lingjie Xu, Cliff Young, Matei Zaharia

Machine learning is experiencing an explosion of software and hardware solutions, and needs industry-standard performance benchmarks to drive design and enable competitive evaluation. However, machine learning training presents a number of unique challenges to benchmarking that do not exist in other domains: (1) some optimizations that improve training throughput actually increase time to solution, (2) training is stochastic and time to solution has high variance, and (3) the software and hardware systems are so diverse that they cannot be fairly benchmarked with the same binary, code, or even hyperparameters. We present MLPerf, a machine learning benchmark that overcomes these challenges. We quantitatively evaluate the efficacy of MLPerf in driving community progress on performance and scalability across two rounds of results from multiple vendors.
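
The metric implied by these challenges is time-to-solution rather than raw throughput: train until a target quality is reached and aggregate over several runs to absorb run-to-run variance. The sketch below illustrates that idea generically; the quality target, run count, and use of the median are illustrative assumptions, not MLPerf's actual rules.

```python
# Illustrative time-to-solution measurement: train until a quality target is
# reached, repeat several times, and aggregate to tame run-to-run variance.
# The target, run count, and median aggregation are assumptions, not the
# MLPerf specification.
import random
import statistics
import time


def train_to_target(train_step, evaluate, target_quality, max_steps=100_000):
    """Run training steps until evaluate() reaches target_quality.
    Returns wall-clock seconds to reach the target."""
    start = time.perf_counter()
    for _ in range(max_steps):
        train_step()
        if evaluate() >= target_quality:
            return time.perf_counter() - start
    raise RuntimeError("target quality not reached")


def benchmark(make_workload, target_quality, runs=5):
    """Median time-to-solution over several runs; the median damps the
    variance that comes from stochastic training."""
    times = []
    for _ in range(runs):
        train_step, evaluate = make_workload()   # fresh state per run
        times.append(train_to_target(train_step, evaluate, target_quality))
    return statistics.median(times), times


if __name__ == "__main__":
    def make_workload():
        # Toy stand-in for a real model: "quality" improves noisily per step.
        state = {"quality": 0.0}

        def train_step():
            state["quality"] += random.uniform(0.0, 0.01)

        def evaluate():
            return state["quality"]

        return train_step, evaluate

    median_tts, all_tts = benchmark(make_workload, target_quality=0.75)
    print(f"median time-to-solution: {median_tts:.4f}s over {len(all_tts)} runs")
```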

AdaptivFloat: A Floating-point based Data Type for Resilient Deep Learning Inference

Oct 15, 2019
Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander Rush, David Brooks, Gu-Yeon Wei

Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low word sizes because their shrinking dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present AdaptivFloat, a floating-point-inspired number representation format for deep learning that dynamically maximizes and optimally clips its available dynamic range, at a layer granularity, in order to create faithful encodings of neural network parameters. AdaptivFloat consistently produces higher inference accuracies than block floating-point, uniform, IEEE-like float, or posit encodings at very low precision ($\leq$ 8-bit) across a diverse set of state-of-the-art neural network topologies. Notably, AdaptivFloat surpasses baseline FP32 performance by up to +0.3 in BLEU score and -0.75 in word error rate at weight bit widths of $\leq$ 8 bits. Experimental results on a deep neural network (DNN) hardware accelerator that exploits AdaptivFloat logic in its computational datapath demonstrate per-operation energy and area of 0.9$\times$ and 1.14$\times$, respectively, relative to equivalent-bit-width integer-based accelerator variants.
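
AdaptivFloat's key idea, as described above, is a float-like encoding whose exponent range is shifted per layer to match that layer's value distribution. The sketch below shows one simple way to realize that idea: derive the exponent bias from the tensor's maximum magnitude, then round each value to the nearest representable float-like value. The bit allocation, rounding, and zero-flushing rules here are assumptions, not the published format.

```python
# Illustrative per-layer "adaptive" float quantization: shift the exponent
# window so the largest representable value covers the tensor's max magnitude,
# then round each value to the nearest (sign, exponent, mantissa) combination.
# Bit allocation and rounding details are assumptions for illustration.
import numpy as np


def adaptive_float_quantize(x, n_bits=8, n_mantissa=3):
    """Quantize x to a float-like format: 1 sign bit, n_mantissa bits, and the
    remaining bits for the exponent, with a per-tensor exponent bias."""
    n_exp = n_bits - 1 - n_mantissa
    # Choose the exponent window so the format covers max|x|.
    max_abs = np.max(np.abs(x))
    exp_max = int(np.floor(np.log2(max_abs))) if max_abs > 0 else 0
    exp_min = exp_max - (2 ** n_exp - 1) + 1

    sign = np.sign(x)
    mag = np.abs(x)
    # Per-element exponent, clipped into the representable window.
    e = np.floor(np.log2(np.maximum(mag, 2.0 ** exp_min)))
    e = np.clip(e, exp_min, exp_max)
    # Quantize the mantissa to n_mantissa fractional bits.
    scale = 2.0 ** (e - n_mantissa)
    q = np.round(mag / scale) * scale
    # Clip to the largest representable magnitude.
    q = np.minimum(q, (2 - 2.0 ** -n_mantissa) * 2.0 ** exp_max)
    # Flush values far below the smallest representable magnitude to zero.
    q[mag < 2.0 ** exp_min / 2] = 0.0
    return sign * q


if __name__ == "__main__":
    w = np.random.randn(4, 4) * 0.05
    w_q = adaptive_float_quantize(w, n_bits=8, n_mantissa=3)
    print("max quantization error:", np.max(np.abs(w - w_q)))
```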

* 10 pages 

Benchmarking TPU, GPU, and CPU Platforms for Deep Learning

Aug 06, 2019
Yu Emma Wang, Gu-Yeon Wei, David Brooks

Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we benchmark Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. We take a deep dive into TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models. Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.
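
ParaDnn's premise, generating end-to-end models across a swept parameter space instead of benchmarking a handful of fixed networks, can be sketched briefly. The generator below is a hypothetical PyTorch illustration of that idea (the fully-connected case only, with assumed depth/width/batch ranges), not the actual ParaDnn code.

```python
# Hypothetical sketch of a parameterized fully-connected model generator in
# the spirit of ParaDnn: sweep depth, width, and batch size to cover a space
# of end-to-end models. This is not the actual ParaDnn code.
import itertools
import torch
import torch.nn as nn


def make_fc_model(input_dim, num_layers, width, num_classes=10):
    """Build an end-to-end fully-connected classifier of given depth/width."""
    dims = [input_dim] + [width] * num_layers
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU()]
    layers.append(nn.Linear(dims[-1], num_classes))
    return nn.Sequential(*layers)


def sweep(input_dim=784):
    """Yield (config, model, example batch) across an assumed parameter space."""
    depths = [4, 8, 16]
    widths = [256, 1024, 2048]
    batch_sizes = [64, 512]
    for num_layers, width, batch in itertools.product(depths, widths, batch_sizes):
        model = make_fc_model(input_dim, num_layers, width)
        x = torch.randn(batch, input_dim)
        yield {"layers": num_layers, "width": width, "batch": batch}, model, x


if __name__ == "__main__":
    for cfg, model, x in sweep():
        params = sum(p.numel() for p in model.parameters())
        print(cfg, f"{params / 1e6:.1f}M params, output {tuple(model(x).shape)}")
```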

Learning Low-Rank Approximation for CNNs

May 24, 2019
Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Gu-Yeon Wei

Low-rank approximation is an effective model compression technique that reduces not only parameter storage requirements but also computation. For convolutional neural networks (CNNs), however, well-known low-rank approximation methods, such as Tucker or CP decomposition, result in degraded model accuracy because decomposed layers hinder training convergence. In this paper, we propose a new training technique that finds a flat minimum with respect to low-rank approximation without introducing a decomposed structure during training. Because the original model structure is preserved, 2-dimensional low-rank approximation, which requires lowering (such as im2col), remains available in our proposed scheme. We show that CNN models can be compressed by low-rank approximation with a much higher compression ratio than conventional training methods allow, while maintaining or even enhancing model accuracy. We also discuss various 2-dimensional low-rank approximation techniques for CNNs.
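
The compression step referred to here, 2-dimensional low-rank approximation of a convolution kernel after lowering, is easy to illustrate. The sketch below matricizes a 4-D kernel the same way im2col lowering does, truncates its SVD, and reports the parameter savings; the reshape convention and the fixed rank are illustrative assumptions, and the paper's contribution is the training method that makes such truncation accuracy-preserving, not the SVD itself.

```python
# Illustrative 2-D low-rank approximation of a convolution kernel: reshape the
# 4-D kernel to a (C_out, C_in*K*K) matrix (the same matricization im2col
# lowering uses), truncate its SVD, and measure the parameter savings.
import numpy as np


def low_rank_conv_weight(weight, rank):
    """weight: (C_out, C_in, K, K). Returns the two factors and the
    rank-`rank` reconstruction of the kernel."""
    c_out, c_in, kh, kw = weight.shape
    w2d = weight.reshape(c_out, c_in * kh * kw)
    u, s, vt = np.linalg.svd(w2d, full_matrices=False)
    a = u[:, :rank] * s[:rank]          # (C_out, rank)
    b = vt[:rank, :]                    # (rank, C_in*K*K)
    w_approx = (a @ b).reshape(c_out, c_in, kh, kw)
    return a, b, w_approx


if __name__ == "__main__":
    w = np.random.randn(128, 64, 3, 3)      # toy conv kernel
    a, b, w_hat = low_rank_conv_weight(w, rank=32)
    original = w.size
    compressed = a.size + b.size
    err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
    print(f"params: {original} -> {compressed} "
          f"({original / compressed:.1f}x), relative error {err:.3f}")
```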

Structured Compression by Unstructured Pruning for Sparse Quantized Neural Networks

May 24, 2019
Se Jung Kwon, Dongsoo Lee, Byeongwook Kim, Parichay Kapoor, Baeseong Park, Gu-Yeon Wei

Model compression techniques, such as pruning and quantization, are becoming increasingly important for reducing memory footprint and computation. Despite the reduction in model size, however, achieving performance gains on real devices remains challenging, mainly because of the irregular representations of sparse matrix formats. This paper proposes a new representation for encoding the weights of sparse quantized neural networks, specifically those reduced by fine-grained, unstructured pruning. The representation is a structured, regular format that can be efficiently decoded in parallel through XOR gates during inference. We demonstrate that various deep learning models can be compressed and represented in our proposed format with a fixed and high compression ratio. For example, for the fully-connected layers of AlexNet on the ImageNet dataset, we can represent the sparse weights with only 0.09 bits/weight at 1-bit quantization and a 91% pruning rate, with a fixed decoding rate and full memory bandwidth usage.
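
To put the 0.09 bits/weight figure in perspective, it helps to compare it with what a conventional sparse format would spend on the same layer. The back-of-the-envelope sketch below does that arithmetic for a hypothetical fully-connected layer; the layer shape and the CSR-style index widths are assumptions, not numbers from the paper.

```python
# Back-of-the-envelope storage comparison for a sparse, 1-bit-quantized
# fully-connected layer: a CSR-style layout pays per-nonzero index bits on top
# of the value bits, while a fixed-compression-ratio format (decoded through
# XOR logic, as described above) can approach the value bits alone.
# The layer shape and index widths below are illustrative assumptions.

def csr_bits_per_weight(rows, cols, density, value_bits=1,
                        col_index_bits=16, row_ptr_bits=32):
    """Bits per dense weight position for a CSR-like sparse layout."""
    nnz = rows * cols * density
    total_bits = nnz * (value_bits + col_index_bits) + (rows + 1) * row_ptr_bits
    return total_bits / (rows * cols)


def fixed_ratio_bits_per_weight(density, value_bits=1, overhead_bits=0.0):
    """Bits per dense weight position when only the surviving values (plus a
    small fixed overhead) need to be stored."""
    return density * value_bits + overhead_bits


if __name__ == "__main__":
    rows, cols = 4096, 4096          # hypothetical fully-connected layer
    density = 0.09                   # 91% pruning rate, as in the example above
    print(f"CSR-style:   {csr_bits_per_weight(rows, cols, density):.2f} bits/weight")
    print(f"fixed-ratio: {fixed_ratio_bits_per_weight(density):.2f} bits/weight "
          f"(cf. the 0.09 bits/weight reported above)")
```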

Network Pruning for Low-Rank Binary Indexing

May 14, 2019
Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Parichay Kapoor, Gu-Yeon Wei

Pruning is an efficient model compression technique for removing redundancy in the connectivity of deep neural networks (DNNs). Computations using sparse matrices obtained by pruning parameters, however, exhibit vastly different parallelism depending on the index representation scheme. As a result, fine-grained pruning has not gained much attention because its irregular index form leads to a large memory footprint and low parallelism for convolutions and matrix multiplications. In this paper, we propose a new network pruning technique that generates a low-rank binary index matrix to compress index data, while index decompression is performed by simple binary matrix multiplication. The proposed compression method finds a particular fine-grained pruning mask that can be decomposed into two binary matrices. We also propose a tile-based factorization technique that not only lowers memory requirements but also enhances the compression ratio. Various DNN models can be pruned with far fewer indexes than previous sparse matrix formats require, while maintaining the same pruning rate.
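
The decompression side of this scheme is straightforward to illustrate: if a pruning mask can be written as the Boolean product of two small binary factors, the dense mask is recovered on the fly by a binary matrix multiplication. The sketch below constructs such a mask by hand and decodes it; how a mask that admits this factorization is actually found during training is the paper's contribution and is not shown here. Shapes and the binary rank are assumptions.

```python
# Illustrative low-rank binary indexing: store two small binary factors A and
# B instead of a dense m-by-n pruning mask, and recover the mask with one
# binary matrix multiplication. The mask here is constructed (not learned) so
# that it has an exact rank-r Boolean factorization; shapes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 64, 4                                   # mask shape, binary rank

# Sparse binary factors (entries are 1 with probability 0.2).
A = (rng.random((m, r)) < 0.2).astype(np.uint8)       # stored: m*r bits
B = (rng.random((r, n)) < 0.2).astype(np.uint8)       # stored: r*n bits

# Decompression at inference: the Boolean product of the factors is the mask.
mask = (A.astype(np.int32) @ B.astype(np.int32) > 0)

# Apply the fine-grained mask to a toy dense weight matrix.
W = rng.normal(size=(m, n)).astype(np.float32)
W_pruned = W * mask

dense_index_bits = m * n                              # storing the mask itself
factored_bits = m * r + r * n                         # storing A and B instead
print(f"index storage: {dense_index_bits} bits -> {factored_bits} bits "
      f"({dense_index_bits / factored_bits:.1f}x smaller); "
      f"surviving weights: {mask.mean():.1%}")
```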
