Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hadi Esmaeilzadeh

Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation

Jan 23, 2020

Byung Hoon Ahn, Prannoy Pilligundla, Amir Yazdanbakhsh, Hadi Esmaeilzadeh

Figure 1 for Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation

Figure 2 for Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation

Figure 3 for Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation

Figure 4 for Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation

Abstract:Achieving faster execution with shorter compilation time can foster further diversity and innovation in neural networks. However, the current paradigm of executing neural networks either relies on hand-optimized libraries, traditional compilation heuristics, or very recently genetic algorithms and other stochastic methods. These methods suffer from frequent costly hardware measurements rendering them not only too time consuming but also suboptimal. As such, we devise a solution that can learn to quickly adapt to a previously unseen design space for code optimization, both accelerating the search and improving the output performance. This solution dubbed Chameleon leverages reinforcement learning whose solution takes fewer steps to converge, and develops an adaptive sampling algorithm that not only focuses on the costly samples (real hardware measurements) on representative points but also uses a domain-knowledge inspired logic to improve the samples itself. Experimentation with real hardware shows that Chameleon provides 4.45x speed up in optimization time over AutoTVM, while also improving inference time of the modern deep networks by 5.6%.

* Published as a conference paper at ICLR 2020. arXiv admin note: text overlap with arXiv:1905.12799

Via

Access Paper or Ask Questions

Divide and Conquer: Leveraging Intermediate Feature Representations for Quantized Training of Neural Networks

Jul 24, 2019

Ahmed T. Elthakeb, Prannoy Pilligundla, Hadi Esmaeilzadeh

Figure 1 for Divide and Conquer: Leveraging Intermediate Feature Representations for Quantized Training of Neural Networks

Figure 2 for Divide and Conquer: Leveraging Intermediate Feature Representations for Quantized Training of Neural Networks

Figure 3 for Divide and Conquer: Leveraging Intermediate Feature Representations for Quantized Training of Neural Networks

Figure 4 for Divide and Conquer: Leveraging Intermediate Feature Representations for Quantized Training of Neural Networks

Abstract:The deep layers of modern neural networks extract a rather rich set of features as an input propagates through the network. This paper sets out to harvest these rich intermediate representations for quantization with minimal accuracy loss while significantly reducing the memory footprint and compute intensity of the DNN. This paper utilizes knowledge distillation through teacher-student paradigm (Hinton et al., 2015) in a novel setting that exploits the feature extraction capability of DNNs for higher-accuracy quantization. As such, our algorithm logically divides a pretrained full-precision DNN to multiple sections, each of which exposes intermediate features to train a team of students independently in the quantized domain. This divide and conquer strategy, in fact, makes the training of each student section possible in isolation while all these independently trained sections are later stitched together to form the equivalent fully quantized network. Experiments on various DNNs (LeNet, ResNet-20, SVHN and VGG-11) show that, on average, this approach - called DCQ (Divide and Conquer Quantization) - achieves on average 9.7% accuracy improvement to a state-of-the-art quantized training technique, DoReFa (Zhou et al., 2016) for binary and ternary networks.

Via

Access Paper or Ask Questions

Reinforcement Learning and Adaptive Sampling for Optimized DNN Compilation

May 30, 2019

Byung Hoon Ahn, Prannoy Pilligundla, Hadi Esmaeilzadeh

Figure 1 for Reinforcement Learning and Adaptive Sampling for Optimized DNN Compilation

Figure 2 for Reinforcement Learning and Adaptive Sampling for Optimized DNN Compilation

Figure 3 for Reinforcement Learning and Adaptive Sampling for Optimized DNN Compilation

Figure 4 for Reinforcement Learning and Adaptive Sampling for Optimized DNN Compilation

Abstract:Achieving faster execution with shorter compilation time can enable further diversity and innovation in neural networks. However, the current paradigm of executing neural networks either relies on hand-optimized libraries, traditional compilation heuristics, or very recently, simulated annealing and genetic algorithms. Our work takes a unique approach by formulating compiler optimizations for neural networks as a reinforcement learning problem, whose solution takes fewer steps to converge. This solution, dubbed ReLeASE, comes with a sampling algorithm that leverages clustering to focus the costly samples (real hardware measurements) on representative points, subsuming an entire subspace. Our adaptive sampling not only reduces the number of samples, but also improves the quality of samples for better exploration in shorter time. As such, experimentation with real hardware shows that reinforcement learning with adaptive sampling provides 4.45x speed up in optimization time over AutoTVM, while also improving inference time of the modern deep networks by 5.6%. Further experiments also confirm that our adaptive sampling can even improve AutoTVM's simulated annealing by 4.00x.

Via

Access Paper or Ask Questions

Shredder: Learning Noise to Protect Privacy with Partial DNN Inference on the Edge

May 26, 2019

Fatemehsadat Mireshghallah, Mohammadkazem Taram, Prakash Ramrakhyani, Dean Tullsen, Hadi Esmaeilzadeh

Figure 1 for Shredder: Learning Noise to Protect Privacy with Partial DNN Inference on the Edge

Figure 2 for Shredder: Learning Noise to Protect Privacy with Partial DNN Inference on the Edge

Figure 3 for Shredder: Learning Noise to Protect Privacy with Partial DNN Inference on the Edge

Figure 4 for Shredder: Learning Noise to Protect Privacy with Partial DNN Inference on the Edge

Abstract:A wide variety of DNN applications increasingly rely on the cloud to perform their huge computation. This heavy trend toward cloud-hosted inference services raises serious privacy concerns. This model requires the sending of private and privileged data over the network to remote servers, exposing it to the service provider. Even if the provider is trusted, the data can still be vulnerable over communication channels or via side-channel attacks [1,2] at the provider. To that end, this paper aims to reduce the information content of the communicated data without compromising the cloud service's ability to provide a DNN inference with acceptably high accuracy. This paper presents an end-to-end framework, called Shredder, that, without altering the topology or the weights of a pre-trained network, learns an additive noise distribution that significantly reduces the information content of communicated data while maintaining the inference accuracy. Shredder learns the additive noise by casting it as a tensor of trainable parameters enabling us to devise a loss functions that strikes a balance between accuracy and information degradation. The loss function exposes a knob for a disciplined and controlled asymmetric trade-off between privacy and accuracy. While keeping the DNN intact, Shredder enables inference on noisy data without the need to update the model or the cloud. Experimentation with real-world DNNs shows that Shredder reduces the mutual information between the input and the communicated data to the cloud by 70.2% compared to the original execution while only sacrificing 1.46% loss in accuracy.

Via

Access Paper or Ask Questions

SinReQ: Generalized Sinusoidal Regularization for Automatic Low-Bitwidth Deep Quantized Training

May 04, 2019

Ahmed T. Elthakeb, Prannoy Pilligundla, Hadi Esmaeilzadeh

Figure 1 for SinReQ: Generalized Sinusoidal Regularization for Automatic Low-Bitwidth Deep Quantized Training

Figure 2 for SinReQ: Generalized Sinusoidal Regularization for Automatic Low-Bitwidth Deep Quantized Training

Figure 3 for SinReQ: Generalized Sinusoidal Regularization for Automatic Low-Bitwidth Deep Quantized Training

Figure 4 for SinReQ: Generalized Sinusoidal Regularization for Automatic Low-Bitwidth Deep Quantized Training

Abstract:Quantization of neural networks offers significant promise in reducing their compute and storage cost. Albeit alluring, without domain experts to come up with special handcrafted optimization techniques or ad-hoc manipulation of the original network architecture, deep quantization (below 8 bits) results in unrecoverable accuracy gap between the quantized model and the full-precision counterpart. We propose a novel sinusoidal regularization, dubbed SinReQ, for low precision deep quantized training. The proposed method is aimed at automatically yielding semi-quantized weights at pre-defined target bitwidths during conventional training. The proposed regularization is realized by adding a periodic function (sinusoidal regularizer) to the original objective function. We exploit the inherent periodicity with a desired convexity profile in sinusoidal functions to automatically propel weights towards target quantization levels during conventional training. Our method combines generality by providing the flexibility for arbitrary-bit quantization, and customization by optimizing different layer-wise regularizers simultaneously. Preliminary results for experiments on CIFAR10, SVHN show that integrating SinReQ within the training algorithm achieves 2.82%, and 2.11% accuracy improvements to DoReFa (Zhou et al., 2016), and WRPN (Mishra et al., 2018) methods respectively.

Via

Access Paper or Ask Questions

ReLeQ: An Automatic Reinforcement Learning Approach for Deep Quantization of Neural Networks

Dec 10, 2018

Amir Yazdanbakhsh, Ahmed T. Elthakeb, Prannoy Pilligundla, FatemehSadat Mireshghallah, Hadi Esmaeilzadeh

Figure 1 for ReLeQ: An Automatic Reinforcement Learning Approach for Deep Quantization of Neural Networks

Figure 2 for ReLeQ: An Automatic Reinforcement Learning Approach for Deep Quantization of Neural Networks

Figure 3 for ReLeQ: An Automatic Reinforcement Learning Approach for Deep Quantization of Neural Networks

Figure 4 for ReLeQ: An Automatic Reinforcement Learning Approach for Deep Quantization of Neural Networks

Abstract:Despite numerous state-of-the-art applications of Deep Neural Networks (DNNs) in a wide range of real-world tasks, two major challenges hinder further advances in DNNs: hyperparameter optimization and constrained power resources, which is a significant concern in embedded devices. DNNs become increasingly difficult to train and deploy as they grow in size due to both computational intensity and the large memory footprint. Recent efforts show that quantizing weights of deep neural networks to lower bitwidths takes a significant step toward mitigating the mentioned issues, by reducing memory bandwidth and using limited computational resources which is important for deploying DNN models to devices with limited resources. This paper builds upon the algorithmic insight that the bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. Deep quantization (quantizing bitwidths below eight) while maintaining accuracy, requires magnificent manual effort and hyper-parameter tuning as well as re-training. This paper tackles the aforementioned problems by designing an end to end framework, dubbed ReLeQ, to automate DNN quantization. We formulate DNN quantization as an optimization problem and use a state-of-the-art policy gradient based Reinforcement Learning (RL) algorithm, Proximal Policy Optimization (PPO) to efficiently explore the large design space of DNN quantization and solve the defined optimization problem. To show the effectiveness of ReLeQ, we evaluated it across several neural networks including MNIST, CIFAR10, SVHN. ReLeQ quantizes the weights of these networks to average bitwidths of 2.25, 5 and 4 respectively while maintaining the final accuracy loss below 0.3%.

Via

Access Paper or Ask Questions

In-RDBMS Hardware Acceleration of Advanced Analytics

Sep 18, 2018

Divya Mahajan, Joon Kyung Kim, Jacob Sacks, Adel Ardalan, Arun Kumar, Hadi Esmaeilzadeh

Figure 1 for In-RDBMS Hardware Acceleration of Advanced Analytics

Figure 2 for In-RDBMS Hardware Acceleration of Advanced Analytics

Figure 3 for In-RDBMS Hardware Acceleration of Advanced Analytics

Figure 4 for In-RDBMS Hardware Acceleration of Advanced Analytics

Abstract:The data revolution is fueled by advances in machine learning, databases, and hardware design. Programmable accelerators are making their way into each of these areas independently. As such, there is a void of solutions that enables hardware acceleration at the intersection of these disjoint fields. This paper sets out to be the initial step towards a unifying solution for in-Database Acceleration of Advanced Analytics (DAnA). Deploying specialized hardware, such as FPGAs, for in-database analytics currently requires hand-designing the hardware and manually routing the data. Instead, DAnA automatically maps a high-level specification of advanced analytics queries to an FPGA accelerator. The accelerator implementation is generated for a User Defined Function (UDF), expressed as a part of an SQL query using a Python-embedded Domain-Specific Language (DSL). To realize an efficient in-database integration, DAnA accelerators contain a novel hardware structure, Striders, that directly interface with the buffer pool of the database. Striders extract, cleanse, and process the training data tuples that are consumed by a multi-threaded FPGA engine that executes the analytics algorithm. We integrate DAnA with PostgreSQL to generate hardware accelerators for a range of real-world and synthetic datasets running diverse ML algorithms. Results show that DAnA-enhanced PostgreSQL provides, on average, 8.3x end-to-end speedup for real datasets, with a maximum of 28.2x. Moreover, DAnA-enhanced PostgreSQL is, on average, 4.0x faster than the multi-threaded Apache MADLib running on Greenplum. DAnA provides these benefits while hiding the complexity of hardware design from data scientists and allowing them to express the algorithm in =30-60 lines of Python.

* Divya Mahajan, Joon Kyung Kim, Jacob Sacks, Adel Ardalan, Arun Kumar, and Hadi Esmaeilzadeh. In-RDBMS Hardware Acceleration of Advanced Analytics. PVLDB, 11(11): 1317-1331, 2018

Via

Access Paper or Ask Questions

Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

May 30, 2018

Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Joon Kyung Kim, Vikas Chandra, Hadi Esmaeilzadeh

Figure 1 for Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

Figure 2 for Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

Figure 3 for Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

Figure 4 for Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

Abstract:Fully realizing the potential of acceleration for Deep Neural Networks (DNNs) requires understanding and leveraging algorithmic properties. This paper builds upon the algorithmic insight that bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent accuracy loss, the bitwidth varies significantly across DNNs and it may even be adjusted for each layer. Thus, a fixed-bitwidth accelerator would either offer limited benefits to accommodate the worst-case bitwidth requirements, or lead to a degradation in final accuracy. To alleviate these deficiencies, this work introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing Bit Fusion, a bit-flexible accelerator, that constitutes an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers. This flexibility in the architecture enables minimizing the computation and the communication at the finest granularity possible with no loss in accuracy. We evaluate the benefits of BitFusion using eight real-world feed-forward and recurrent DNNs. The proposed microarchitecture is implemented in Verilog and synthesized in 45 nm technology. Using the synthesis results and cycle accurate simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss and Stripes. In the same area, frequency, and process technology, BitFusion offers 3.9x speedup and 5.1x energy savings over Eyeriss. Compared to Stripes, BitFusion provides 2.6x speedup and 3.9x energy reduction at 45 nm node when BitFusion area and frequency are set to those of Stripes. Scaling to GPU technology node of 16 nm, BitFusion almost matches the performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while BitFusion merely consumes 895 milliwatts of power.

Via

Access Paper or Ask Questions

GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks

May 10, 2018

Amir Yazdanbakhsh, Hajar Falahati, Philip J. Wolfe, Kambiz Samadi, Nam Sung Kim, Hadi Esmaeilzadeh

Figure 1 for GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks

Figure 2 for GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks

Figure 3 for GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks

Figure 4 for GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks

Abstract:Generative Adversarial Networks (GANs) are one of the most recent deep learning models that generate synthetic data from limited genuine datasets. GANs are on the frontier as further extension of deep learning into many domains (e.g., medicine, robotics, content synthesis) requires massive sets of labeled data that is generally either unavailable or prohibitively costly to collect. Although GANs are gaining prominence in various fields, there are no accelerators for these new models. In fact, GANs leverage a new operator, called transposed convolution, that exposes unique challenges for hardware acceleration. This operator first inserts zeros within the multidimensional input, then convolves a kernel over this expanded array to add information to the embedded zeros. Even though there is a convolution stage in this operator, the inserted zeros lead to underutilization of the compute resources when a conventional convolution accelerator is employed. We propose the GANAX architecture to alleviate the sources of inefficiency associated with the acceleration of GANs using conventional convolution accelerators, making the first GAN accelerator design possible. We propose a reorganization of the output computations to allocate compute rows with similar patterns of zeros to adjacent processing engines, which also avoids inconsequential multiply-adds on the zeros. This compulsory adjacency reclaims data reuse across these neighboring processing engines, which had otherwise diminished due to the inserted zeros. The reordering breaks the full SIMD execution model, which is prominent in convolution accelerators. Therefore, we propose a unified MIMD-SIMD design for GANAX that leverages repeated patterns in the computation to create distinct microprograms that execute concurrently in SIMD mode.

* Proceedings of the 45th International Symposium on Computer Architecture (ISCA), 2018

Via

Access Paper or Ask Questions