Yujun Lin

Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications

Apr 25, 2022

Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang, Ligeng Zhu, Song Han

Abstract: Deep neural networks (DNNs) have achieved unprecedented success in the field of artificial intelligence (AI), including computer vision, natural language processing and speech recognition. However, their superior performance comes at the considerable cost of computational complexity, which greatly hinders their application on many resource-constrained devices, such as mobile phones and Internet of Things (IoT) devices. Therefore, methods and techniques that are able to lift the efficiency bottleneck while preserving the high accuracy of DNNs are in great demand in order to enable numerous edge AI applications. This paper provides an overview of efficient deep learning methods, systems and applications. We start by introducing popular model compression methods, including pruning, factorization, quantization as well as compact model design. To reduce the large design cost of these manual solutions, we discuss the AutoML framework for each of them, such as neural architecture search (NAS) and automated pruning and quantization. We then cover efficient on-device training to enable user customization based on the local data on mobile devices. Apart from general acceleration techniques, we also showcase several task-specific accelerations for point cloud, video and natural language processing by exploiting their spatial sparsity and temporal/token redundancy. Finally, to support all these algorithmic advancements, we introduce the efficient deep learning system design from both software and hardware perspectives.

* ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 27, Issue 3, Article 20, Page 1-50, 2021
* Journal preprint (ACM TODAES, 2021). The first seven authors contributed equally to this work and are listed in alphabetical order

TorchSparse: Efficient Point Cloud Inference Engine

Apr 21, 2022

Haotian Tang, Zhijian Liu, Xiuyu Li, Yujun Lin, Song Han

Abstract: Deep learning on point clouds has received increased attention thanks to its wide applications in AR/VR and autonomous driving. These applications require low latency and high accuracy to provide real-time user experience and ensure user safety. Unlike conventional dense workloads, the sparse and irregular nature of point clouds poses severe challenges to running sparse CNNs efficiently on general-purpose hardware. Furthermore, existing sparse acceleration techniques for 2D images do not translate to 3D point clouds. In this paper, we introduce TorchSparse, a high-performance point cloud inference engine that accelerates the sparse convolution computation on GPUs. TorchSparse directly optimizes the two bottlenecks of sparse convolution: irregular computation and data movement. It applies adaptive matrix multiplication grouping to trade computation for better regularity, achieving 1.4-1.5x speedup for matrix multiplication. It also optimizes the data movement by adopting vectorized, quantized and fused locality-aware memory access, reducing the memory movement cost by 2.7x. Evaluated on seven representative models across three benchmark datasets, TorchSparse achieves 1.6x and 1.5x measured end-to-end speedup over the state-of-the-art MinkowskiEngine and SpConv, respectively.

* MLSys 2022. The first three authors contributed equally to this work. Project page: https://torchsparse.mit.edu
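
The idea of trading extra computation for regularity can be illustrated with a small sketch. In sparse convolution, each kernel offset typically has a different number of input-output pairs, so the per-offset matrix multiplications have different sizes. The code below is only a greedy heuristic written for illustration, not TorchSparse's actual grouping algorithm: the function names, the `max_redundancy` threshold, and the workload sizes are all assumptions. It groups offsets with similar workloads so that each group could be padded to its largest member and run as one batched multiplication.

```python
def redundancy(sizes):
    """Fraction of padded rows that carry no real work when `sizes` share one batch."""
    padded = max(sizes) * len(sizes)
    return 0.0 if padded == 0 else (padded - sum(sizes)) / padded


def group_workloads(map_sizes, max_redundancy):
    """Greedily group kernel offsets whose workloads are similar in size.

    map_sizes[i] is the number of input-output pairs for kernel offset i
    (hypothetical input). Returns lists of offset indices; each list would be
    padded to its largest member and run as one batched matrix multiplication.
    """
    # Visit offsets from largest to smallest workload so neighbours are similar.
    order = sorted(range(len(map_sizes)), key=lambda i: map_sizes[i], reverse=True)
    groups, current = [], []
    for i in order:
        candidate = current + [i]
        if current and redundancy([map_sizes[j] for j in candidate]) > max_redundancy:
            groups.append(current)  # closing the group keeps padding waste bounded
            current = [i]
        else:
            current = candidate
    if current:
        groups.append(current)
    return groups


sizes = [100, 98, 50, 48, 10]       # made-up workload sizes per kernel offset
print(group_workloads(sizes, 0.1))  # [[0, 1], [2, 3], [4]]
```

A threshold of 0.0 degenerates to one multiplication per offset (no wasted work, least regular), while 1.0 puts every offset in a single padded batch (most regular, most wasted work).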

QuantumNAS: Noise-Adaptive Search for Robust Quantum Circuits

Aug 02, 2021

Hanrui Wang, Yongshan Ding, Jiaqi Gu, Yujun Lin, David Z. Pan, Frederic T. Chong, Song Han

Abstract: Quantum noise is the key challenge in Noisy Intermediate-Scale Quantum (NISQ) computers. Previous work on mitigating noise has primarily focused on gate-level or pulse-level noise-adaptive compilation. However, limited research efforts have explored a higher level of optimization by making the quantum circuits themselves resilient to noise. We propose QuantumNAS, a comprehensive framework for noise-adaptive co-search of the variational circuit and qubit mapping. Variational quantum circuits are a promising approach for constructing QML and quantum simulation. However, finding the best variational circuit and its optimal parameters is challenging due to the large design space and parameter training cost. We propose to decouple the circuit search and parameter training by introducing a novel SuperCircuit. The SuperCircuit is constructed with multiple layers of pre-defined parameterized gates and trained by iteratively sampling and updating its parameter subsets (SubCircuits). It provides an accurate estimate of the performance of SubCircuits trained from scratch. Then we perform an evolutionary co-search of the SubCircuit and its qubit mapping. The SubCircuit performance is estimated with parameters inherited from the SuperCircuit and simulated with real device noise models. Finally, we perform iterative gate pruning and finetuning to remove redundant gates. Extensively evaluated with 12 QML and VQE benchmarks on 10 quantum computers, QuantumNAS significantly outperforms baselines. For QML, QuantumNAS is the first to demonstrate over 95% 2-class, 85% 4-class, and 32% 10-class classification accuracy on real QC. It also achieves the lowest eigenvalue for VQE tasks on H2, H2O, LiH, CH4, BeH2 compared with UCCSD. We also open-source QuantumEngine (https://github.com/mit-han-lab/pytorch-quantum) for fast training of parameterized quantum circuits to facilitate future research.

* 14 pages, 22 figures. Code available at https://github.com/mit-han-lab/pytorch-quantum

NAAS: Neural Accelerator Architecture Search

May 27, 2021

Yujun Lin, Mengtian Yang, Song Han

Abstract: Data-driven, automatic design space exploration of neural accelerator architecture is desirable for specialization and productivity. Previous frameworks focus on sizing the numerical architectural hyper-parameters while neglecting to search the PE connectivities and compiler mappings. To tackle this challenge, we propose Neural Accelerator Architecture Search (NAAS), which holistically searches the neural network architecture, accelerator architecture, and compiler mapping in one optimization loop. NAAS composes highly matched architectures together with efficient mapping. As a data-driven approach, NAAS rivals the human-designed Eyeriss with a 4.4x EDP reduction and a 2.7% accuracy improvement on ImageNet under the same computation resource, and offers 1.4x to 3.5x EDP reduction compared with sizing only the architectural hyper-parameters.

* Accepted by DAC'21

Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution

Aug 13, 2020

Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, Song Han

Abstract: Self-driving cars need to understand 3D scenes efficiently and accurately in order to drive safely. Given the limited hardware resources, existing 3D perception models are not able to recognize small instances (e.g., pedestrians, cyclists) very well due to the low-resolution voxelization and aggressive downsampling. To this end, we propose Sparse Point-Voxel Convolution (SPVConv), a lightweight 3D module that equips the vanilla Sparse Convolution with a high-resolution point-based branch. With negligible overhead, this point-based branch is able to preserve the fine details even from large outdoor scenes. To explore the spectrum of efficient 3D models, we first define a flexible architecture design space based on SPVConv, and we then present 3D Neural Architecture Search (3D-NAS) to search the optimal network architecture over this diverse design space efficiently and effectively. Experimental results validate that the resulting SPVNAS model is fast and accurate: it outperforms the state-of-the-art MinkowskiNet by 3.3%, ranking 1st on the competitive SemanticKITTI leaderboard. It also achieves 8x computation reduction and 3x measured speedup over MinkowskiNet with higher accuracy. Finally, we transfer our method to 3D object detection, and it achieves consistent improvements over the one-stage detection baseline on KITTI.

* ECCV 2020. The first two authors contributed equally to this work. Project page: http://spvnas.mit.edu/

Hardware-Centric AutoML for Mixed-Precision Quantization

Aug 11, 2020

Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, Song Han

Abstract: Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emerging DNN hardware accelerators have begun to support mixed precision (1-8 bits) to further improve computation efficiency, which makes it a great challenge to find the optimal bitwidth for each layer: it requires domain experts to explore the vast design space, trading off among accuracy, latency, energy, and model size, which is both time-consuming and sub-optimal. Conventional quantization algorithms ignore the different hardware architectures and quantize all the layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy, and we take the hardware accelerator's feedback in the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback signals (latency and energy) to the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures. Our framework effectively reduces the latency by 1.4-1.95x and the energy consumption by 1.9x with negligible loss of accuracy compared with the fixed bitwidth (8 bits) quantization. Our framework reveals that the optimal policies on different hardware architectures (i.e., edge and cloud architectures) under different resource constraints (i.e., latency, energy, and model size) are drastically different. We interpret the implications of different quantization policies, which offer insights for both neural network architecture design and hardware architecture design.

* International Journal of Computer Vision (IJCV), 2020
* Journal preprint of arXiv:1811.08886 (IJCV, 2020). The first three authors contributed equally to this work. Project page: https://hanlab.mit.edu/projects/haq/
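
To make the notion of a per-layer quantization policy concrete, here is a minimal sketch. It does not reproduce HAQ's reinforcement learning search or its hardware simulator; it only shows what assigning a different bitwidth to each layer means for storage. The symmetric uniform quantizer, the layer names, the weights, and the two policies are all assumptions chosen for illustration, and the sketch requires at least 2 bits per weight.

```python
def quantize_uniform(weights, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits (bits >= 2)."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1  # e.g. 127 integer steps per sign for 8 bits
    scale = max_abs / levels      # real value represented by one integer step
    return [round(w / scale) * scale for w in weights]


def model_size_bits(layers, policy):
    """Total weight storage when layer `name` uses policy[name] bits per weight."""
    return sum(len(weights) * policy[name] for name, weights in layers.items())


def apply_policy(layers, policy):
    """Quantize every layer with its own bitwidth from the policy."""
    return {name: quantize_uniform(w, policy[name]) for name, w in layers.items()}


# Hypothetical two-layer model and two candidate policies.
layers = {"conv1": [0.5, -0.25, 0.1, 0.9], "fc": [0.3, -0.7, 0.2, 0.05, -0.4, 0.6]}
uniform_policy = {"conv1": 8, "fc": 8}
mixed_policy = {"conv1": 8, "fc": 4}
print(model_size_bits(layers, uniform_policy))  # 80
print(model_size_bits(layers, mixed_policy))    # 56
```

A search procedure would compare such policies on accuracy and on a cost signal; the abstract's point is that this signal should come from the target hardware rather than from a proxy like the bit count printed here.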

MCUNet: Tiny Deep Learning on IoT Devices

Jul 20, 2020

Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, Song Han

Abstract: Machine learning on tiny IoT devices based on microcontroller units (MCU) is appealing but challenging: the memory of microcontrollers is 2-3 orders of magnitude smaller than even that of mobile phones. We propose MCUNet, a framework that jointly designs the efficient neural architecture (TinyNAS) and the lightweight inference engine (TinyEngine), enabling ImageNet-scale inference on microcontrollers. TinyNAS adopts a two-stage neural architecture search approach that first optimizes the search space to fit the resource constraints, then specializes the network architecture in the optimized search space. TinyNAS can automatically handle diverse constraints (i.e., device, latency, energy, memory) under low search costs. TinyNAS is co-designed with TinyEngine, a memory-efficient inference library, to expand the design space and fit a larger model. TinyEngine adapts the memory scheduling according to the overall network topology rather than layer-wise optimization, reducing the memory usage by 2.7x, and accelerating the inference by 1.7-3.3x compared to TF-Lite Micro and CMSIS-NN. MCUNet is the first to achieve >70% ImageNet top-1 accuracy on an off-the-shelf commercial microcontroller, using 3.6x less SRAM and 6.6x less Flash compared to quantized MobileNetV2 and ResNet-18. On visual and audio wake words tasks, MCUNet achieves state-of-the-art accuracy and runs 2.4-3.4x faster than MobileNetV2 and ProxylessNAS-based solutions with 2.2-2.6x smaller peak SRAM. Our study suggests that the era of always-on tiny machine learning on IoT devices has arrived.

* Demo video available here: https://youtu.be/YvioBgtec4U

Lite Transformer with Long-Short Range Attention

Apr 24, 2020

Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, Song Han

Abstract: Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires an enormous amount of computation to achieve high performance, which makes it unsuitable for mobile applications that are tightly constrained by hardware resources and battery. In this paper, we present an efficient mobile NLP architecture, Lite Transformer, to facilitate deploying mobile NLP applications on edge devices. The key primitive is the Long-Short Range Attention (LSRA), where one group of heads specializes in the local context modeling (by convolution) while another group specializes in the long-distance relationship modeling (by attention). Such specialization brings consistent improvement over the vanilla transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling. Under constrained resources (500M/100M MACs), Lite Transformer outperforms transformer on WMT'14 English-French by 1.2/1.7 BLEU, respectively. Lite Transformer reduces the computation of transformer base model by 2.5x with 0.3 BLEU score degradation. Combining with pruning and quantization, we further compress the model size of Lite Transformer by 18.2x. For language modeling, Lite Transformer achieves 1.8 lower perplexity than the transformer at around 500M MACs. Notably, Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 BLEU in the mobile NLP setting without the costly architecture search that requires more than 250 GPU years. Code has been made available at https://github.com/mit-han-lab/lite-transformer.

* ICLR 2020. The first two authors contributed equally to this work
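
The two-branch structure described in the abstract can be sketched as follows. This is a deliberately simplified illustration, not the Lite Transformer implementation: it has no learned projections and a single attention head, it uses one fixed convolution kernel shared by all channels, and every function name and input value is an assumption. It only shows the split of channels into an attention branch for long-range context and a convolution branch for local context, followed by concatenation.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Unparameterized scaled dot-product attention (queries = keys = values = seq)."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, seq)) for c in range(d)])
    return out


def local_conv(seq, kernel):
    """1D convolution over time with zero padding; the same kernel for every channel."""
    half = len(kernel) // 2
    n, d = len(seq), len(seq[0])
    out = []
    for t in range(n):
        row = [0.0] * d
        for k, wk in enumerate(kernel):
            s = t + k - half
            if 0 <= s < n:  # positions outside the sequence count as zeros
                for c in range(d):
                    row[c] += wk * seq[s][c]
        out.append(row)
    return out


def lsra(seq, kernel):
    """Split channels: first half to attention (global), second half to conv (local)."""
    half = len(seq[0]) // 2
    global_part = self_attention([x[:half] for x in seq])
    local_part = local_conv([x[half:] for x in seq], kernel)
    return [g + l for g, l in zip(global_part, local_part)]


tokens = [[1.0, 0.0, 0.5, 0.2], [0.0, 1.0, 0.1, 0.4], [1.0, 1.0, 0.3, 0.3]]
mixed = lsra(tokens, [0.25, 0.5, 0.25])
print(len(mixed), len(mixed[0]))  # 3 4
```

Because each branch only sees half of the channels, the attention branch in this sketch costs less than attention over the full width would.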

Point-Voxel CNN for Efficient 3D Deep Learning

Jul 08, 2019

Zhijian Liu, Haotian Tang, Yujun Lin, Song Han

Abstract: We present Point-Voxel CNN (PVCNN) for efficient, fast 3D deep learning. Previous work processes 3D data using either voxel-based or point-based NN models. However, both approaches are computationally inefficient. The computation cost and memory footprints of the voxel-based models grow cubically with the input resolution, making it memory-prohibitive to scale up the resolution. As for point-based networks, up to 80% of the time is wasted on structuring the irregular data, which have rather poor memory locality, not on the actual feature extraction. In this paper, we propose PVCNN, which represents the 3D input data in points to reduce the memory consumption, while performing the convolutions in voxels to largely reduce the irregular data access and improve the locality. Our PVCNN model is both memory and computation efficient. Evaluated on semantic and part segmentation datasets, it achieves much higher accuracy than the voxel-based baseline with 10x GPU memory reduction; it also outperforms the state-of-the-art point-based models with 7x measured speedup on average. Remarkably, a narrower version of PVCNN achieves 2x speedup over PointNet (an extremely efficient model) on part and scene segmentation benchmarks with much higher accuracy. We validate the general effectiveness of our PVCNN on 3D object detection: by replacing the primitives in Frustum PointNet with PVConv, it outperforms Frustum PointNet++ by 2.4% mAP on average with 1.5x measured speedup and GPU memory reduction.

* The first two authors contributed equally to this work
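
The interplay between the point representation and the voxel grid can be sketched in a few lines. This is an illustration under stated assumptions, not the PVCNN code: coordinates are assumed to be already normalized to [0, 1), each point carries a single scalar feature, the convolution over the grid is omitted, and mapping voxel features back to points uses a plain nearest-voxel lookup as a simplification. All function names are hypothetical.

```python
from collections import defaultdict


def voxel_index(point, resolution):
    """Map a point with coordinates in [0, 1) to integer voxel coordinates."""
    return tuple(min(int(c * resolution), resolution - 1) for c in point)


def voxelize(points, features, resolution):
    """Average the scalar features of all points that fall into the same voxel."""
    sums, counts = defaultdict(float), defaultdict(int)
    for point, feature in zip(points, features):
        key = voxel_index(point, resolution)
        sums[key] += feature
        counts[key] += 1
    # Only occupied voxels are stored, so memory follows the number of points.
    return {key: sums[key] / counts[key] for key in sums}


def devoxelize(points, grid, resolution):
    """Give every point the feature of the voxel it falls into (nearest lookup)."""
    return [grid[voxel_index(point, resolution)] for point in points]


points = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.9, 0.9, 0.9)]
features = [1.0, 3.0, 5.0]
grid = voxelize(points, features, resolution=2)
print(grid[(0, 0, 0)], grid[(1, 1, 1)])            # 2.0 5.0
print(devoxelize(points, grid, resolution=2))      # [2.0, 2.0, 5.0]
```

In a full model, a convolution would run over the regular grid between the two steps; the regular layout is what gives the voxel branch its memory locality, while the points themselves stay in a compact list.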

Design Automation for Efficient Deep Learning Computing

Apr 24, 2019

Song Han, Han Cai, Ligeng Zhu, Ji Lin, Kuan Wang, Zhijian Liu, Yujun Lin

Abstract: Efficient deep learning computing requires algorithm and hardware co-design to enable specialization: we usually need to change the algorithm to reduce memory footprint and improve energy efficiency. However, the extra degree of freedom from the algorithm makes the design space much larger: it's not only about designing the hardware but also about how to tweak the algorithm to best fit the hardware. Human engineers can hardly exhaust the design space by heuristics. It's labor-consuming and sub-optimal. We propose design automation techniques for efficient neural networks. We investigate automatically designing specialized fast models, auto channel pruning, and auto mixed-precision quantization. We demonstrate that such learning-based, automated design achieves better performance and efficiency than rule-based human design. Moreover, we shorten the design cycle by 200x compared with previous work, so that we can afford to design specialized neural network models for different hardware platforms.
