Alert button
Picture for Jianlei Yang

Jianlei Yang

Alert button

Lossy and Lossless (L$^2$) Post-training Model Size Compression

Aug 08, 2023
Yumeng Shi, Shihao Bai, Xiuying Wei, Ruihao Gong, Jianlei Yang

Figure 1 for Lossy and Lossless (L$^2$) Post-training Model Size Compression
Figure 2 for Lossy and Lossless (L$^2$) Post-training Model Size Compression
Figure 3 for Lossy and Lossless (L$^2$) Post-training Model Size Compression
Figure 4 for Lossy and Lossless (L$^2$) Post-training Model Size Compression

Deep neural networks have delivered remarkable performance and have been widely used in various visual tasks. However, their huge size causes significant inconvenience for transmission and storage. Many previous studies have explored model size compression. However, these studies often approach various lossy and lossless compression methods in isolation, leading to challenges in achieving high compression ratios efficiently. This work proposes a post-training model size compression method that combines lossy and lossless compression in a unified way. We first propose a unified parametric weight transformation, which ensures different lossy compression methods can be performed jointly in a post-training manner. Then, a dedicated differentiable counter is introduced to guide the optimization of lossy compression to arrive at a more suitable point for later lossless compression. Additionally, our method can easily control a desired global compression ratio and allocate adaptive ratios for different layers. Finally, our method can achieve a stable $10\times$ compression ratio without sacrificing accuracy and a $20\times$ compression ratio with minor accuracy loss in a short time. Our code is available at https://github.com/ModelTC/L2_Compression .

Viaarxiv icon

Hardware-Aware Graph Neural Network Automated Design for Edge Computing Platforms

Mar 20, 2023
Ao Zhou, Jianlei Yang, Yingjie Qi, Yumeng Shi, Tong Qiao, Weisheng Zhao, Chunming Hu

Figure 1 for Hardware-Aware Graph Neural Network Automated Design for Edge Computing Platforms
Figure 2 for Hardware-Aware Graph Neural Network Automated Design for Edge Computing Platforms
Figure 3 for Hardware-Aware Graph Neural Network Automated Design for Edge Computing Platforms
Figure 4 for Hardware-Aware Graph Neural Network Automated Design for Edge Computing Platforms

Graph neural networks (GNNs) have emerged as a popular strategy for handling non-Euclidean data due to their state-of-the-art performance. However, most of the current GNN model designs mainly focus on task accuracy, lacking in considering hardware resources limitation and real-time requirements of edge application scenarios. Comprehensive profiling of typical GNN models indicates that their execution characteristics are significantly affected across different computing platforms, which demands hardware awareness for efficient GNN designs. In this work, HGNAS is proposed as the first Hardware-aware Graph Neural Architecture Search framework targeting resource constraint edge devices. By decoupling the GNN paradigm, HGNAS constructs a fine-grained design space and leverages an efficient multi-stage search strategy to explore optimal architectures within a few GPU hours. Moreover, HGNAS achieves hardware awareness during the GNN architecture design by leveraging a hardware performance predictor, which could balance the GNN model accuracy and efficiency corresponding to the characteristics of targeted devices. Experimental results show that HGNAS can achieve about $10.6\times$ speedup and $88.2\%$ peak memory reduction with a negligible accuracy loss compared to DGCNN on various edge devices, including Nvidia RTX3080, Jetson TX2, Intel i7-8700K and Raspberry Pi 3B+.

* Accepted by DAC'23 
Viaarxiv icon

Eventor: An Efficient Event-Based Monocular Multi-View Stereo Accelerator on FPGA Platform

Mar 29, 2022
Mingjun Li, Jianlei Yang, Yingjie Qi, Meng Dong, Yuhao Yang, Runze Liu, Weitao Pan, Bei Yu, Weisheng Zhao

Figure 1 for Eventor: An Efficient Event-Based Monocular Multi-View Stereo Accelerator on FPGA Platform
Figure 2 for Eventor: An Efficient Event-Based Monocular Multi-View Stereo Accelerator on FPGA Platform
Figure 3 for Eventor: An Efficient Event-Based Monocular Multi-View Stereo Accelerator on FPGA Platform
Figure 4 for Eventor: An Efficient Event-Based Monocular Multi-View Stereo Accelerator on FPGA Platform

Event cameras are bio-inspired vision sensors that asynchronously represent pixel-level brightness changes as event streams. Event-based monocular multi-view stereo (EMVS) is a technique that exploits the event streams to estimate semi-dense 3D structure with known trajectory. It is a critical task for event-based monocular SLAM. However, the required intensive computation workloads make it challenging for real-time deployment on embedded platforms. In this paper, Eventor is proposed as a fast and efficient EMVS accelerator by realizing the most critical and time-consuming stages including event back-projection and volumetric ray-counting on FPGA. Highly paralleled and fully pipelined processing elements are specially designed via FPGA and integrated with the embedded ARM as a heterogeneous system to improve the throughput and reduce the memory footprint. Meanwhile, the EMVS algorithm is reformulated to a more hardware-friendly manner by rescheduling, approximate computing and hybrid data quantization. Evaluation results on DAVIS dataset show that Eventor achieves up to $24\times$ improvement in energy efficiency compared with Intel i5 CPU platform.

* 6 pages, accepted by DAC 2022 
Viaarxiv icon

FedSkel: Efficient Federated Learning on Heterogeneous Systems with Skeleton Gradients Update

Aug 20, 2021
Junyu Luo, Jianlei Yang, Xucheng Ye, Xin Guo, Weisheng Zhao

Figure 1 for FedSkel: Efficient Federated Learning on Heterogeneous Systems with Skeleton Gradients Update
Figure 2 for FedSkel: Efficient Federated Learning on Heterogeneous Systems with Skeleton Gradients Update
Figure 3 for FedSkel: Efficient Federated Learning on Heterogeneous Systems with Skeleton Gradients Update
Figure 4 for FedSkel: Efficient Federated Learning on Heterogeneous Systems with Skeleton Gradients Update

Federated learning aims to protect users' privacy while performing data analysis from different participants. However, it is challenging to guarantee the training efficiency on heterogeneous systems due to the various computational capabilities and communication bottlenecks. In this work, we propose FedSkel to enable computation-efficient and communication-efficient federated learning on edge devices by only updating the model's essential parts, named skeleton networks. FedSkel is evaluated on real edge devices with imbalanced datasets. Experimental results show that it could achieve up to 5.52$\times$ speedups for CONV layers' back-propagation, 1.82$\times$ speedups for the whole training process, and reduce 64.8% communication cost, with negligible accuracy loss.

* CIKM 2021 
Viaarxiv icon

S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks

Jun 15, 2021
Jianlei Yang, Wenzhi Fu, Xingzhou Cheng, Xucheng Ye, Pengcheng Dai, Weisheng Zhao

Figure 1 for S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks
Figure 2 for S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks
Figure 3 for S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks
Figure 4 for S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks

Convolutional neural networks (CNNs) have achieved great success in performing cognitive tasks. However, execution of CNNs requires a large amount of computing resources and generates heavy memory traffic, which imposes a severe challenge on computing system design. Through optimizing parallel executions and data reuse in convolution, systolic architecture demonstrates great advantages in accelerating CNN computations. However, regular internal data transmission path in traditional systolic architecture prevents the systolic architecture from completely leveraging the benefits introduced by neural network sparsity. Deployment of fine-grained sparsity on the existing systolic architectures is greatly hindered by the incurred computational overheads. In this work, we propose S2Engine $-$ a novel systolic architecture that can fully exploit the sparsity in CNNs with maximized data reuse. S2Engine transmits compressed data internally and allows each processing element to dynamically select an aligned data from the compressed dataflow in convolution. Compared to the naive systolic array, S2Engine achieves about $3.2\times$ and about $3.0\times$ improvements on speed and energy efficiency, respectively.

* IEEE Transactions on Computers, 2021  
* 13 pages, 17 figures 
Viaarxiv icon

RoSearch: Search for Robust Student Architectures When Distilling Pre-trained Language Models

Jun 07, 2021
Xin Guo, Jianlei Yang, Haoyi Zhou, Xucheng Ye, Jianxin Li

Figure 1 for RoSearch: Search for Robust Student Architectures When Distilling Pre-trained Language Models
Figure 2 for RoSearch: Search for Robust Student Architectures When Distilling Pre-trained Language Models
Figure 3 for RoSearch: Search for Robust Student Architectures When Distilling Pre-trained Language Models
Figure 4 for RoSearch: Search for Robust Student Architectures When Distilling Pre-trained Language Models

Pre-trained language models achieve outstanding performance in NLP tasks. Various knowledge distillation methods have been proposed to reduce the heavy computation and storage requirements of pre-trained language models. However, from our observations, student models acquired by knowledge distillation suffer from adversarial attacks, which limits their usage in security sensitive scenarios. In order to overcome these security problems, RoSearch is proposed as a comprehensive framework to search the student models with better adversarial robustness when performing knowledge distillation. A directed acyclic graph based search space is built and an evolutionary search strategy is utilized to guide the searching approach. Each searched architecture is trained by knowledge distillation on pre-trained language model and then evaluated under a robustness-, accuracy- and efficiency-aware metric as environmental fitness. Experimental results show that RoSearch can improve robustness of student models from 7%~18% up to 45.8%~47.8% on different datasets with comparable weight compression ratio to existing distillation methods (4.6$\times$~6.5$\times$ improvement from teacher model BERT_BASE) and low accuracy drop. In addition, we summarize the relationship between student architecture and robustness through statistics of searched models.

* 10 pages, 9 figures 
Viaarxiv icon

Optimizing Memory Efficiency of Graph Neural Networks on Edge Computing Platforms

Apr 12, 2021
Ao Zhou, Jianlei Yang, Yeqi Gao, Tong Qiao, Yingjie Qi, Xiaoyi Wang, Yunli Chen, Pengcheng Dai, Weisheng Zhao, Chunming Hu

Figure 1 for Optimizing Memory Efficiency of Graph Neural Networks on Edge Computing Platforms
Figure 2 for Optimizing Memory Efficiency of Graph Neural Networks on Edge Computing Platforms
Figure 3 for Optimizing Memory Efficiency of Graph Neural Networks on Edge Computing Platforms
Figure 4 for Optimizing Memory Efficiency of Graph Neural Networks on Edge Computing Platforms

Graph neural networks (GNN) have achieved state-of-the-art performance on various industrial tasks. However, the poor efficiency of GNN inference and frequent Out-Of-Memory (OOM) problem limit the successful application of GNN on edge computing platforms. To tackle these problems, a feature decomposition approach is proposed for memory efficiency optimization of GNN inference. The proposed approach could achieve outstanding optimization on various GNN models, covering a wide range of datasets, which speeds up the inference by up to 3x. Furthermore, the proposed feature decomposition could significantly reduce the peak memory usage (up to 5x in memory efficiency improvement) and mitigate OOM problems during GNN inference.

* This paper has been accepted by RTAS 2021(brief industry track), with link to publicly available code 
Viaarxiv icon

SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training

Jul 21, 2020
Pengcheng Dai, Jianlei Yang, Xucheng Ye, Xingzhou Cheng, Junyu Luo, Linghao Song, Yiran Chen, Weisheng Zhao

Figure 1 for SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training
Figure 2 for SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training
Figure 3 for SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training
Figure 4 for SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training

Training Convolutional Neural Networks (CNNs) usually requires a large number of computational resources. In this paper, \textit{SparseTrain} is proposed to accelerate CNN training by fully exploiting the sparsity. It mainly involves three levels of innovations: activation gradients pruning algorithm, sparse training dataflow, and accelerator architecture. By applying a stochastic pruning algorithm on each layer, the sparsity of back-propagation gradients can be increased dramatically without degrading training accuracy and convergence rate. Moreover, to utilize both \textit{natural sparsity} (resulted from ReLU or Pooling layers) and \textit{artificial sparsity} (brought by pruning algorithm), a sparse-aware architecture is proposed for training acceleration. This architecture supports forward and back-propagation of CNN by adopting 1-Dimensional convolution dataflow. We have built %a simple compiler to map CNNs topology onto \textit{SparseTrain}, and a cycle-accurate architecture simulator to evaluate the performance and efficiency based on the synthesized design with $14nm$ FinFET technologies. Evaluation results on AlexNet/ResNet show that \textit{SparseTrain} could achieve about $2.7 \times$ speedup and $2.2 \times$ energy efficiency improvement on average compared with the original training process.

* published on DAC 2020 
Viaarxiv icon

TIPRDC: Task-Independent Privacy-Respecting Data Crowdsourcing Framework with Anonymized Intermediate Representations

Jun 12, 2020
Ang Li, Yixiao Duan, Huanrui Yang, Yiran Chen, Jianlei Yang

Figure 1 for TIPRDC: Task-Independent Privacy-Respecting Data Crowdsourcing Framework with Anonymized Intermediate Representations
Figure 2 for TIPRDC: Task-Independent Privacy-Respecting Data Crowdsourcing Framework with Anonymized Intermediate Representations
Figure 3 for TIPRDC: Task-Independent Privacy-Respecting Data Crowdsourcing Framework with Anonymized Intermediate Representations
Figure 4 for TIPRDC: Task-Independent Privacy-Respecting Data Crowdsourcing Framework with Anonymized Intermediate Representations

The success of deep learning partially benefits from the availability of various large-scale datasets. These datasets are often crowdsourced from individual users and contain private information like gender, age, etc. The emerging privacy concerns from users on data sharing hinder the generation or use of crowdsourcing datasets and lead to hunger of training data for new deep learning applications. One na\"{\i}ve solution is to pre-process the raw data to extract features at the user-side, and then only the extracted features will be sent to the data collector. Unfortunately, attackers can still exploit these extracted features to train an adversary classifier to infer private attributes. Some prior arts leveraged game theory to protect private attributes. However, these defenses are designed for known primary learning tasks, the extracted features work poorly for unknown learning tasks. To tackle the case where the learning task may be unknown or changing, we present TIPRDC, a task-independent privacy-respecting data crowdsourcing framework with anonymized intermediate representation. The goal of this framework is to learn a feature extractor that can hide the privacy information from the intermediate representations; while maximally retaining the original information embedded in the raw data for the data collector to accomplish unknown learning tasks. We design a hybrid training method to learn the anonymized intermediate representation: (1) an adversarial training process for hiding private information from features; (2) maximally retain original information using a neural-network-based mutual information estimator.

Viaarxiv icon