
Daehyun Kim


Iterative Compression of End-to-End ASR Model using AutoML

Aug 06, 2020
Abhinav Mehrotra, Łukasz Dudziak, Jinsu Yeo, Young-yoon Lee, Ravichander Vipperla, Mohamed S. Abdelfattah, Sourav Bhattacharya, Samin Ishtiaq, Alberto Gil C. P. Ramos, SangJeong Lee, Daehyun Kim, Nicholas D. Lane


Increasing demand for on-device Automatic Speech Recognition (ASR) systems has resulted in renewed interest in developing automatic model compression techniques. Past research has shown that the AutoML-based Low-Rank Factorization (LRF) technique, when applied to an end-to-end Encoder-Attention-Decoder style ASR model, can achieve a speedup of up to 3.7x, outperforming laborious manual rank-selection approaches. However, we show that current AutoML-based search techniques only work up to a certain compression level, beyond which they fail to produce compressed models with acceptable word error rates (WER). In this work, we propose an iterative AutoML-based LRF approach that achieves over 5x compression without degrading the WER, thereby advancing the state-of-the-art in ASR compression.
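
To make the core operation concrete, here is a minimal Python sketch of low-rank factorization of a single weight matrix via truncated SVD. The fixed `rank` argument stands in for the paper's AutoML-driven per-layer rank search, and the iterative outer loop (factorize, fine-tune, repeat) is only hinted at in a comment; this is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Factor W (m x n) into U (m x rank) @ V (rank x n) via truncated SVD."""
    u, s, vt = np.linalg.svd(W, full_matrices=False)
    U = u[:, :rank] * s[:rank]   # absorb singular values into U
    V = vt[:rank, :]
    return U, V

# One compression step; iterating factorize -> fine-tune -> factorize again
# is the route the paper takes to reach >5x compression.
W = np.random.randn(512, 512)
U, V = low_rank_factorize(W, rank=64)
print(f"params: {W.size} -> {U.size + V.size} "
      f"({W.size / (U.size + V.size):.1f}x smaller)")
```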

* INTERSPEECH 2020  

Attention based on-device streaming speech recognition with large speech corpus

Jan 02, 2020
Kwangyoun Kim, Kyungmin Lee, Dhananjaya Gowda, Junmo Park, Sungsoo Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung, Jungin Lee, Myoungji Han, Chanwoo Kim


In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained a word recognition rate of around 90% for the general domain, mainly by using joint training with connectionist temporal classification (CTC) and cross-entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pre-training, and data augmentation methods. In addition, we compressed our models by a factor of more than 3.4 using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization, bringing the final model size below 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models, achieving an average relative improvement of 36% in word error rate (WER) for target domains, including the general domain.
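
As an illustration of the quantization step, below is a minimal Python sketch of symmetric per-tensor 8-bit post-training weight quantization. The exact scheme used in the paper (granularity, rounding mode, activation handling) is not specified in the abstract, so those details are assumptions.

```python
import numpy as np

def quantize_int8(W: np.ndarray):
    """Symmetric per-tensor quantization: W ~= scale * q with q in int8."""
    scale = float(np.abs(W).max()) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 weight matrix."""
    return q.astype(np.float32) * scale

W = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(W)
err = np.abs(W - dequantize_int8(q, scale)).max()
print(f"float32 -> int8 is ~4x smaller; max abs error {err:.4f}")
```

Storing weights as int8 plus a single float scale per tensor cuts weight memory by roughly 4x relative to float32, which is the kind of reduction the 8-bit step above contributes on top of the LRA compression.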

* Accepted and presented at the ASRU 2019 conference 

ScieNet: Deep Learning with Spike-assisted Contextual Information Extraction

Sep 11, 2019
Xueyuan She, Yun Long, Daehyun Kim, Saibal Mukhopadhyay


Deep neural networks (DNNs) provide high image classification accuracy, but experience significant performance degradation when perturbations from various sources are present in the input. This lack of resilience to input perturbations makes DNNs less reliable for systems interacting with the physical world, such as autonomous vehicles and robotics, where imperfect input is the norm. We present a hybrid deep network architecture with spike-assisted contextual information extraction (ScieNet). ScieNet integrates a spiking neural network (SNN) front-end for unsupervised contextual information extraction with a back-end DNN trained for classification. The integrated network demonstrates high resilience to input perturbations without relying on prior training on perturbed inputs. We demonstrate ScieNet with different back-end DNNs for image classification on the CIFAR dataset, considering stochastic (noise) and structured (rain) input perturbations. Experimental results show significant improvement in accuracy on noisy and rainy images without prior training, while maintaining state-of-the-art accuracy on clean images.
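
To sketch the spiking front-end idea, here is a minimal rate-coded leaky integrate-and-fire (LIF) layer in Python: pixel intensities drive LIF neurons whose firing rates serve as a perturbation-robust context signal for a downstream DNN. All constants, shapes, and the fusion with the back-end classifier are illustrative assumptions rather than ScieNet's exact architecture.

```python
import numpy as np

def lif_rate_code(x: np.ndarray, steps: int = 32, tau: float = 0.9,
                  threshold: float = 1.0) -> np.ndarray:
    """Firing rates of LIF neurons driven by input intensities x in [0, 1]."""
    v = np.zeros_like(x)        # membrane potentials
    spikes = np.zeros_like(x)   # accumulated spike counts
    for _ in range(steps):
        v = tau * v + x                 # leaky integration of input current
        fired = v >= threshold          # neurons crossing the threshold spike
        spikes += fired
        v = np.where(fired, 0.0, v)     # reset membrane after a spike
    return spikes / steps               # firing rate in [0, 1]

img = np.random.rand(32, 32)    # stand-in for one CIFAR image channel
rates = lif_rate_code(img)      # contextual features for the back-end DNN
print(f"mean firing rate: {rates.mean():.3f}")
```

Because the rate code integrates input over many time steps, zero-mean noise tends to average out in the firing rates, which is one intuition for the noise resilience reported above.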
