Zhaoyi Liu

Beyond Universal Transformer: block reusing with adaptor in Transformer for automatic speech recognition

Mar 23, 2023
Haoyu Tang, Zhaoyi Liu, Chang Zeng, Xinfeng Li

Transformer-based models have recently achieved significant results in end-to-end (E2E) automatic speech recognition (ASR), making it possible to deploy E2E ASR systems on smart devices. However, these models still have the drawback of requiring a large number of parameters. To overcome this limitation of universal Transformer models for ASR on edge devices, we propose a solution that reuses blocks in Transformer models for small-footprint ASR systems, accommodating resource constraints without compromising recognition accuracy. Specifically, we design a novel block-reusing strategy for the speech Transformer (BRST) to enhance parameter efficiency, and propose an adapter module (ADM) that produces a compact and adaptable model with only a few additional trainable parameters accompanying each reused block. Experiments on the public AISHELL-1 corpus show that the proposed approach achieves character error rates (CER) of 9.3%/6.63% with only 7.6M/8.3M parameters without and with the ADM, respectively. In addition, we provide a deeper analysis of the effect of the ADM in the general block-reusing method.
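The abstract does not give implementation details, but the core idea, one shared Transformer encoder block applied several times with a small trainable adapter on each pass, can be sketched roughly as below. This is a minimal PyTorch sketch under assumed dimensions; the class names (`Adapter`, `SharedBlockEncoder`), the bottleneck size, and the number of reuses are illustrative, not the authors' BRST/ADM implementation.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: a small residual MLP attached to each reuse of the block."""

    def __init__(self, d_model: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        # Residual connection: the adapter only nudges the shared block's output.
        return x + self.up(torch.relu(self.down(x)))


class SharedBlockEncoder(nn.Module):
    """One Transformer encoder block reused n_reuse times, each pass with its own adapter."""

    def __init__(self, d_model: int = 256, nhead: int = 4, n_reuse: int = 6):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=1024, batch_first=True
        )
        self.adapters = nn.ModuleList([Adapter(d_model) for _ in range(n_reuse)])

    def forward(self, x):
        for adapter in self.adapters:
            x = adapter(self.block(x))  # same block weights on every pass
        return x


if __name__ == "__main__":
    encoder = SharedBlockEncoder()
    feats = torch.randn(2, 100, 256)   # (batch, frames, feature dim)
    print(encoder(feats).shape)        # torch.Size([2, 100, 256])
```

Because the block weights are shared across passes, the parameter count stays close to that of a single layer plus the lightweight adapters, which is the small-footprint motivation described in the abstract.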


Filter and evolve: progressive pseudo label refining for semi-supervised automatic speech recognition

Oct 28, 2022
Zezhong Jin, Dading Zhong, Xiao Song, Zhaoyi Liu, Naipeng Ye, Qingcheng Zeng

Fine-tuning self-supervised pretrained models with pseudo labels can effectively improve speech recognition performance. However, low-quality pseudo labels can mislead decision boundaries and degrade performance. We propose a simple yet effective strategy that filters out low-quality pseudo labels to alleviate this problem. Specifically, pseudo labels are produced over the entire training set and filtered by the average probability scores computed from the model output. An optimal percentage of utterances with high probability scores is then treated as reliable training data with trustworthy labels. The model is iteratively updated to correct the unreliable pseudo labels and minimize the effect of noisy labels. This process is repeated until the unreliable pseudo labels have been adequately corrected. Extensive experiments on LibriSpeech show that these filtered samples enable the refined model to yield more correct predictions, leading to better ASR performance under various experimental settings.
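The filter-and-refine loop described in the abstract can be outlined roughly as follows. The helper callables (`transcribe_with_confidence`, `fine_tune`) and the keep ratio are hypothetical placeholders for a real ASR decoding and training pipeline; this is a sketch of the described strategy, not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PseudoSample:
    utt_id: str
    text: str          # pseudo label (model hypothesis)
    avg_prob: float    # average per-token probability from the model output


def filter_and_refine(
    model,
    unlabeled_utts: List[str],
    transcribe_with_confidence: Callable[[object, str], Tuple[str, float]],
    fine_tune: Callable[[object, List[PseudoSample]], object],
    keep_ratio: float = 0.7,
    n_rounds: int = 3,
):
    """Iteratively pseudo-label, keep the most confident fraction, and retrain."""
    for _ in range(n_rounds):
        # 1) Pseudo-label the whole unlabeled set and score each hypothesis.
        pool = []
        for utt in unlabeled_utts:
            text, avg_prob = transcribe_with_confidence(model, utt)
            pool.append(PseudoSample(utt, text, avg_prob))

        # 2) Keep the top keep_ratio fraction by average probability score.
        pool.sort(key=lambda s: s.avg_prob, reverse=True)
        reliable = pool[: int(len(pool) * keep_ratio)]

        # 3) Fine-tune on the filtered pseudo labels; the next round re-labels
        #    everything, so earlier mistakes can still be corrected.
        model = fine_tune(model, reliable)
    return model
```

The key design point is that every round re-labels the full unlabeled pool with the refined model, so an utterance filtered out (or mislabeled) in one round can re-enter the reliable set later with a corrected label.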


CT-SAT: Contextual Transformer for Sequential Audio Tagging

Mar 22, 2022
Yuanbo Hou, Zhaoyi Liu, Bo Kang, Yun Wang, Dick Botteldooren

Sequential audio event tagging provides not only the types of audio events, but also the order of events and the number of events that occur in an audio clip. Most previous work on audio event sequence analysis relies on connectionist temporal classification (CTC). However, CTC's conditional independence assumption prevents it from effectively learning correlations between diverse audio events. This paper is the first attempt to introduce the Transformer into sequential audio tagging, since Transformers perform well on sequence-related tasks. To better exploit the contextual information of audio event sequences, we draw on the idea of bidirectional recurrent neural networks and propose a contextual Transformer (cTransformer) with a bidirectional decoder that can use both the forward and backward information of event sequences. Experiments on a real-life polyphonic audio dataset show that, compared to CTC-based methods, the cTransformer effectively combines fine-grained acoustic representations from the encoder with coarse-grained audio event cues, exploiting contextual information to recognize and predict audio event sequences.

* Submitted to Interspeech 2022
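A rough PyTorch sketch of a bidirectional decoder in the spirit described above: two Transformer decoder branches read the event-token sequence forward and reversed, both attending to the encoder's acoustic representation, and their outputs are fused. Class and variable names are illustrative, causal masking is omitted for brevity, and this is not the cTransformer implementation.

```python
import torch
import torch.nn as nn


class BiDirectionalEventDecoder(nn.Module):
    """Two decoder branches over forward/reversed event sequences, outputs fused."""

    def __init__(self, n_events: int = 10, d_model: int = 256, nhead: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(n_events, d_model)
        self.fwd = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), n_layers
        )
        self.bwd = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), n_layers
        )
        self.out = nn.Linear(d_model, n_events)

    def forward(self, event_tokens, acoustic_memory):
        # Forward branch reads events left-to-right, backward branch right-to-left;
        # both attend to the encoder's acoustic memory (no causal mask in this sketch).
        f = self.fwd(self.embed(event_tokens), acoustic_memory)
        b = self.bwd(self.embed(event_tokens.flip([1])), acoustic_memory).flip([1])
        return self.out(f + b)  # fuse the two contextual views per event position


if __name__ == "__main__":
    decoder = BiDirectionalEventDecoder()
    tokens = torch.randint(0, 10, (2, 5))   # (batch, event positions)
    memory = torch.randn(2, 100, 256)       # encoder acoustic frames
    print(decoder(tokens, memory).shape)    # torch.Size([2, 5, 10])
```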