Abstract: The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and a boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides the typical character error rate (CER), we introduce the utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to the original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.
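To make the serialization scheme concrete, the sketch below shows how SOT-style training targets are typically built: speaker turns are concatenated in first-in-first-out order and separated by a special speaker-change token. The token name and data layout here are illustrative assumptions, not the paper's actual code.

```python
SC_TOKEN = "<sc>"  # hypothetical name for the speaker-change token

def serialize_sot_target(turns):
    """turns: list of (start_time, transcript) tuples, one per speaker turn."""
    ordered = sorted(turns, key=lambda t: t[0])  # first-in-first-out order
    return f" {SC_TOKEN} ".join(text for _, text in ordered)

# Two overlapping speakers in one mixture become a single target sequence:
print(serialize_sot_target([(0.0, "hello there"), (1.2, "hi how are you")]))
# -> "hello there <sc> hi how are you"
```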




Abstract: For speech interaction, voice activity detection (VAD) is often used as a front-end. However, traditional VAD algorithms usually need to wait for a continuous tail silence to reach a preset maximum duration before segmentation, resulting in high latency that degrades the user experience. In this paper, we propose a novel semantic VAD for low-latency segmentation. Different from existing methods, a frame-level punctuation prediction task is added to the semantic VAD, and an artificial endpoint class is included alongside the commonly used speech-presence and speech-absence categories. To enhance the semantic information of the model, we also incorporate an automatic speech recognition (ASR) related semantic loss. Evaluations on an internal dataset show that the proposed method reduces the average latency by 53.3% without significant deterioration of the character error rate in the back-end ASR compared to the traditional VAD approach.
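The latency argument can be illustrated with a toy comparison (not the paper's model): a conventional VAD must observe a fixed run of silence frames before cutting, whereas a frame-level classifier with an explicit endpoint class can cut immediately. The class labels and frame counts below are assumptions.

```python
SPEECH, SILENCE, ENDPOINT = 0, 1, 2  # assumed frame-level classes

def traditional_cut(labels, max_tail_silence=30):
    """Cut only after a fixed run of consecutive silence frames."""
    run = 0
    for i, lab in enumerate(labels):
        run = run + 1 if lab == SILENCE else 0
        if run >= max_tail_silence:
            return i
    return None

def semantic_cut(labels):
    """Cut as soon as a frame is classified as an artificial endpoint."""
    for i, lab in enumerate(labels):
        if lab == ENDPOINT:
            return i
    return None
```

With, say, a 10 ms frame shift and max_tail_silence=30, the traditional policy always pays at least 300 ms of tail latency; that fixed overhead is what the endpoint class removes.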




Abstract: Recently, speaker-attributed automatic speech recognition (SA-ASR), which aims to answer the question ``who spoke what'', has attracted wide attention. Different from modular systems, end-to-end (E2E) SA-ASR minimizes the speaker-dependent recognition errors directly and shows promising applicability. In this paper, we propose a context-aware SA-ASR (CASA-ASR) model by enhancing the contextual modeling ability of E2E SA-ASR. Specifically, in CASA-ASR, a contextual text encoder is introduced to aggregate the semantic information of the whole utterance, and a context-dependent scorer is employed to model speaker discriminability by contrasting each speaker with the others in the context. In addition, a two-pass decoding strategy is proposed to fully leverage the contextual modeling ability, resulting in better recognition performance. Experimental results on the AliMeeting corpus show that the proposed CASA-ASR model outperforms the original E2E SA-ASR system with a relative improvement of 11.76% in terms of speaker-dependent character error rate.
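As a rough architectural sketch (all dimensions and layer choices below are placeholder assumptions), the context-dependent scorer can be pictured as letting speaker profiles attend to one another before they are scored against the frame sequence, so each speaker's score is shaped by the other speakers in the context:

```python
import torch
import torch.nn as nn

class ContextDependentScorer(nn.Module):
    """Scores frames against speaker profiles while letting each speaker's
    score depend on the other speakers present in the context."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Speaker profiles attend to one another, so a profile is contrasted
        # with the other speakers before it is scored.
        self.speaker_mixer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, frames, profiles):
        # frames: (B, T, D) decoder states; profiles: (B, S, D) speaker embeddings
        mixed = self.speaker_mixer(profiles)
        return torch.einsum("btd,bsd->bts", frames, mixed)  # (B, T, S) scores
```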




Abstract: Recently, the recurrent neural network transducer (RNN-T) has gained increasing popularity due to its natural streaming capability and superior performance. Nevertheless, RNN-T training requires substantial time and computational resources, as the RNN-T loss calculation is slow and memory-intensive. Another limitation of RNN-T is that it tends to access more future context for better performance, leading to higher emission latency in streaming ASR. In this paper we propose the boundary-aware transducer (BAT) for memory-efficient and low-latency ASR. In BAT, the lattice for RNN-T loss computation is reduced to a restricted region selected by the alignment from continuous integrate-and-fire (CIF), which is jointly optimized with the RNN-T model. Extensive experiments demonstrate that, compared to RNN-T, BAT significantly reduces time and memory consumption in training and achieves good CER-latency trade-offs in inference for streaming ASR.
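The core idea can be sketched as a mask over the T x U transducer lattice (the band half-width and mask form below are assumptions for illustration): only cells near the monotonic alignment implied by the CIF firing frames are kept, which is what saves memory and computation.

```python
import numpy as np

def banded_lattice_mask(fire_frames, num_frames, half_width=2):
    """fire_frames[u]: frame at which CIF emits token u (monotone increasing)."""
    U = len(fire_frames)
    mask = np.zeros((num_frames, U + 1), dtype=bool)
    # Before the first token fires, only the u=0 column is reachable.
    mask[: min(num_frames, fire_frames[0] + half_width + 1), 0] = True
    for u, t_fire in enumerate(fire_frames, start=1):
        lo = max(0, t_fire - half_width)
        hi = min(num_frames, t_fire + half_width + 1)
        mask[lo:hi, u] = True
    return mask  # evaluate the transducer loss only where mask is True

print(banded_lattice_mask([3, 7, 11], num_frames=15).sum(), "of", 15 * 4, "cells kept")
```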




Abstract: This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications. FunASR offers models trained on large-scale industrial corpora and the ability to deploy them in applications. The toolkit's flagship model, Paraformer, is a non-autoregressive end-to-end speech recognition model trained on a manually annotated Mandarin speech recognition dataset containing 60,000 hours of speech. To improve the performance of Paraformer, we have added timestamp prediction and hotword customization capabilities to the standard Paraformer backbone. In addition, to facilitate model deployment, we have open-sourced a voice activity detection model based on the Feedforward Sequential Memory Network (FSMN-VAD) and a text post-processing punctuation model based on the controllable time-delay Transformer (CT-Transformer), both trained on industrial corpora. These functional modules provide a solid foundation for building high-precision long-audio speech recognition services. Compared to other models trained on open datasets, Paraformer demonstrates superior performance.
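For orientation, here is how these modules compose in a typical FunASR pipeline. The AutoModel interface and model names follow the toolkit's documentation at the time of writing and may differ across versions, so treat this as a usage sketch rather than a guaranteed API:

```python
from funasr import AutoModel

# Paraformer ASR plus the FSMN-VAD and CT-Transformer modules described above.
model = AutoModel(
    model="paraformer-zh",   # Paraformer trained on industrial Mandarin data
    vad_model="fsmn-vad",    # voice activity detection front-end
    punc_model="ct-punc",    # punctuation restoration post-processing
)
result = model.generate(input="long_meeting_audio.wav")  # path is illustrative
print(result[0]["text"])
```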




Abstract: Recently, end-to-end neural diarization (EEND) has been introduced and achieves promising results in speaker-overlapped scenarios. In EEND, speaker diarization is formulated as a multi-label prediction problem, where speaker activities are estimated independently and their dependencies are not well considered. To overcome these disadvantages, we employ power set encoding to reformulate speaker diarization as a single-label classification problem and propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and dependencies can be modeled explicitly. Inspired by the success of two-stage hybrid systems, we further propose a novel Two-stage OverLap-aware Diarization framework (TOLD) by involving a speaker overlap-aware post-processing (SOAP) model to iteratively refine the diarization results of EEND-OLA. Experimental results show that, compared with the original EEND, the proposed EEND-OLA achieves a 14.39% relative improvement in terms of diarization error rate (DER), and utilizing SOAP provides another 19.33% relative improvement. As a result, our method, TOLD, achieves a DER of 10.14% on the CALLHOME dataset, which, to the best of our knowledge, is a new state-of-the-art result on this benchmark.
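Power set encoding itself is small enough to show in full: with S speakers, each frame's set of active speakers maps to one of 2**S class indices, so overlapped diarization becomes a single softmax instead of S independent sigmoids. This is a minimal sketch; the paper's exact label ordering may differ.

```python
def pse_encode(active_speakers, num_speakers):
    """Set of active speaker indices -> single power-set class index."""
    return sum(1 << s for s in active_speakers if s < num_speakers)

def pse_decode(label, num_speakers):
    """Power-set class index -> set of active speaker indices."""
    return {s for s in range(num_speakers) if label >> s & 1}

# With S=3, speakers {0, 2} active maps to binary 101 = class 5.
assert pse_encode({0, 2}, 3) == 5
assert pse_decode(5, 3) == {0, 2}
```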




Abstract: Conventional ASR systems use frame-level phoneme posteriors to perform force-alignment (FA) and provide timestamps, while end-to-end ASR systems, especially attention-based encoder-decoder (AED) ones, lack this ability. This paper proposes to perform timestamp prediction (TP) during recognition by utilizing the continuous integrate-and-fire (CIF) mechanism in the non-autoregressive ASR model Paraformer. Focusing on the firing-position bias of CIF, we apply post-processing strategies including fire-delay and silence insertion. In addition, we propose scaled-CIF to smooth the weights of the CIF output, which proves beneficial for both the ASR and TP tasks. Accumulated averaging shift (AAS) and diarization error rate (DER) are adopted to measure timestamp quality, and we compare the proposed system with a conventional hybrid force-alignment system on these metrics. Experimental results on a manually annotated timestamp test set show that the proposed optimization methods significantly improve the accuracy of CIF timestamps, reducing AAS and DER by 66.7% and 82.1%, respectively. Compared to Kaldi force-alignment trained on the same data, the optimized CIF timestamps achieve a 12.3% relative AAS reduction.
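The CIF mechanism that produces these timestamps can be summarized in a few lines: per-frame weights are accumulated, and a token fires whenever the integral crosses a threshold, with the firing frame serving as the token's timestamp. The threshold handling here is a simplified assumption.

```python
def cif_fire_frames(weights, threshold=1.0):
    """Accumulate per-frame CIF weights; emit a token each time the
    integral crosses the threshold, recording the frame as its timestamp."""
    fires, acc = [], 0.0
    for t, w in enumerate(weights):
        acc += w
        while acc >= threshold:
            fires.append(t)   # a token fires at frame t
            acc -= threshold
    return fires

print(cif_fire_frames([0.2, 0.5, 0.4, 0.1, 0.9, 0.3]))  # -> [2, 4]
```

The fire-delay and silence-insertion strategies then adjust these raw firing positions, since CIF tends to fire with a systematic offset from the true acoustic boundary.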




Abstract: In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between the speech and text modalities, especially for Mandarin. Unlike English and other languages with alphabetic writing systems, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks utilizing unlabeled speech and text data, where speech-pseudo-code pairs and phoneme-text pairs supplement the supervised speech-text pairs. To train the encoder to learn better speech representations, we introduce a self-supervised masked speech prediction (MSP) task and a supervised phoneme prediction (PP) task that learn to map speech into phonemes. In addition, we directly add the downstream supervised speech-to-text (S2T) task into the pre-training process, which can further improve pre-training performance and achieve better recognition results even without fine-tuning. Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement over other pre-training methods.
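Mechanically, the five-task objective reduces to a weighted sum, with each term computed on its own paired or unlabeled data stream. The task keys and unit weights below are illustrative assumptions:

```python
TASKS = ("s2c", "p2t", "msp", "pp", "s2t")  # the five tasks named above

def mmspeech_loss(task_losses, weights=None):
    """task_losses: dict mapping each task key to its scalar loss value."""
    weights = weights or {k: 1.0 for k in TASKS}
    return sum(weights[k] * task_losses[k] for k in TASKS)

print(mmspeech_loss({"s2c": 1.2, "p2t": 0.8, "msp": 2.0, "pp": 0.5, "s2t": 1.5}))
```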




Abstract: As an important data selection schema, active learning emerges as an essential component when iterating an Artificial Intelligence (AI) model. It becomes even more critical given the dominance in applications of deep neural network based models, which have large numbers of parameters and are data hungry. Despite its indispensable role in developing AI models, research on active learning is not as intensive as that on other directions. In this paper, we present a review of active learning through the lens of deep active learning approaches from the following perspectives: 1) technical advancements in active learning, 2) applications of active learning in computer vision, 3) industrial systems that leverage, or have the potential to leverage, active learning for data iteration, and 4) current limitations and future research directions. We expect this paper to clarify the significance of active learning in a modern AI model manufacturing process and to bring additional research attention to active learning. By addressing data automation challenges and working in concert with automated machine learning systems, active learning can facilitate the democratization of AI technologies by boosting model production at scale.
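As background for readers new to the area, the selection step at the heart of most deep active learning systems looks like the following least-confidence uncertainty sampling sketch. The scikit-learn-style predict_proba interface is an assumption, not a system from the survey:

```python
import numpy as np

def select_batch(model, unlabeled_x, budget=100):
    """Pick the `budget` most uncertain unlabeled examples to annotate next."""
    probs = model.predict_proba(unlabeled_x)   # (N, C) class posteriors
    uncertainty = 1.0 - probs.max(axis=1)      # least-confidence score
    return np.argsort(-uncertainty)[:budget]   # indices of the chosen batch
```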




Abstract: Recently, hybrid systems of clustering and neural diarization models have been successfully applied to multi-party meeting analysis. However, current models typically treat overlapped speaker diarization as a multi-label classification problem, where speaker dependency and overlaps are not well considered. To overcome these disadvantages, we reformulate overlapped speaker diarization as a single-label prediction problem via the proposed power set encoding (PSE). Through this formulation, speaker dependency and overlaps can be modeled explicitly. To fully leverage this formulation, we further propose the speaker overlap-aware neural diarization (SOND) model, which consists of a context-independent (CI) scorer to model global speaker discriminability, a context-dependent (CD) scorer to model local discriminability, and a speaker combining network (SCN) to combine and reassign speaker activities. Experimental results show that the proposed formulation outperforms state-of-the-art methods based on target-speaker voice activity detection, and the performance can be further improved with SOND, resulting in a 6.30% relative reduction in diarization error.
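The three modules can be pictured as the following schematic composition (all layer choices and sizes are placeholder assumptions): a CI similarity score between frames and speaker profiles, refined by a CD scorer over the local context, then mapped by the combining network onto power-set activity classes.

```python
import torch
import torch.nn as nn

class SONDSketch(nn.Module):
    """CI scorer -> CD scorer -> combining network, ending in power-set classes."""

    def __init__(self, num_speakers=4):
        super().__init__()
        self.cd_scorer = nn.TransformerEncoderLayer(
            d_model=num_speakers, nhead=1, batch_first=True)
        self.combine = nn.Linear(num_speakers, 2 ** num_speakers)

    def forward(self, frames, profiles):
        # frames: (B, T, D) frame features; profiles: (B, S, D) speaker embeddings
        ci = torch.einsum("btd,bsd->bts", frames, profiles)  # global similarity
        cd = self.cd_scorer(ci)      # refine scores using the local context
        return self.combine(cd)      # (B, T, 2**S) power-set activity logits
```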