Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shiliang Zhang

Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation

Sep 19, 2023

Luyao Cheng, Siqi Zheng, Qinglin Zhang, Hui Wang, Yafeng Chen, Qian Chen, Shiliang Zhang

Figure 1 for Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation

Figure 2 for Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation

Figure 3 for Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation

Figure 4 for Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation

Abstract:Speaker diarization has gained considerable attention within speech processing research community. Mainstream speaker diarization rely primarily on speakers' voice characteristics extracted from acoustic signals and often overlook the potential of semantic information. Considering the fact that speech signals can efficiently convey the content of a speech, it is of our interest to fully exploit these semantic cues utilizing language models. In this work we propose a novel approach to effectively leverage semantic information in clustering-based speaker diarization systems. Firstly, we introduce spoken language understanding modules to extract speaker-related semantic information and utilize these information to construct pairwise constraints. Secondly, we present a novel framework to integrate these constraints into the speaker diarization pipeline, enhancing the performance of the entire system. Extensive experiments conducted on the public dataset demonstrate the consistent superiority of our proposed approach over acoustic-only speaker diarization systems.

* Submitted to ICASSP 2024

Via

Access Paper or Ask Questions

Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition

Sep 19, 2023

Ziyang Ma, Wen Wu, Zhisheng Zheng, Yiwei Guo, Qian Chen, Shiliang Zhang, Xie Chen

Figure 1 for Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition

Figure 2 for Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition

Figure 3 for Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition

Figure 4 for Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition

Abstract:In this paper, we explored how to boost speech emotion recognition (SER) with the state-of-the-art speech pre-trained model (PTM), data2vec, text generation technique, GPT-4, and speech synthesis technique, Azure TTS. First, we investigated the representation ability of different speech self-supervised pre-trained models, and we found that data2vec has a good representation ability on the SER task. Second, we employed a powerful large language model (LLM), GPT-4, and emotional text-to-speech (TTS) model, Azure TTS, to generate emotionally congruent text and speech. We carefully designed the text prompt and dataset construction, to obtain the synthetic emotional speech data with high quality. Third, we studied different ways of data augmentation to promote the SER task with synthetic speech, including random mixing, adversarial training, transfer learning, and curriculum learning. Experiments and ablation studies on the IEMOCAP dataset demonstrate the effectiveness of our method, compared with other data augmentation methods, and data augmentation with other synthetic data.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions

FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec

Sep 14, 2023

Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng

Abstract:This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as speech recognition. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes. Based on the toolkit, we further propose the frequency-domain codec models, FreqCodec, which can achieve comparable speech quality with much lower computation and parameter complexity. Experimental results show that, under the same compression ratio, FunCodec can achieve better reconstruction quality compared with other toolkits and released models. We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis. This toolkit is publicly available at https://github.com/alibaba-damo-academy/FunCodec.

* 5 pages, 3 figures, submitted to ICASSP 2024

Via

Access Paper or Ask Questions

Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer

Sep 14, 2023

Peng Wang, Yifan Yang, Zheng Liang, Tian Tan, Shiliang Zhang, Xie Chen

Figure 1 for Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer

Figure 2 for Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer

Figure 3 for Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer

Abstract:In spite of the excellent strides made by end-to-end (E2E) models in speech recognition in recent years, named entity recognition is still challenging but critical for semantic understanding. In order to enhance the ability to recognize named entities in E2E models, previous studies mainly focus on various rule-based or attention-based contextual biasing algorithms. However, their performance might be sensitive to the biasing weight or degraded by excessive attention to the named entity list, along with a risk of false triggering. Inspired by the success of the class-based language model (LM) in named entity recognition in conventional hybrid systems and the effective decoupling of acoustic and linguistic information in the factorized neural Transducer (FNT), we propose a novel E2E model to incorporate class-based LMs into FNT, which is referred as C-FNT. In C-FNT, the language model score of named entities can be associated with the name class instead of its surface form. The experimental results show that our proposed C-FNT presents significant error reduction in named entities without hurting performance in general word recognition.

Via

Access Paper or Ask Questions

SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus

Sep 12, 2023

Haoxu Wang, Fan Yu, Xian Shi, Yuezhang Wang, Shiliang Zhang, Ming Li

Abstract:Multi-Modal automatic speech recognition (ASR) techniques aim to leverage additional modalities to improve the performance of speech recognition systems. While existing approaches primarily focus on video or contextual information, the utilization of extra supplementary textual information has been overlooked. Recognizing the abundance of online conference videos with slides, which provide rich domain-specific information in the form of text and images, we release SlideSpeech, a large-scale audio-visual corpus enriched with slides. The corpus contains 1,705 videos, 1,000+ hours, with 473 hours of high-quality transcribed speech. Moreover, the corpus contains a significant amount of real-time synchronized slides. In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. Through the application of keyword extraction and contextual ASR methods in the benchmark system, we demonstrate the potential of improving speech recognition performance by incorporating textual information from supplementary video slides.

Via

Access Paper or Ask Questions

SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability

Aug 16, 2023

Xian Shi, Yexin Yang, Zerui Li, Shiliang Zhang

Figure 1 for SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability

Figure 2 for SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability

Figure 3 for SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability

Figure 4 for SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability

Abstract:Hotword customization is one of the important issues remained in ASR field - it is of value to enable users of ASR systems to customize names of entities, persons and other phrases. The past few years have seen both implicit and explicit modeling strategies for ASR contextualization developed. While these approaches have performed adequately, they still exhibit certain shortcomings such as instability in effectiveness. In this paper we propose Semantic-augmented Contextual-Paraformer (SeACo-Paraformer) a novel NAR based ASR system with flexible and effective hotword customization ability. It combines the accuracy of the AED-based model, the efficiency of the NAR model, and the excellent performance in contextualization. In 50,000 hours industrial big data experiments, our proposed model outperforms strong baselines in customization and general ASR tasks. Besides, we explore an efficient way to filter large scale incoming hotwords for further improvement. The source codes and industrial models proposed and compared are all opened as well as two hotword test sets.

* updated draft

Via

Access Paper or Ask Questions

MixBCT: Towards Self-Adapting Backward-Compatible Training

Aug 14, 2023

Yu Liang, Shiliang Zhang, Yaowei Wang, Sheng Xiao, Kenli Li, Xiaoyu Wang

Figure 1 for MixBCT: Towards Self-Adapting Backward-Compatible Training

Figure 2 for MixBCT: Towards Self-Adapting Backward-Compatible Training

Figure 3 for MixBCT: Towards Self-Adapting Backward-Compatible Training

Figure 4 for MixBCT: Towards Self-Adapting Backward-Compatible Training

Abstract:The exponential growth of data, alongside advancements in model structures and loss functions, has necessitated the enhancement of image retrieval systems through the utilization of new models with superior feature embeddings. However, the expensive process of updating the old retrieval database by replacing embeddings poses a challenge. As a solution, backward-compatible training can be employed to avoid the necessity of updating old retrieval datasets. While previous methods achieved backward compatibility by aligning prototypes of the old model, they often overlooked the distribution of the old features, thus limiting their effectiveness when the old model's low quality leads to a weakly discriminative feature distribution. On the other hand, instance-based methods like L2 regression take into account the distribution of old features but impose strong constraints on the performance of the new model itself. In this paper, we propose MixBCT, a simple yet highly effective backward-compatible training method that serves as a unified framework for old models of varying qualities. Specifically, we summarize four constraints that are essential for ensuring backward compatibility in an ideal scenario, and we construct a single loss function to facilitate backward-compatible training. Our approach adaptively adjusts the constraint domain for new features based on the distribution of the old embeddings. We conducted extensive experiments on the large-scale face recognition datasets MS1Mv3 and IJB-C to verify the effectiveness of our method. The experimental results clearly demonstrate its superiority over previous methods. Code is available at https://github.com/yuleung/MixBCT

Via

Access Paper or Ask Questions

Rethinking the visual cues in audio-visual speaker extraction

Jun 05, 2023

Junjie Li, Meng Ge, Zexu pan, Rui Cao, Longbiao Wang, Jianwu Dang, Shiliang Zhang

Figure 1 for Rethinking the visual cues in audio-visual speaker extraction

Figure 2 for Rethinking the visual cues in audio-visual speaker extraction

Figure 3 for Rethinking the visual cues in audio-visual speaker extraction

Abstract:The Audio-Visual Speaker Extraction (AVSE) algorithm employs parallel video recording to leverage two visual cues, namely speaker identity and synchronization, to enhance performance compared to audio-only algorithms. However, the visual front-end in AVSE is often derived from a pre-trained model or end-to-end trained, making it unclear which visual cue contributes more to the speaker extraction performance. This raises the question of how to better utilize visual cues. To address this issue, we propose two training strategies that decouple the learning of the two visual cues. Our experimental results demonstrate that both visual cues are useful, with the synchronization cue having a higher impact. We introduce a more explainable model, the Decoupled Audio-Visual Speaker Extraction (DAVSE) model, which leverages both visual cues.

* Accepted in Interspeech 2023

Via

Access Paper or Ask Questions

speech and noise dual-stream spectrogram refine network with speech distortion loss for robust speech recognition

May 30, 2023

Haoyu Lu, Nan Li, Tongtong Song, Longbiao Wang, Jianwu Dang, Xiaobao Wang, Shiliang Zhang

Abstract:In recent years, the joint training of speech enhancement front-end and automatic speech recognition (ASR) back-end has been widely used to improve the robustness of ASR systems. Traditional joint training methods only use enhanced speech as input for the backend. However, it is difficult for speech enhancement systems to directly separate speech from input due to the diverse types of noise with different intensities. Furthermore, speech distortion and residual noise are often observed in enhanced speech, and the distortion of speech and noise is different. Most existing methods focus on fusing enhanced and noisy features to address this issue. In this paper, we propose a dual-stream spectrogram refine network to simultaneously refine the speech and noise and decouple the noise from the noisy input. Our proposed method can achieve better performance with a relative 8.6% CER reduction.

Via

Access Paper or Ask Questions

Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System

May 25, 2023

Xian Shi, Haoneng Luo, Zhifu Gao, Shiliang Zhang, Zhijie Yan

Figure 1 for Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System

Figure 2 for Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System

Figure 3 for Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System

Figure 4 for Accurate and Reliable Confidence Estimation Based on Non-Autoregressive End-to-End Speech Recognition System

Abstract:Estimating confidence scores for recognition results is a classic task in ASR field and of vital importance for kinds of downstream tasks and training strategies. Previous end-to-end~(E2E) based confidence estimation models (CEM) predict score sequences of equal length with input transcriptions, leading to unreliable estimation when deletion and insertion errors occur. In this paper we proposed CIF-Aligned confidence estimation model (CA-CEM) to achieve accurate and reliable confidence estimation based on novel non-autoregressive E2E ASR model - Paraformer. CA-CEM utilizes the modeling character of continuous integrate-and-fire (CIF) mechanism to generate token-synchronous acoustic embedding, which solves the estimation failure issue above. We measure the quality of estimation with AUC and RMSE in token level and ECE-U - a proposed metrics in utterance level. CA-CEM gains 24% and 19% relative reduction on ECE-U and also better AUC and RMSE on two test sets. Furthermore, we conduct analysis to explore the potential of CEM for different ASR related usage.

* 5 pages, 4 figures, Interspeech2023

Via

Access Paper or Ask Questions