Yuexian Zou

LAE: Language-Aware Encoder for Monolingual and Multilingual ASR

Jun 05, 2022

Jinchuan Tian, Jianwei Yu, Chunlei Zhang, Chao Weng, Yuexian Zou, Dong Yu

Abstract: Despite the rapid progress in automatic speech recognition (ASR) research, recognizing multilingual speech using a unified ASR system remains highly challenging. Previous works on multilingual speech recognition mainly focus on two directions: recognizing monolingual speech in multiple languages, or recognizing code-switched speech that uses different languages interchangeably within a single utterance. However, a pragmatic multilingual recognizer is expected to be compatible with both directions. In this work, a novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information and generating frame-level language-aware representations during encoding. In the LAE, the primary encoding is implemented by the shared block, while the language-specific blocks are used to extract specific representations for each language. To learn language-specific information discriminatively, a language-aware training method is proposed to optimize the language-specific blocks in LAE. Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between languages at the frame level and shows superior performance on both monolingual and multilingual ASR tasks. With either a real-recorded or simulated code-switched dataset, the proposed LAE achieves statistically significant improvements on both CTC and neural transducer systems. Code is released.

Improving Dual-Microphone Speech Enhancement by Learning Cross-Channel Features with Multi-Head Attention

May 03, 2022

Xinmeng Xu, Rongzhi Gu, Yuexian Zou

Abstract: Hand-crafted spatial features, such as inter-channel intensity difference (IID) and inter-channel phase difference (IPD), play a fundamental role in recent deep learning based dual-microphone speech enhancement (DMSE) systems. However, learning the mutual relationship between artificially designed spatial and spectral features is hard in end-to-end DMSE. In this work, a novel architecture for DMSE using a multi-head cross-attention based convolutional recurrent network (MHCA-CRN) is presented. The proposed MHCA-CRN model includes a channel-wise encoding structure for preserving intra-channel features and a multi-head cross-attention mechanism for fully exploiting cross-channel features. In addition, the proposed approach formulates the decoder with an extra SNR estimator that estimates frame-level SNR under a multi-task learning framework, which is expected to avoid the speech distortion caused by the end-to-end DMSE module. Finally, a spectral gain function is adopted to further suppress the unnatural residual noise. Experimental results demonstrate the superior performance of the proposed model against several state-of-the-art models.

* Accepted by ICASSP 2022
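
For readers unfamiliar with the hand-crafted spatial features named in the abstract above, the following is a minimal sketch of how IID and IPD are commonly computed from a two-channel STFT. It is not the paper's MHCA-CRN model. The helper names `stft` and `spatial_features`, the frame sizes, and the toy signal are all assumptions made for illustration, and the dB-ratio form of IID is only one common definition.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Minimal STFT: Hann-windowed frames -> complex array (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def spatial_features(left, right, eps=1e-8):
    """Return (IID in dB, IPD in radians) for every time-frequency bin."""
    spec_l, spec_r = stft(left), stft(right)
    # IID: log-magnitude ratio between the two channels.
    iid = 20.0 * np.log10((np.abs(spec_l) + eps) / (np.abs(spec_r) + eps))
    # IPD: phase(left) - phase(right), wrapped to [-pi, pi].
    ipd = np.angle(spec_l * np.conj(spec_r))
    return iid, ipd

# Toy input (assumed): the right channel is a half-amplitude copy of the
# left channel, delayed by 2 samples.
rng = np.random.default_rng(0)
src = rng.standard_normal(4096)
left = src
right = 0.5 * np.roll(src, 2)
iid, ipd = spatial_features(left, right)
```

In this toy setup the IID is close to 6 dB in most bins (a factor of two in amplitude), and the IPD grows with frequency because of the fixed delay.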

End-to-end Spoken Conversational Question Answering: Task, Dataset and Model

Apr 29, 2022

Chenyu You, Nuo Chen, Fenglin Liu, Shen Ge, Xian Wu, Yuexian Zou

Abstract: In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way for humans to seek or test their knowledge is through conversation. Therefore, we propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows given the speech documents. In this task, our main objective is to build a system that deals with conversational questions based on the audio recordings, and to explore the plausibility of providing systems with more cues from different modalities during information gathering. To this end, instead of directly adopting automatically generated speech transcripts with highly noisy data, we propose a novel unified data distillation approach, DDNet, which effectively ingests cross-modal information to achieve fine-grained representations of the speech and language modalities. Moreover, we propose a simple and novel mechanism, termed Dual Attention, which encourages better alignment between audio and text to ease the process of knowledge transfer. To evaluate the capacity of SCQA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations. The performance of the existing state-of-the-art methods degrades significantly on our dataset, demonstrating the necessity of cross-modal information integration. Our experimental results demonstrate that our proposed method achieves superior performance in spoken conversational question answering tasks.

* In Findings of NAACL 2022. arXiv admin note: substantial text overlap with arXiv:2010.08923

Speaker-Aware Mixture of Mixtures Training for Weakly Supervised Speaker Extraction

Apr 15, 2022

Zifeng Zhao, Rongzhi Gu, Dongchao Yang, Jinchuan Tian, Yuexian Zou

Abstract: Most existing research adopts supervised training for speaker extraction, while the scarcity of ideally clean corpora and the channel mismatch problem are rarely considered. To this end, we propose speaker-aware mixture of mixtures training (SAMoM), utilizing the consistency of speaker identity among target source, enrollment utterance and target estimate to weakly supervise the training of a deep speaker extractor. In SAMoM, the input is constructed by mixing up different speaker-aware mixtures (SAMs), each containing multiple speakers with their identities known and enrollment utterances available. Informed by enrollment utterances, target speech is extracted from the input one by one, such that the estimated targets can approximate the original SAMs after a remix in accordance with the identity consistency. Moreover, using SAMoM in a semi-supervised setting with a certain amount of clean sources enables application in noisy scenarios. Extensive experiments on Libri2Mix show that the proposed method achieves promising results without access to any clean sources (11.06 dB SI-SDRi). With domain adaptation, our approach even outperforms the supervised framework in a cross-domain evaluation on AISHELL-1.

* 5 pages, 4 tables, 4 figures. Submitted to INTERSPEECH 2022
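
The SI-SDRi figure quoted in the abstract above is the improvement in scale-invariant signal-to-distortion ratio of the estimate over the unprocessed mixture. The sketch below uses the standard textbook definition of SI-SDR. The function names and the toy signals are assumptions for illustration and are unrelated to the paper's evaluation code.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D estimate and a 1-D target."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to remove any gain mismatch.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    return 10.0 * np.log10((np.dot(projection, projection) + eps)
                           / (np.dot(noise, noise) + eps))

def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the estimate over using the raw mixture as output."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)

# Toy signals (assumed): the "extractor" attenuates the interferer by 20 dB.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mixture = target + interferer
estimate = target + 0.1 * interferer
improvement = si_sdr_improvement(estimate, mixture, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate by any non-zero constant leaves the score essentially unchanged.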

RaDur: A Reference-aware and Duration-robust Network for Target Sound Detection

Apr 05, 2022

Dongchao Yang, Helin Wang, Zhongjie Ye, Yuexian Zou, Wenwu Wang

Abstract: Target sound detection (TSD) aims to detect the target sound from a mixture audio given the reference information. Previous methods use a conditional network to extract a sound-discriminative embedding from the reference audio, and then use it to detect the target sound from the mixture audio. However, the network performs very differently with different reference audios (e.g. it performs poorly for noisy and short-duration reference audios), and tends to make wrong decisions for transient events (i.e. shorter than 1 second). To overcome these problems, in this paper, we present a reference-aware and duration-robust network (RaDur) for TSD. More specifically, in order to make the network more aware of the reference information, we propose an embedding enhancement module that takes the mixture audio into account while generating the embedding, and apply attention pooling to enhance the features of target sound-related frames and weaken the features of noisy frames. In addition, a duration-robust focal loss is proposed to help model events of different durations. To evaluate our method, we build two TSD datasets based on UrbanSound and Audioset. Extensive experiments show the effectiveness of our methods.

* Submitted to INTERSPEECH 2022
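
The abstract above does not spell out the duration-robust focal loss, so the sketch below shows only the standard binary focal loss that such variants build on, applied to frame-level detection probabilities. It should not be read as the RaDur loss. The function name, the default `gamma` and `alpha`, and the toy frames are assumptions.

```python
import numpy as np

def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Standard binary focal loss averaged over frames.

    probs  : predicted probability that each frame contains the target sound
    labels : 0/1 ground-truth frame labels
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    # p_t is the probability assigned to the correct class of each frame.
    p_t = np.where(labels == 1, probs, 1.0 - probs)
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma shrinks the contribution of well-classified frames.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

# Toy frames (assumed): confident predictions versus uncertain ones.
labels = np.array([1, 1, 0, 0])
easy = binary_focal_loss(np.array([0.95, 0.90, 0.05, 0.10]), labels)
hard = binary_focal_loss(np.array([0.55, 0.60, 0.45, 0.40]), labels)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half of the usual binary cross-entropy, which is a convenient sanity check.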

A Two-student Learning Framework for Mixed Supervised Target Sound Detection

Apr 05, 2022

Dongchao Yang, Helin Wang, Yuexian Zou, Wenwu Wang

Abstract: Target sound detection (TSD) aims to detect the target sound from mixture audio given the reference information. Previous work shows that good detection performance relies on fully-annotated data. However, collecting fully-annotated data is labor-intensive. Therefore, we consider TSD with mixed supervision, which learns novel categories (target domain) using weak annotations with the help of full annotations of existing base categories (source domain). We propose a novel two-student learning framework, which contains two mutually helping student models (s_student and w_student) that learn from fully- and weakly-annotated datasets, respectively. Specifically, we first propose a frame-level knowledge distillation strategy to transfer the class-agnostic knowledge from s_student to w_student. After that, a pseudo supervised (PS) training is designed to transfer the knowledge from w_student to s_student. Lastly, an adversarial training strategy is proposed, which aims to align the data distribution between source and target domains. To evaluate our method, we build three TSD datasets based on UrbanSound and Audioset. Experimental results show that our methods offer about 8% improvement in event-based F score.

* Submitted to INTERSPEECH 2022

Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches

Apr 04, 2022

Zifeng Zhao, Dongchao Yang, Rongzhi Gu, Haoran Zhang, Yuexian Zou

Abstract: Recently, end-to-end speaker extraction has attracted increasing attention and shown promising results. However, its performance is often inferior to that of a blind source separation (BSS) counterpart with a similar network architecture, because the auxiliary speaker encoder may sometimes generate ambiguous speaker embeddings. Such ambiguous guidance information may confuse the separation network and hence lead to wrong extraction results, which deteriorates the overall performance. We refer to this as the target confusion problem. In this paper, we conduct an analysis of this issue and solve it in two stages. In the training phase, we propose to integrate metric learning methods to improve the distinguishability of embeddings produced by the speaker encoder. For inference, a novel post-filtering strategy is designed to revise the wrong results. Specifically, we first identify these confusion samples by measuring the similarities between output estimates and enrollment utterances, after which the true target sources are recovered by a subtraction operation. Experiments show that a performance improvement of more than 1 dB SI-SDRi can be achieved, which validates the effectiveness of our methods and emphasizes the impact of the target confusion problem.

* 5 pages, 1 table, 5 figures. Submitted to INTERSPEECH 2022
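
The post-filtering idea in the abstract above (detect a confused output by its similarity to the enrollment utterance, then recover the target by subtraction) can be sketched for a two-speaker mixture as follows. This is an illustration under strong assumptions and not the authors' implementation: `toy_embedding` is a hypothetical stand-in for a real speaker encoder, the threshold is arbitrary, and the "speakers" are pure tones.

```python
import numpy as np

def toy_embedding(x):
    """Hypothetical stand-in for a speaker encoder: unit-norm magnitude spectrum."""
    spec = np.abs(np.fft.rfft(x))
    return spec / (np.linalg.norm(spec) + 1e-8)

def post_filter(mixture, estimate, enrollment, threshold=0.5):
    """Return (signal, was_confused) for a two-speaker mixture.

    If the estimate does not resemble the enrollment utterance, assume the
    network extracted the wrong speaker and return mixture - estimate.
    """
    similarity = float(np.dot(toy_embedding(estimate), toy_embedding(enrollment)))
    if similarity < threshold:
        return mixture - estimate, True
    return estimate, False

# Toy data (assumed): two "speakers" are tones at different frequencies.
t = np.arange(8000) / 8000.0
speaker_a = np.sin(2 * np.pi * 200 * t)
speaker_b = np.sin(2 * np.pi * 700 * t)
mixture = speaker_a + speaker_b
enrollment_a = 0.8 * np.sin(2 * np.pi * 200 * t + 0.3)

# Simulate target confusion: the extractor returned speaker B instead of A.
recovered, was_confused = post_filter(mixture, speaker_b, enrollment_a)
```

The subtraction step only makes sense when the mixture contains exactly two sources and the wrong output is a clean copy of the interferer; a real system would have to handle residual noise and more speakers.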

Improving Target Sound Extraction with Timestamp Information

Apr 02, 2022

Helin Wang, Dongchao Yang, Chao Weng, Jianwei Yu, Yuexian Zou

Abstract: Target sound extraction (TSE) aims to extract the sound part of a target sound event class from a mixture audio with multiple sound events. Previous works mainly focus on the problems of weakly-labelled data, joint learning, and new classes; however, the onset and offset times of the target sound event, which have been emphasized in auditory scene analysis, have received little attention. In this paper, we study how to utilize such timestamp information to help extract the target sound via a target sound detection network and a target-weighted time-frequency loss function. More specifically, we use the detection result of a target sound detection (TSD) network as additional information to guide the learning of the target sound extraction network. We also find that the result of TSE can further improve the performance of the TSD network, so a mutual learning framework for target sound detection and extraction is proposed. In addition, a target-weighted time-frequency loss function is designed to pay more attention to the temporal regions of the target sound during training. Experimental results on synthesized data generated from the Freesound Datasets show that our proposed method can significantly improve the performance of TSE.

* Submitted to INTERSPEECH 2022
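
One plausible reading of a "target-weighted time-frequency loss" is a spectrogram error in which frames between the onset and offset of the target event count more than the rest. The sketch below implements that reading only. The function name, the L1 error, and the `extra_weight` parameter are assumptions, and the paper's actual formulation may differ.

```python
import numpy as np

def target_weighted_tf_loss(est_spec, ref_spec, onset, offset, extra_weight=1.0):
    """Weighted L1 loss between spectrograms of shape (frames, bins).

    Frames in [onset, offset) receive weight 1 + extra_weight; all other
    frames receive weight 1. Frame indices are assumed to come from
    timestamp annotations or from a detection network.
    """
    n_frames = est_spec.shape[0]
    weights = np.ones(n_frames)
    weights[onset:offset] += extra_weight
    per_frame_error = np.mean(np.abs(est_spec - ref_spec), axis=1)
    return float(np.sum(weights * per_frame_error) / np.sum(weights))

# Toy example (assumed): the same error placed inside versus outside the
# target region, which spans frames 2..4.
ref = np.zeros((10, 4))
err_inside = ref.copy()
err_inside[2:5] = 1.0
err_outside = ref.copy()
err_outside[6:9] = 1.0
loss_inside = target_weighted_tf_loss(err_inside, ref, onset=2, offset=5)
loss_outside = target_weighted_tf_loss(err_outside, ref, onset=2, offset=5)
```

With these settings an error inside the target region costs twice as much as the same error elsewhere, which is the behaviour the abstract describes qualitatively.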

Integrate Lattice-Free MMI into End-to-End Speech Recognition

Apr 02, 2022

Jinchuan Tian, Jianwei Yu, Chao Weng, Yuexian Zou, Dong Yu

Abstract: In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems. Given this success, the adoption of discriminative criteria is promising to boost the performance of end-to-end (E2E) ASR systems. With this motivation, previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems. However, the effectiveness and efficiency of the MBR-based methods are compromised: the MBR criterion is only used in system training, which creates a mismatch between training and decoding; the on-the-fly decoding process in MBR-based methods results in the need for pre-trained models and slow training speeds. To this end, novel algorithms are proposed in this work to integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI), into E2E ASR systems not only in the training stage but also in the decoding process. The proposed LF-MMI training and decoding methods show their effectiveness on two widely used E2E frameworks: Attention-Based Encoder-Decoders (AEDs) and Neural Transducers (NTs). Compared with MBR-based methods, the proposed LF-MMI method: maintains the consistency between training and decoding; eschews the on-the-fly decoding process; trains from randomly initialized models with superior training efficiency. Experiments suggest that the LF-MMI method outperforms its MBR counterparts and consistently leads to statistically significant performance improvements on various frameworks and datasets from 30 hours to 14.3k hours. The proposed method achieves state-of-the-art (SOTA) results on Aishell-1 (CER 4.10%) and Aishell-2 (CER 5.02%) datasets. Code is released.

Learning Decoupling Features Through Orthogonality Regularization

Mar 31, 2022

Li Wang, Rongzhi Gu, Weiji Zhuang, Peng Gao, Yujun Wang, Yuexian Zou

Abstract: Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications. Research shows that state-of-the-art KWS and SV models are trained independently using different datasets, since they are expected to learn distinctive acoustic features. However, humans can distinguish language content and speaker identity simultaneously. Motivated by this, we believe it is important to explore a method that can effectively extract common features while decoupling task-specific features. Bearing this in mind, a two-branch deep network (KWS branch and SV branch) with the same network structure is developed, and a novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously, where speaker-invariant keyword representations and keyword-invariant speaker representations are expected, respectively. Experiments are conducted on the Google Speech Commands Dataset (GSCD). The results demonstrate that the orthogonality regularization helps the network achieve SOTA EERs of 1.31% and 1.87% on KWS and SV, respectively.

* Accepted at ICASSP 2022
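
The abstract above names orthogonality regularization without giving its form. A common way to encourage two feature branches to carry decoupled information is to penalize the squared cosine similarity between their embeddings, as sketched below. This is an assumed formulation for illustration, not necessarily the one used in the paper, and the names `orthogonality_penalty`, `kws_feats`, and `sv_feats` are hypothetical.

```python
import numpy as np

def orthogonality_penalty(kws_feats, sv_feats, eps=1e-8):
    """Mean squared cosine similarity between paired branch embeddings.

    kws_feats, sv_feats : arrays of shape (batch, dim), one row per utterance.
    The penalty is 0 when every pair is orthogonal and 1 when every pair is
    parallel.
    """
    k = kws_feats / (np.linalg.norm(kws_feats, axis=1, keepdims=True) + eps)
    s = sv_feats / (np.linalg.norm(sv_feats, axis=1, keepdims=True) + eps)
    cosine = np.sum(k * s, axis=1)
    return float(np.mean(cosine ** 2))

# Toy embeddings (assumed): orthogonal pairs versus identical pairs.
kws = np.array([[1.0, 0.0], [0.0, 2.0]])
sv_orthogonal = np.array([[0.0, 3.0], [1.0, 0.0]])
penalty_decoupled = orthogonality_penalty(kws, sv_orthogonal)
penalty_entangled = orthogonality_penalty(kws, kws)
```

In training, a term like this would typically be added to the task losses of both branches with a small weight, so that the shared layers can still learn common features.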
