Abstract:Confusing words are commonly encountered in real-life keyword spotting (KWS) applications and cause severe performance degradation, owing to complex spoken terms and the many words that sound similar to the predefined keywords. To enhance the wake-word detection system's robustness in such scenarios, we investigate two data augmentation setups for training end-to-end KWS systems. One involves synthesized data from a multi-speaker speech synthesis system; the other adds random noise to the acoustic features. Experimental results show that both augmentations improve the system's robustness. Moreover, by augmenting the training set with synthetic data generated by the multi-speaker text-to-speech system, we achieve a significant improvement in the confusing-word scenario.
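A minimal sketch of the second augmentation, adding random noise to acoustic features. The abstract does not specify the noise distribution or scale, so the Gaussian form and the `sigma` value below are assumptions for illustration, not the authors' recipe.

```python
import numpy as np

def augment_with_noise(features, sigma=0.1, rng=None):
    """Add zero-mean Gaussian noise to a (frames, dims) acoustic-feature matrix.

    `sigma` is a hypothetical noise scale; the paper does not state the
    distribution or magnitude actually used.
    """
    rng = rng or np.random.default_rng()
    return features + rng.normal(0.0, sigma, size=features.shape)

# Usage: given fbank, a (T, 80) log-mel feature matrix:
# noisy = augment_with_noise(fbank, sigma=0.05)
```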
Abstract:Modeling voices for multiple speakers and multiple languages in a single text-to-speech system has long been a challenge. This paper presents an extension of Tacotron2 that achieves bilingual multispeaker speech synthesis when only limited data are available for each language. We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers. The two languages share the same phonemic input representation, while the language attribute and the speaker identity are controlled independently by language tokens and speaker embeddings, respectively. In addition, we investigate the model's performance on cross-lingual synthesis with and without a bilingual dataset during training. With the bilingual dataset, the model not only generates high-fidelity speech for all speakers in the language they speak, but also generates accented yet fluent and intelligible speech for monolingual speakers in their non-native language; for example, the Mandarin speaker can speak English fluently. Furthermore, the model trained with the bilingual dataset is robust for code-switching text-to-speech, as shown in our results and provided samples (https://caizexin.github.io/mlms-syn-samples/index.html).
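One plausible realization of the independently controlled attributes described above is to condition the text-encoder output on a learned language token and a speaker embedding. The additive-broadcast design, module name, and dimensions below are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConditionedEncoderOutput(nn.Module):
    """Hypothetical conditioning module: broadcasts a learned language token
    and a projected speaker embedding over the Tacotron2-style encoder output,
    so language and speaker can be set independently at synthesis time."""

    def __init__(self, enc_dim=512, num_languages=2, spk_dim=256):
        super().__init__()
        self.lang_emb = nn.Embedding(num_languages, enc_dim)
        self.spk_proj = nn.Linear(spk_dim, enc_dim)

    def forward(self, enc_out, lang_id, spk_emb):
        # enc_out: (batch, time, enc_dim); lang_id: (batch,); spk_emb: (batch, spk_dim)
        lang = self.lang_emb(lang_id).unsqueeze(1)   # (batch, 1, enc_dim)
        spk = self.spk_proj(spk_emb).unsqueeze(1)    # (batch, 1, enc_dim)
        return enc_out + lang + spk                  # broadcast over time
```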
Abstract:In recent years, end-to-end text-to-speech models have become able to synthesize high-fidelity speech. However, accessing and controlling speech attributes such as speaker identity, prosody, and emotion in a text-to-speech system remains a challenge. This paper presents a multispeaker speech synthesis system with a feedback constraint. We enhance knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network in training. The constraint is imposed through an added speaker-identity loss that improves the speaker similarity between the synthesized speech and its natural reference audio. The model is trained and evaluated on publicly available datasets. Experimental results, including a visualization of the speaker embedding space, show significant improvement in speaker identity cloning at the spectrogram level. Synthesized samples are available online for listening (https://caizexin.github.io/mlspk-syn-samples/index.html).
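A sketch of how such a speaker-identity feedback loss could look, assuming a cosine-similarity formulation between embeddings from the speaker-verification network; the exact loss in the paper may differ, and `sv_net` is a placeholder for any embedding extractor.

```python
import torch
import torch.nn.functional as F

def speaker_feedback_loss(sv_net, synth_mel, ref_mel):
    """Cosine-based speaker-identity loss between synthesized and reference
    spectrograms, computed with a speaker-verification network `sv_net`
    (typically frozen). A sketch of the feedback constraint, not the exact loss."""
    e_synth = F.normalize(sv_net(synth_mel), dim=-1)  # (batch, emb_dim)
    e_ref = F.normalize(sv_net(ref_mel), dim=-1)      # (batch, emb_dim)
    # 1 - cosine similarity, averaged over the batch
    return (1.0 - (e_synth * e_ref).sum(dim=-1)).mean()
```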
Abstract:This paper describes a conditional neural network architecture for Mandarin Chinese polyphone disambiguation. The system is composed of a bidirectional recurrent neural network that acts as a sentence encoder and accumulates context correlations, followed by a prediction network that maps the polyphonic character embeddings, together with the conditional features, to the corresponding pronunciations. We obtain the word-level condition from a pre-trained word-to-vector lookup table. One goal of polyphone disambiguation is to address the homograph problem in the front-end processing of Mandarin Chinese text-to-speech systems. Our system achieves an accuracy of 94.69\% on a publicly available polyphonic character dataset. To further validate our choice of conditional feature, we investigate polyphone disambiguation systems with conditions at multiple levels. The experimental results show that both sentence-level and word-level conditional embedding features attain good performance for Mandarin Chinese polyphone disambiguation.
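A sketch of the described architecture under stated assumptions: a bidirectional LSTM as the sentence encoder, and the polyphonic character's hidden state concatenated with the pre-trained word vector before a pronunciation classifier. All dimensions and the concatenation scheme are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class PolyphoneDisambiguator(nn.Module):
    """BiLSTM sentence encoder + prediction network conditioned on a
    pre-trained word-level vector; a minimal sketch of the abstract's design."""

    def __init__(self, char_vocab, word_dim=300, hidden=256, num_prons=100):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, 128)
        self.encoder = nn.LSTM(128, hidden, batch_first=True, bidirectional=True)
        self.predictor = nn.Sequential(
            nn.Linear(2 * hidden + word_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_prons))

    def forward(self, char_ids, poly_pos, word_vec):
        # char_ids: (B, T); poly_pos: (B,) index of the polyphonic character;
        # word_vec: (B, word_dim) from the pre-trained word-to-vector table.
        h, _ = self.encoder(self.char_emb(char_ids))       # (B, T, 2*hidden)
        poly_h = h[torch.arange(h.size(0)), poly_pos]      # (B, 2*hidden)
        return self.predictor(torch.cat([poly_h, word_vec], dim=-1))
```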
Abstract:In this paper, we apply NetFV and NetVLAD layers to the end-to-end language identification task. The NetFV and NetVLAD layers are differentiable implementations of the standard Fisher Vector and Vector of Locally Aggregated Descriptors (VLAD) methods, respectively. Both encode a sequence of feature vectors into a fixed-dimensional vector, which is essential for processing variable-length utterances. We first present the connections and differences between the classical i-vector and the aforementioned encoding schemes. We then construct a flexible end-to-end framework, consisting of a convolutional neural network (CNN) architecture and an encoding layer (NetFV or NetVLAD), for the language identification task. Experimental results on the NIST LRE 2007 closed-set task show that the proposed system achieves significant EER reductions over both the conventional i-vector baseline and the CNN temporal average pooling system.
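A sketch of a standard NetVLAD encoding layer, which soft-assigns each frame-level feature to learned clusters and aggregates the residuals into a fixed-length utterance vector. The cluster count and feature dimension are illustrative defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """NetVLAD encoding layer: soft-assigns each frame to K learned clusters
    and sums per-cluster residuals into a fixed K*D utterance representation."""

    def __init__(self, num_clusters=64, dim=128):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)        # soft-assignment logits
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):
        # x: (B, T, D) frame-level features of variable length T.
        a = F.softmax(self.assign(x), dim=-1)             # (B, T, K)
        resid = x.unsqueeze(2) - self.centroids           # (B, T, K, D)
        vlad = (a.unsqueeze(-1) * resid).sum(dim=1)       # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                  # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)       # (B, K*D)
```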
Abstract:A novel learnable dictionary encoding layer is proposed in this paper for end-to-end language identification. It is in line with the conventional GMM i-vector approach both theoretically and practically. We imitate the mechanism of traditional GMM training and supervector encoding on top of a CNN. The proposed layer can accumulate high-order statistics from a variable-length input sequence and generate an utterance-level fixed-dimensional vector representation. Unlike conventional methods, our approach provides an end-to-end learning framework in which the inherent dictionary is learned directly from the loss function; the dictionary and the encoding representation for the classifier are learned jointly. The representation is orderless and therefore appropriate for language identification. We conducted a preliminary experiment on the NIST LRE07 closed-set task, and the results reveal that our proposed dictionary encoding layer achieves a significant error reduction compared with simple average pooling.
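A sketch of a learnable dictionary encoding (LDE) layer consistent with the description above: dictionary components and smoothing factors are learned end-to-end, and per-frame residuals are softly assigned and pooled into an orderless, fixed-length vector. Component count and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDE(nn.Module):
    """Learnable dictionary encoding: each frame's residuals to C learned
    components are weighted by a distance-based soft assignment and averaged
    over time, yielding a fixed C*D utterance-level representation."""

    def __init__(self, num_components=64, dim=128):
        super().__init__()
        self.dictionary = nn.Parameter(torch.randn(num_components, dim))
        self.smoothing = nn.Parameter(torch.ones(num_components))

    def forward(self, x):
        # x: (B, T, D) variable-length frame features.
        resid = x.unsqueeze(2) - self.dictionary           # (B, T, C, D)
        dist = resid.pow(2).sum(dim=-1)                    # (B, T, C)
        w = F.softmax(-self.smoothing * dist, dim=-1)      # soft assignment
        enc = (w.unsqueeze(-1) * resid).mean(dim=1)        # (B, C, D)
        return enc.flatten(1)                              # (B, C*D)
```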
Abstract:A novel interpretable end-to-end learning scheme for language identification is proposed. It is in line with the classical GMM i-vector methods both theoretically and practically. In the end-to-end pipeline, a general encoding layer is employed on top of the front-end CNN so that it can automatically encode the variable-length input sequence into an utterance-level vector. After comparing with the state-of-the-art GMM i-vector methods, we give insights into the CNN and reveal its role and effect in the whole pipeline. We further explain why general encoding layers are appropriate for language identification, and elaborate on several typical ones, including a temporal average pooling layer, a recurrent encoding layer, and a novel learnable dictionary encoding layer. We conducted experiments on the NIST LRE07 closed-set task, and the results show that our proposed end-to-end systems achieve state-of-the-art performance.
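The simplest of the encoding layers mentioned above, temporal average pooling, can be sketched as follows; the length-masking for padded batches is an added practical detail, not something the abstract specifies.

```python
import torch
import torch.nn as nn

class TemporalAveragePooling(nn.Module):
    """Mean over the time axis: maps (B, T, D) frame-level features from the
    front-end CNN to a (B, D) utterance-level vector, regardless of T."""

    def forward(self, x, lengths=None):
        if lengths is None:
            return x.mean(dim=1)
        # mask out padded frames when utterances in the batch differ in length
        mask = (torch.arange(x.size(1), device=x.device)[None, :]
                < lengths[:, None]).unsqueeze(-1)
        return (x * mask).sum(dim=1) / lengths[:, None].to(x.dtype)
```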