Abstract:Text-to-speech (TTS) and voice conversion (VC) are two different tasks that both aim to generate high-quality speech, but from different input modalities. Due to their similarity, this paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time. The model is based on the assumption that speech can be decoupled into three independent components: content information, speaker information, and prosody information. Both TTS and VC can be regarded as mining these three kinds of information from the input and then reconstructing the speech. For TTS, the content information is derived from the text, while in VC it is derived from the source speech, so the two tasks share all modules except the speech content extraction module. We apply vector quantization and a domain constraint to bridge the gap between the content domains of TTS and VC. Objective and subjective evaluations show that by combining the two tasks, TTS obtains better speaker modeling ability while VC achieves impressive speech content decoupling capability.
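To make the shared-module idea concrete, here is a minimal, hypothetical PyTorch sketch of the framework the abstract describes: TTS and VC differ only in how content features are extracted, while a vector-quantized content space, a speaker encoder, a prosody encoder, and the decoder are shared. All module names, sizes, and the straight-through VQ are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=256, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                    # z: (B, T, D)
        flat = z.reshape(-1, z.size(-1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(-1)
        q = self.codebook(idx).view_as(z)
        # straight-through estimator so gradients flow back to the encoder
        return z + (q - z).detach()

class UnifySpeechSketch(nn.Module):
    def __init__(self, n_phones=100, n_mels=80, dim=128):
        super().__init__()
        self.text_content_enc = nn.Embedding(n_phones, dim)               # TTS-only branch
        self.speech_content_enc = nn.GRU(n_mels, dim, batch_first=True)   # VC-only branch
        self.vq = VectorQuantizer(dim=dim)                                 # shared content space
        self.speaker_enc = nn.GRU(n_mels, dim, batch_first=True)          # shared
        self.prosody_enc = nn.GRU(n_mels, dim, batch_first=True)          # shared
        self.decoder = nn.Linear(3 * dim, n_mels)                          # shared

    def forward(self, mode, text=None, src_mel=None, ref_mel=None):
        if mode == "tts":
            content = self.text_content_enc(text)            # content from text
        else:                                                 # "vc"
            content, _ = self.speech_content_enc(src_mel)    # content from source speech
        content = self.vq(content)                           # VQ bridges the two content domains
        spk = self.speaker_enc(ref_mel)[1][-1]               # utterance-level speaker vector
        pros, _ = self.prosody_enc(ref_mel)                  # frame-level prosody
        T = content.size(1)
        spk = spk.unsqueeze(1).expand(-1, T, -1)
        pros = nn.functional.interpolate(pros.transpose(1, 2), size=T).transpose(1, 2)
        return self.decoder(torch.cat([content, spk, pros], dim=-1))
```

In this sketch the vector quantizer is what forces text-derived and speech-derived content features into the same discrete space, which is the role the abstract assigns to VQ and the domain constraint.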
Abstract:Text-based speech editing allows users to edit speech by intuitively cutting, copying, and pasting text to speed up the process of editing speech. In previous work, CampNet (context-aware mask prediction network) was proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper aims at a new task: adding an emotional effect to the edited speech during text-based speech editing to make the generated speech more expressive. To achieve this, we propose Emo-CampNet (emotion CampNet), which provides the option of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech. Firstly, we propose an end-to-end emotion-selectable text-based speech editing model. The key idea of the model is to control the emotion of the generated speech by introducing additional emotion attributes into the context-aware mask prediction network. Secondly, to prevent the emotion of the generated speech from being interfered with by the emotional components in the original speech, a neutral content generator is proposed to remove the emotion from the original speech, optimized within a generative adversarial framework. Thirdly, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set, which enables the model to edit unseen speakers' speech. The experimental results show that 1) Emo-CampNet can effectively control the emotion of the generated speech during text-based speech editing and can edit unseen speakers' speech; 2) detailed ablation experiments further prove the effectiveness of the emotional selectivity and the data augmentation methods. The demo page is available at https://hairuo55.github.io/Emo-CampNet/
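A minimal sketch of the emotion-conditioning idea described above: the context-aware mask prediction backbone is additionally conditioned on an emotion label embedding, so the reconstructed (edited) region follows the requested emotion. The backbone interface, embedding sizes, and fusion by addition are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class EmotionConditionedEditor(nn.Module):
    def __init__(self, campnet_backbone, n_emotions=5, emo_dim=64, n_mels=80):
        super().__init__()
        self.backbone = campnet_backbone              # context-aware mask predictor (assumed given)
        self.emotion_emb = nn.Embedding(n_emotions, emo_dim)
        self.fuse = nn.Linear(emo_dim, n_mels)

    def forward(self, text, masked_mel, emotion_id):
        # broadcast the emotion embedding over time and add it to the
        # (ideally neutralized) speech context before mask prediction
        emo = self.fuse(self.emotion_emb(emotion_id)).unsqueeze(1)   # (B, 1, n_mels)
        return self.backbone(text, masked_mel + emo)
```

In the full system, the abstract's neutral content generator would first strip emotion from the unmasked context, so that only the injected emotion attribute determines the emotion of the edited region.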
Abstract:Current end-to-end code-switching text-to-speech (TTS) systems can already generate high-quality speech mixing two languages in the same utterance when trained on a single-speaker bilingual corpus. When the speakers of the bilingual corpora are different, however, the naturalness and consistency of code-switching TTS are poor. The cross-lingual embedding layer structure we propose relates similar syllables across languages, thus improving the naturalness and consistency of the generated speech. End-to-end code-switching TTS also suffers from prosody instability when synthesizing paragraph-level text. The text enhancement method we propose adds prosodic information and sentence-level context information to the input, thus improving the prosody stability for paragraph text. Experimental results demonstrate the effectiveness of the proposed methods in terms of naturalness, consistency, and prosody stability. In addition to Mandarin and English, we also apply these methods to Shanghainese and Cantonese corpora, proving that the proposed methods can be extended to other languages to build end-to-end code-switching TTS systems.
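As a rough sketch of one way to realize the cross-lingual embedding idea, the snippet below places both languages' symbols in a single embedding table and adds a small auxiliary loss that pulls together embeddings of syllable pairs judged similar across languages. The shared table, offset indexing, pair list, and cosine loss are illustrative assumptions, not the paper's exact structure.

```python
import torch
import torch.nn as nn

class CrossLingualEmbedding(nn.Module):
    def __init__(self, n_symbols_lang_a, n_symbols_lang_b, dim=256):
        super().__init__()
        # one table shared by both languages, addressed with an offset
        self.table = nn.Embedding(n_symbols_lang_a + n_symbols_lang_b, dim)
        self.offset_b = n_symbols_lang_a

    def forward(self, ids, is_lang_b):
        # ids: symbol indices; is_lang_b: boolean mask marking language-B symbols
        return self.table(ids + is_lang_b.long() * self.offset_b)

    def similarity_loss(self, pairs_a, pairs_b):
        # pairs_a / pairs_b: indices of syllables considered similar across languages
        ea = self.table(pairs_a)
        eb = self.table(pairs_b + self.offset_b)
        return (1.0 - nn.functional.cosine_similarity(ea, eb, dim=-1)).mean()
```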
Abstract:Recently, pioneering research has proposed a large number of acoustic features (log power spectrogram, linear frequency cepstral coefficients, constant Q cepstral coefficients, etc.) for audio deepfake detection, obtaining good performance and showing that different subbands contribute differently to audio deepfake detection. However, these works lack an explanation of the specific information carried by each subband, and such features also lose information such as phase. Inspired by the mechanism of speech synthesis, in which fundamental frequency (F0) information is used to improve the quality of synthetic speech, we note that the F0 of synthetic speech is still overly smooth and differs significantly from that of real speech. F0 is therefore expected to serve as important information for discriminating between bona fide and fake speech, but it cannot be used directly because of its irregular distribution. Instead, the frequency band containing most of the F0 is selected as the input feature. Meanwhile, to make full use of the phase and full-band information, we also propose to use real and imaginary spectrogram features as complementary input features and to model the disjoint subbands separately. Finally, the results of the F0, real, and imaginary spectrogram features are fused. Experimental results on the ASVspoof 2019 LA dataset show that our proposed system is very effective for the audio deepfake detection task, achieving an equal error rate (EER) of 0.43%, which surpasses almost all other systems.
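The snippet below is a minimal NumPy sketch of the kind of front-end described above: complex STFT real/imaginary spectrograms plus a low-frequency band that covers most of the F0 range. The 0-400 Hz cut-off and the STFT settings are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def stft_features(wav, sr=16000, n_fft=512, hop=160, f0_band_hz=400.0):
    # frame the signal and apply a Hann window
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=n_fft, axis=-1)       # (frames, n_fft//2 + 1), complex

    real_spec = spec.real.T                             # (freq_bins, frames), keeps phase info
    imag_spec = spec.imag.T
    # keep only the bins below f0_band_hz as the "F0 subband" feature
    bin_hz = sr / n_fft
    n_f0_bins = int(f0_band_hz / bin_hz) + 1
    f0_band = np.abs(spec[:, :n_f0_bins]).T             # magnitude of the low band
    return real_spec, imag_spec, f0_band
```

Selecting the band rather than an explicit F0 track sidesteps the irregular F0 distribution while still exposing the over-smoothed pitch behavior of synthetic speech to the classifier.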
Abstract:Traditional vocoders have the advantages of high synthesis efficiency, strong interpretability, and speech editability, while neural vocoders have the advantage of high synthesis quality. To combine the advantages of both, and inspired by the traditional deterministic plus stochastic model, this paper proposes a novel neural vocoder named NeuralDPS, which retains high speech quality while achieving high synthesis efficiency and noise controllability. Firstly, the framework contains four modules: a deterministic source module, a stochastic source module, a neural V/UV decision module, and a neural filter module. The only input required by the vocoder is the spectral parameters, which avoids errors caused by estimating additional parameters such as F0. Secondly, to address the fact that different frequency bands may contain different proportions of deterministic and stochastic components, a multiband excitation strategy is used to generate a more accurate excitation signal and reduce the neural filter's burden. Thirdly, a method to control the noise components of speech is proposed, so that the signal-to-noise ratio (SNR) of speech can be adjusted easily. Objective and subjective experimental results show that our proposed NeuralDPS vocoder obtains performance similar to the WaveNet vocoder while generating waveforms at least 280 times faster, and it is also 28% faster than WaveGAN on a single CPU core. We have also verified through experiments that this method can effectively control the noise components in the predicted speech and adjust the SNR of speech. Examples of generated speech can be found at https://hairuo55.github.io/NeuralDPS.
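Below is a rough NumPy sketch of the multiband "deterministic plus stochastic" excitation idea: each frequency band mixes a periodic (deterministic) source and a noise (stochastic) source with its own ratio, and a global noise gain acts as a simple SNR control. The band edges, the sinusoidal deterministic source, and the mixing weights are illustrative assumptions, not NeuralDPS's actual modules.

```python
import numpy as np

def multiband_excitation(f0, n_samples, sr=16000,
                         band_edges=(0, 1000, 3000, 8000),
                         det_ratios=(0.9, 0.6, 0.2),
                         noise_gain=1.0):
    t = np.arange(n_samples) / sr
    deterministic = np.sin(2 * np.pi * f0 * t)          # periodic source at F0
    stochastic = np.random.randn(n_samples)             # white-noise source

    spec_d = np.fft.rfft(deterministic)
    spec_s = np.fft.rfft(stochastic)
    freqs = np.fft.rfftfreq(n_samples, 1.0 / sr)

    excitation = np.zeros_like(spec_d)                  # complex spectrum of the mixed excitation
    for lo, hi, r in zip(band_edges[:-1], band_edges[1:], det_ratios):
        band = (freqs >= lo) & (freqs < hi)
        # per-band mix of deterministic and noise components;
        # lowering noise_gain raises the SNR of the generated excitation
        excitation[band] = r * spec_d[band] + (1 - r) * noise_gain * spec_s[band]
    return np.fft.irfft(excitation, n=n_samples)
```

In the actual vocoder the excitation would then be shaped by the neural filter module; here the sketch only illustrates why a per-band mixing ratio and a noise gain give direct control over the noise components.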
Abstract:Audio deepfake detection is an emerging topic, which was included in ASVspoof 2021. However, recent shared tasks have not covered many real-life and challenging scenarios. The first Audio Deep synthesis Detection challenge (ADD) was organized to fill this gap. ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF), and audio fake game (FG). The LF track focuses on dealing with bona fide and fully fake utterances containing various real-world noises. The PF track aims to distinguish partially fake audio from real audio. The FG track is a rivalry game that includes two tasks: an audio generation task and an audio fake detection task. In this paper, we describe the datasets, evaluation metrics, and protocols. We also report major findings that reflect the recent advances in audio deepfake detection tasks.
Abstract:Text-based speech editors allow speech to be edited through intuitive cutting, copying, and pasting operations to speed up the process of editing speech. However, the major drawback of current systems is that the edited speech often sounds unnatural due to the cut-copy-paste operations. In addition, it is not obvious how to synthesize speech for a new word that does not appear in the transcript. This paper proposes a novel end-to-end text-based speech editing method called the context-aware mask prediction network (CampNet). The model simulates the text-based speech editing process by randomly masking part of the speech and then predicting the masked region by sensing the speech context. It resolves the unnatural prosody in the edited region and can synthesize speech corresponding to words unseen in the transcript. Secondly, for the possible operations of text-based speech editing, we design three text-based operations based on CampNet: deletion, insertion, and replacement. These operations cover various speech editing scenarios. Thirdly, to synthesize the speech corresponding to long text in insertion and replacement operations, a word-level autoregressive generation method is proposed. Fourthly, we propose a speaker adaptation method that uses only one sentence for CampNet and explore CampNet's few-shot learning ability, which provides a new idea for speech forgery tasks. Subjective and objective experiments on the VCTK and LibriTTS datasets show that the speech editing results based on CampNet are better than those of TTS technology, manual editing, and the VoCo method. We also conduct detailed ablation experiments to explore the effect of the CampNet structure on its performance. Finally, experiments show that speaker adaptation with only one sentence can further improve the naturalness of the speech. Examples of generated speech can be found at https://hairuo55.github.io/CampNet.
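A minimal sketch of the training objective described above: mask a random contiguous region of the mel-spectrogram and train a context-aware model to reconstruct only that region from the text and the surrounding speech context. The masking ratios, the zero-fill masking, and the L1 loss are illustrative assumptions; the model internals are left abstract.

```python
import torch
import torch.nn as nn

def random_region_mask(mel, min_ratio=0.1, max_ratio=0.4):
    """mel: (B, T, n_mels) -> masked mel and a boolean mask of shape (B, T)."""
    B, T, _ = mel.shape
    mask = torch.zeros(B, T, dtype=torch.bool, device=mel.device)
    for b in range(B):
        length = int(T * torch.empty(1).uniform_(min_ratio, max_ratio).item())
        start = torch.randint(0, T - length + 1, (1,)).item()
        mask[b, start:start + length] = True
    masked_mel = mel.masked_fill(mask.unsqueeze(-1), 0.0)
    return masked_mel, mask

def masked_reconstruction_loss(model, text, mel):
    masked_mel, mask = random_region_mask(mel)
    pred = model(text, masked_mel)               # model predicts a full-length mel
    # only the masked (edited) region contributes to the loss
    return nn.functional.l1_loss(pred[mask], mel[mask])
```

At editing time, the region corresponding to the deleted, inserted, or replaced words plays the role of the mask, so the same prediction mechanism fills it in with prosody that matches the surrounding context.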
Abstract:End-to-end singing voice synthesis (SVS) is attractive because it avoids pre-aligned data. However, the automatically learned alignment between the singing voice and the lyrics is difficult to match with the duration information in the musical score, which leads to model instability or even failure to synthesize voice. To learn accurate alignment information automatically, this paper proposes an end-to-end SVS framework named Singing-Tacotron. The main difference between the proposed framework and Tacotron is that the speech can be controlled significantly by the musical score's duration information. Firstly, we propose a global duration control attention mechanism for the SVS model, which can control each phoneme's duration. Secondly, a duration encoder is proposed to learn a set of global transition tokens from the musical score. These transition tokens help the attention mechanism decide whether to move to the next phoneme or to stay at each decoding step. Thirdly, to further improve the model's stability, a dynamic filter is designed to help the model overcome noise interference and pay more attention to local context information. Subjective and objective evaluations verify the effectiveness of the method. Furthermore, the role of the global transition tokens and the effect of duration control are explored. Examples of experiments can be found at https://hairuo55.github.io/SingingTacotron.
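A minimal sketch of the transition-token idea: at each decoder step, a transition probability derived from the score's durations decides whether the attention alignment stays on the current phoneme or moves forward by one. This follows a forward-attention-style update and is an illustrative assumption about the mechanism, not the paper's exact formulation.

```python
import torch

def duration_controlled_attention_step(alpha_prev, u_t):
    """alpha_prev: (B, N) alignment over N phonemes; u_t: (B, 1) transition prob in [0, 1]."""
    # shift the alignment one phoneme forward
    shifted = torch.zeros_like(alpha_prev)
    shifted[:, 1:] = alpha_prev[:, :-1]
    # stay on the current phoneme with probability (1 - u_t), move with probability u_t
    alpha = (1.0 - u_t) * alpha_prev + u_t * shifted
    return alpha / (alpha.sum(dim=-1, keepdim=True) + 1e-8)
```

Because u_t comes from the duration encoder rather than from the decoder alone, long notes keep the alignment on the same phoneme for many steps, which is how the score's duration information controls the generated singing.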
Abstract:Transducer-based models, such as RNN-Transducer and Transformer-Transducer, have achieved great success in speech recognition. A typical transducer model decodes the output sequence step by step, conditioned on the current acoustic state and the previously predicted tokens. Statistically, blank tokens account for nearly 90% of all tokens in the prediction results. Predicting these blank tokens consumes considerable computation and time, yet only the non-blank tokens appear in the final output sequence. Therefore, we propose a method named fast-skip regularization, which tries to align the blank positions predicted by the transducer with those predicted by a CTC model. During inference, the transducer model can predict the blank tokens in advance through a simple CTC projection layer, without the many complicated forward calculations of the transducer decoder, and then skip them, which greatly reduces computation and improves inference speed. All experiments are conducted on the public Chinese Mandarin dataset AISHELL-1. The results show that fast-skip regularization can indeed help the transducer model learn the blank position alignments. Moreover, inference with fast-skip can be sped up nearly 4 times with only a slight performance degradation.
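A minimal sketch of the inference-time idea described above: a lightweight CTC projection over the encoder output flags frames whose argmax is the blank token, and only the remaining frames go through the expensive transducer decoding loop. The function and variable names (ctc_proj, transducer_step) are illustrative placeholders, not the paper's interfaces.

```python
import torch

def fast_skip_decode(encoder_out, ctc_proj, transducer_step, blank_id=0):
    """encoder_out: (T, D) acoustic states for one utterance."""
    # cheap blank prediction with a single linear CTC projection
    ctc_logits = ctc_proj(encoder_out)                 # (T, vocab)
    keep = ctc_logits.argmax(dim=-1) != blank_id       # frames likely to emit a non-blank token

    hyp = []
    for frame in encoder_out[keep]:                    # skip the predicted-blank frames
        token = transducer_step(frame, hyp)            # expensive transducer decoder call
        if token != blank_id:
            hyp.append(token)
    return hyp
```

The regularization during training is what makes this safe: it encourages the transducer's blank positions to agree with the CTC projection, so skipping CTC-predicted blanks rarely discards frames the transducer would have turned into tokens.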
Abstract:Autoregressive (AR) models, such as attention-based encoder-decoder models and RNN-Transducer, have achieved great success in speech recognition. They predict the output sequence conditioned on the previous tokens and the acoustic encoded states, which is inefficient on GPUs. Non-autoregressive (NAR) models remove the temporal dependency between output tokens and predict the entire output sequence in one step or just a few steps. However, NAR models still face two major problems. On the one hand, there is still a large performance gap between NAR models and the advanced AR models. On the other hand, most NAR models are difficult to train and to converge. To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT), which improves the performance and accelerates the convergence of the NAR model by learning prior knowledge from a parameter-sharing AR model. Furthermore, we introduce a two-stage method into the inference process, which greatly improves the model's performance. All experiments are conducted on the public Chinese Mandarin dataset AISHELL-1. The results show that TSNAT can achieve performance competitive with the AR model and outperform many complicated NAR models.
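The snippet below is a minimal, hypothetical sketch of a two-stage (two-pass) inference scheme in the spirit described above: a first non-autoregressive pass produces an initial token sequence in parallel, and a second pass re-decodes with the first-pass hypothesis as decoder input to refine it. It illustrates the general idea only and is not the paper's exact procedure; nar_decoder and its interface are assumed names.

```python
import torch

def two_stage_decode(encoder_out, nar_decoder, blank_id=0):
    """encoder_out: (T, D) acoustic states for one utterance."""
    # Stage 1: predict every output position in parallel (no token dependency)
    first_logits = nar_decoder(encoder_out, prev_tokens=None)
    first_pass = first_logits.argmax(dim=-1)

    # Stage 2: feed the first-pass hypothesis back as decoder input so the
    # model can revise each position using the rest of the sequence as context
    second_logits = nar_decoder(encoder_out, prev_tokens=first_pass)
    hyp = second_logits.argmax(dim=-1)
    return hyp[hyp != blank_id]                        # drop blank/padding positions
```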