
Ruibo Fu


Text Enhancement for Paragraph Processing in End-to-End Code-switching TTS

Oct 20, 2022
Chunyu Qiang, Jianhua Tao, Ruibo Fu, Zhengqi Wen, Jiangyan Yi, Tao Wang, Shiming Wang

Current end-to-end code-switching Text-to-Speech (TTS) systems can already generate high-quality speech mixing two languages in the same utterance when trained on a single-speaker bilingual corpus. When the speakers of the bilingual corpora differ, however, the naturalness and consistency of the code-switching TTS degrade. The cross-lingual embedding layer structure we propose relates similar syllables across languages, thus improving the naturalness and consistency of the generated speech. End-to-end code-switching TTS also suffers from prosody instability when synthesizing paragraph-level text. The text enhancement method we propose enriches the input with prosodic information and sentence-level context, thus improving the prosody stability of paragraph text. Experimental results demonstrate the effectiveness of the proposed methods in terms of naturalness, consistency, and prosody stability. In addition to Mandarin and English, we also apply these methods to Shanghainese and Cantonese corpora, showing that the proposed methods can be extended to other languages to build end-to-end code-switching TTS systems.
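
As a rough illustration of the cross-lingual embedding idea (not the authors' implementation), the sketch below gives each token a language-specific embedding plus a shared embedding looked up by a hypothetical cross-lingual syllable ID, so that similar syllables in different languages share parameters. All class names, vocabulary sizes, and IDs are illustrative assumptions.

```python
# A minimal sketch, assuming tokens carry both a per-language ID and a
# hypothetical cross-lingual syllable-class ID; not the paper's exact model.
import torch
import torch.nn as nn

class CrossLingualEmbedding(nn.Module):
    def __init__(self, vocab_sizes, shared_vocab_size, dim=256):
        super().__init__()
        # one embedding table per language, e.g. {"zh": 412, "en": 70}
        self.lang_embs = nn.ModuleDict(
            {lang: nn.Embedding(n, dim) for lang, n in vocab_sizes.items()}
        )
        # shared table indexed by a cross-lingual syllable class
        self.shared_emb = nn.Embedding(shared_vocab_size, dim)

    def forward(self, lang, token_ids, shared_ids):
        # token_ids / shared_ids: LongTensor of shape (batch, seq_len)
        return self.lang_embs[lang](token_ids) + self.shared_emb(shared_ids)

# Usage: a Mandarin syllable and an acoustically similar English syllable
# that map to the same shared class end up with related embeddings.
emb = CrossLingualEmbedding({"zh": 412, "en": 70}, shared_vocab_size=200)
zh = emb("zh", torch.tensor([[3, 17]]), torch.tensor([[5, 9]]))
en = emb("en", torch.tensor([[8, 2]]), torch.tensor([[5, 9]]))
print(zh.shape, en.shape)  # torch.Size([1, 2, 256]) for both
```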

* Accepted by ISCSLP 2021 

An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era

Oct 06, 2022
Andreas Triantafyllopoulos, Björn W. Schuller, Gökçe İymen, Metin Sezgin, Xiangheng He, Zijiang Yang, Panagiotis Tzirakis, Shuo Liu, Silvan Mertes, Elisabeth André, Ruibo Fu, Jianhua Tao


Speech is the fundamental mode of human communication, and its synthesis has long been a core priority in human-computer interaction research. In recent years, machines have managed to master the art of generating speech that is understandable by humans. But the linguistic content of an utterance encompasses only a part of its meaning. Affect, or expressivity, has the capacity to turn speech into a medium capable of conveying intimate thoughts, feelings, and emotions -- aspects that are essential for engaging and naturalistic interpersonal communication. While the goal of imparting expressivity to synthesised utterances has so far remained elusive, following recent advances in text-to-speech synthesis, a paradigm shift is well under way in the fields of affective speech synthesis and conversion as well. Deep learning, as the technology which underlies most of the recent advances in artificial intelligence, is spearheading these efforts. In the present overview, we outline ongoing trends and summarise state-of-the-art approaches in an attempt to provide a comprehensive overview of this exciting field.

* Submitted to the Proceedings of IEEE 

System Fingerprints Detection for DeepFake Audio: An Initial Dataset and Investigation

Aug 21, 2022
Xinrui Yan, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Haoxin Ma, Zhengkun Tian, Ruibo Fu


Many effective attempts have been made at deepfake audio detection, but they can only distinguish real from fake. In many practical application scenarios, knowledge of which tool or algorithm generated the deepfake audio is also needed. This raises a question: can we detect the system fingerprints of deepfake audio? This paper therefore conducts a preliminary investigation into detecting the system fingerprints of deepfake audio. Experiments are conducted on deepfake audio datasets produced by five of the latest deep-learning speech synthesis systems. The results show that LFCC features are comparatively better suited to system fingerprint detection. Moreover, ResNet achieves the best detection results, outperforming the LCNN and x-vector based models. The t-SNE visualization shows that different speech synthesis systems generate distinct system fingerprints.
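
A simplified sketch of the fingerprinting pipeline described above, assuming 16 kHz audio: LFCC features are extracted with torchaudio and fed to a small CNN that predicts which of the five synthesis systems produced the clip. The tiny CNN stands in for the ResNet used in the paper, and all hyperparameters are illustrative.

```python
# Hedged sketch: LFCC front-end + small CNN classifier over system labels.
import torch
import torch.nn as nn
import torchaudio

N_SYSTEMS = 5  # five speech synthesis systems in the dataset

lfcc = torchaudio.transforms.LFCC(
    sample_rate=16000, n_lfcc=60,
    speckwargs={"n_fft": 512, "hop_length": 160},
)

# a small CNN as a stand-in for the ResNet classifier
classifier = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, N_SYSTEMS),
)

waveform = torch.randn(1, 16000 * 3)       # 3 s of dummy audio at 16 kHz
features = lfcc(waveform).unsqueeze(1)     # (batch, 1, n_lfcc, frames)
logits = classifier(features)              # (batch, N_SYSTEMS)
print(logits.argmax(dim=-1))               # predicted system fingerprint
```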

* 9 pages, 3 figures. arXiv admin note: text overlap with arXiv:2208.09646 

An Initial Investigation for Detecting Vocoder Fingerprints of Fake Audio

Aug 20, 2022
Xinrui Yan, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Haoxin Ma, Tao Wang, Shiming Wang, Ruibo Fu


Many effective attempts have been made at fake audio detection, but they only provide detection results and offer no countermeasures to curb the harm. In many related practical applications, knowledge of which model or algorithm generated the fake audio is also needed. We therefore propose a new problem: detecting the vocoder fingerprints of fake audio. Experiments are conducted on datasets synthesized by eight state-of-the-art vocoders, and we preliminarily explore suitable features and model architectures. The t-SNE visualization shows that different vocoders generate distinct vocoder fingerprints.
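
A hedged sketch of the t-SNE step mentioned in the abstract: per-utterance embeddings (random stand-ins here, not real vocoder features) are projected to 2-D with scikit-learn and coloured by which of the eight vocoders generated them.

```python
# Illustrative only: random embeddings stand in for learned vocoder features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

n_vocoders, per_vocoder, dim = 8, 100, 256
embeddings = np.random.randn(n_vocoders * per_vocoder, dim)   # stand-in features
labels = np.repeat(np.arange(n_vocoders), per_vocoder)        # vocoder index

# project to 2-D; clusters would indicate distinct vocoder fingerprints
points = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=5)
plt.title("Vocoder fingerprints (t-SNE of utterance embeddings)")
plt.savefig("vocoder_fingerprints.png")
```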

* Accepted by ACM Multimedia 2022 Workshop: First International Workshop on Deepfake Detection for Audio Multimedia 

Fully Automated End-to-End Fake Audio Detection

Aug 20, 2022
Chenglong Wang, Jiangyan Yi, Jianhua Tao, Haiyang Sun, Xun Chen, Zhengkun Tian, Haoxin Ma, Cunhang Fan, Ruibo Fu


Existing fake audio detection systems often rely on expert experience to design acoustic features or to manually tune the hyperparameters of the network structure. However, such manual adjustment has a considerable influence on the results, and finding the best set of parameters by hand is practically impossible. This paper therefore proposes a fully automated end-to-end fake audio detection method. We first use the wav2vec pre-trained model to obtain a high-level representation of the speech. For the network structure, we use a modified version of differentiable architecture search (DARTS) named light-DARTS, which learns deep speech representations while automatically learning and optimizing complex neural structures composed of convolutional operations and residual blocks. Experimental results on the ASVspoof 2019 LA dataset show that our proposed system achieves an equal error rate (EER) of 1.08%, which outperforms the state-of-the-art single system.
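
The paper reports an equal error rate (EER); the snippet below shows one standard way to compute EER from detection scores, using dummy labels and scores rather than actual system outputs.

```python
# EER: the operating point where false-acceptance rate equals false-rejection rate.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # point where FPR and FNR cross
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([1, 1, 1, 0, 0, 0])               # 1 = bona fide, 0 = spoof
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])   # dummy detector scores
print(f"EER = {compute_eer(labels, scores):.2%}")
```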


NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation

Mar 05, 2022
Tao Wang, Ruibo Fu, Jiangyan Yi, Jianhua Tao, Zhengqi Wen


Traditional vocoders offer high synthesis efficiency, strong interpretability, and speech editability, while neural vocoders offer high synthesis quality. To combine the advantages of both, and inspired by the traditional deterministic plus stochastic model, this paper proposes a novel neural vocoder named NeuralDPS that retains high speech quality while achieving high synthesis efficiency and noise controllability. First, the framework contains four modules: a deterministic source module, a stochastic source module, a neural V/UV decision module, and a neural filter module. The only input the vocoder requires is the spectral parameters, which avoids errors caused by estimating additional parameters such as F0. Second, to address the fact that different frequency bands may contain different proportions of deterministic and stochastic components, a multiband excitation strategy is used to generate a more accurate excitation signal and reduce the burden on the neural filter. Third, a method for controlling the noise components of speech is proposed, so that the signal-to-noise ratio (SNR) of the speech can be adjusted easily. Objective and subjective experimental results show that the proposed NeuralDPS vocoder obtains performance similar to WaveNet while generating waveforms at least 280 times faster than the WaveNet vocoder; it is also 28% faster than WaveGAN on a single CPU core. We have further verified through experiments that this method can effectively control the noise components in the predicted speech and adjust its SNR. Examples of generated speech can be found at https://hairuo55.github.io/NeuralDPS.
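
A hedged sketch of the noise-control idea: if a vocoder exposes the deterministic (quasi-periodic) and stochastic (noise) components separately, the noise can be rescaled to hit a target SNR before the two are summed. The signals and the mixing function below are synthetic stand-ins, not the NeuralDPS modules.

```python
# Illustrative SNR control over a deterministic-plus-stochastic decomposition.
import numpy as np

sr = 16000
t = np.arange(sr) / sr
deterministic = 0.5 * np.sin(2 * np.pi * 220 * t)   # stand-in periodic part
stochastic = 0.05 * np.random.randn(sr)             # stand-in noise part

def mix_at_snr(det, sto, target_snr_db):
    """Scale the stochastic part so 10*log10(P_det / P_sto) equals the target SNR."""
    p_det = np.mean(det ** 2)
    p_sto = np.mean(sto ** 2)
    gain = np.sqrt(p_det / (p_sto * 10 ** (target_snr_db / 10)))
    return det + gain * sto

clean_ish = mix_at_snr(deterministic, stochastic, target_snr_db=30)  # little noise
breathy = mix_at_snr(deterministic, stochastic, target_snr_db=10)    # more noise
print(clean_ish.shape, breathy.shape)
```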

* 15 pages, 12 figures; Accepted to TASLP. Demo page https://hairuo55.github.io/NeuralDPS. arXiv admin note: text overlap with arXiv:1906.09573 by other authors 

ADD 2022: the First Audio Deep Synthesis Detection Challenge

Feb 26, 2022
Jiangyan Yi, Ruibo Fu, Jianhua Tao, Shuai Nie, Haoxin Ma, Chenglong Wang, Tao Wang, Zhengkun Tian, Ye Bai, Cunhang Fan, Shan Liang, Shiming Wang, Shuai Zhang, Xinrui Yan, Le Xu, Zhengqi Wen, Haizhou Li, Zheng Lian, Bin Liu


Audio deepfake detection is an emerging topic that was included in ASVspoof 2021, yet recent shared tasks have not covered many real-life and challenging scenarios. The first Audio Deep synthesis Detection challenge (ADD) was organized to fill this gap. ADD 2022 includes three tracks: low-quality fake audio detection (LF), partially fake audio detection (PF), and audio fake game (FG). The LF track focuses on distinguishing bona fide utterances from fully fake utterances containing various real-world noises; the PF track aims to distinguish partially fake audio from real audio; and the FG track is a rivalry game comprising two tasks, audio generation and audio fake detection. In this paper, we describe the datasets, evaluation metrics, and protocols, and report major findings that reflect recent advances in audio deepfake detection.

* Accepted by ICASSP 2022 

CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing

Feb 21, 2022
Tao Wang, Jiangyan Yi, Ruibo Fu, Jianhua Tao, Zhengqi Wen


Text-based speech editors allow speech to be edited through intuitive cutting, copying, and pasting operations, speeding up the editing process. The major drawback of current systems, however, is that the edited speech often sounds unnatural because of the cut-copy-paste operations, and it is not obvious how to synthesize audio for a new word that does not appear in the transcript. This paper proposes a novel end-to-end text-based speech editing method called the context-aware mask prediction network (CampNet). The model simulates the text-based speech editing process by randomly masking part of the speech and then predicting the masked region from the speech context, which resolves unnatural prosody in the edited region and allows synthesis of speech for words unseen in the transcript. Secondly, for the operations required in text-based speech editing, we design three operations based on CampNet: deletion, insertion, and replacement, which cover the typical speech editing situations. Thirdly, to synthesize the speech corresponding to long text in insertion and replacement operations, a word-level autoregressive generation method is proposed. Fourthly, we propose a speaker adaptation method for CampNet that uses only one sentence and explore CampNet's few-shot learning ability, which provides a new perspective on speech forgery tasks. Subjective and objective experiments on the VCTK and LibriTTS datasets show that the speech editing results of CampNet are better than those of TTS technology, manual editing, and the VoCo method. We also conduct detailed ablation experiments to explore the effect of the CampNet structure on its performance. Finally, experiments show that speaker adaptation with only one sentence can further improve the naturalness of the speech. Examples of generated speech can be found at https://hairuo55.github.io/CampNet.
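
A minimal sketch of the training-time masking idea, under the assumption that the edited region is represented as a contiguous span of mel-spectrogram frames; the shapes, mask ratio, and helper name are illustrative, not CampNet's actual implementation.

```python
# Randomly mask a span of mel frames; a context-aware decoder would then
# predict the span from the unmasked context and the text.
import torch

def mask_random_span(mel, mask_ratio=0.15, mask_value=0.0):
    """mel: (frames, n_mels). Returns the masked mel and the span boundaries."""
    frames = mel.size(0)
    span = max(1, int(frames * mask_ratio))
    start = torch.randint(0, frames - span + 1, (1,)).item()
    masked = mel.clone()
    masked[start:start + span] = mask_value
    return masked, (start, start + span)

mel = torch.randn(400, 80)                 # ~4 s of 80-dim mel frames (dummy)
masked_mel, (lo, hi) = mask_random_span(mel)
reconstruction_target = mel[lo:hi]         # what the model learns to predict
print(masked_mel.shape, reconstruction_target.shape)
```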

* under review, 14 pages, 14 figures, demo page is available at https://hairuo55.github.io/CampNet 

Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis

Feb 16, 2022
Tao Wang, Ruibo Fu, Jiangyan Yi, Jianhua Tao, Zhengqi Wen


End-to-end singing voice synthesis (SVS) is attractive because it avoids pre-aligned data. However, the automatically learned alignment between the singing voice and the lyrics often fails to match the duration information in the musical score, which leads to model instability or even failure to synthesize the voice. To learn accurate alignment information automatically, this paper proposes an end-to-end SVS framework named Singing-Tacotron. The main difference from Tacotron is that the generated speech can be controlled significantly by the duration information in the musical score. Firstly, we propose a global duration control attention mechanism for the SVS model, which controls each phoneme's duration. Secondly, a duration encoder is proposed to learn a set of global transition tokens from the musical score; these transition tokens help the attention mechanism decide whether to move to the next phoneme or stay at the current one at each decoding step. Thirdly, to further improve the model's stability, a dynamic filter is designed to help the model overcome noise interference and pay more attention to local context information. Subjective and objective evaluations verify the effectiveness of the method, and the role of the global transition tokens and the effect of duration control are further explored. Examples from the experiments can be found at https://hairuo55.github.io/SingingTacotron.
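
A hedged, simplified sketch of the transition-token idea in the spirit of forward attention: at each decoder step a scalar u_t in [0, 1], which the paper derives from the musical score's duration information, controls how much attention mass moves from the current phoneme to the next. The update rule and the values below are illustrative assumptions, not the paper's exact mechanism.

```python
# Toy transition-controlled attention: small u_t holds the current phoneme,
# large u_t advances to the next one.
import torch

def step_attention(alpha_prev, u_t):
    """alpha_prev: (n_phones,) attention weights; u_t: transition probability."""
    stay = (1.0 - u_t) * alpha_prev
    move = u_t * torch.roll(alpha_prev, shifts=1)
    move[0] = 0.0                       # no mass wraps around to the first phoneme
    alpha = stay + move
    return alpha / alpha.sum()

alpha = torch.zeros(6)
alpha[0] = 1.0                          # start on the first phoneme
for u_t in [0.1, 0.1, 0.9, 0.9, 0.2]:   # hold, hold, advance, advance, hold
    alpha = step_attention(alpha, torch.tensor(u_t))
    print([round(float(a), 2) for a in alpha])
```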

* 5 pages, 7 figures 