Guanglai Gao

Betray Oneself: A Novel Audio DeepFake Detection Model via Mono-to-Stereo Conversion

May 25, 2023
Rui Liu, Jinhua Zhang, Guanglai Gao, Haizhou Li

Audio Deepfake Detection (ADD) aims to detect fake audio generated by text-to-speech (TTS), voice conversion (VC), replay, etc., and is an emerging topic. Traditional approaches take a mono signal as input and focus on robust feature extraction and effective classifier design. However, the dual-channel stereo information in the audio signal also carries important cues for deepfake detection, which has not been studied in prior work. In this paper, we propose a novel ADD model, termed M2S-ADD, that attempts to discover audio authenticity cues during the mono-to-stereo conversion process. We first project the mono signal to stereo using a pretrained stereo synthesizer, then employ a dual-branch neural architecture to process the left and right channel signals, respectively. In this way, we effectively reveal the artifacts in fake audio and thus improve ADD performance. Experiments on the ASVspoof2019 database show that M2S-ADD outperforms all baselines that take mono input. We release the source code at \url{https://github.com/AI-S2-Lab/M2S-ADD}.
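A minimal sketch of the dual-branch idea described above, assuming per-channel acoustic features have already been extracted after mono-to-stereo conversion; the encoder type, layer sizes, and classifier are illustrative assumptions, not the paper's architecture:

```python
# Sketch of a dual-branch audio deepfake classifier over stereo channels.
import torch
import torch.nn as nn

class DualBranchADD(nn.Module):
    def __init__(self, feat_dim=60, hidden=128):
        super().__init__()
        # One encoder per stereo channel (shapes are illustrative only).
        self.left_enc = nn.GRU(feat_dim, hidden, batch_first=True)
        self.right_enc = nn.GRU(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # bona fide vs. spoof
        )

    def forward(self, left_feats, right_feats):
        # left_feats / right_feats: (batch, frames, feat_dim) features from
        # the two channels produced by the mono-to-stereo synthesizer.
        _, h_l = self.left_enc(left_feats)
        _, h_r = self.right_enc(right_feats)
        fused = torch.cat([h_l[-1], h_r[-1]], dim=-1)
        return self.classifier(fused)
```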

* To appear at InterSpeech2023 

Explicit Intensity Control for Accented Text-to-speech

Oct 27, 2022
Rui Liu, Haolin Zuo, De Hu, Guanglai Gao, Haizhou Li

Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). How to control the intensity of the accent during TTS is an interesting research direction that has attracted increasing attention. Recent work designs a speaker-adversarial loss to disentangle speaker and accent information, and then adjusts the loss weight to control accent intensity. However, such a control method lacks interpretability, and there is no direct correlation between the controlling factor and natural accent intensity. To this end, this paper proposes a new, intuitive, and explicit accent intensity control scheme for accented TTS. Specifically, we first extract a posterior probability, called the ``goodness of pronunciation (GoP)'', from an L1 speech recognition model to quantify phoneme-level accent intensity for accented speech, then design a FastSpeech2-based TTS model, named Ai-TTS, that takes the accent intensity expression into account during speech generation. Experiments show that our method outperforms the baseline model in terms of accent rendering and intensity control.
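As a rough illustration of GoP-style scoring, the sketch below averages the log-posterior of the canonical phoneme, normalized by the best competing phoneme, over its aligned frames; the exact normalization used in the paper may differ, and the posteriors and alignment are assumed to come from an L1 ASR model and a forced aligner:

```python
# Sketch of a frame-averaged goodness-of-pronunciation (GoP) score.
import numpy as np

def goodness_of_pronunciation(posteriors, target_phoneme_ids, alignment):
    """posteriors: (frames, num_phonemes) softmax output of the L1 ASR model.
    target_phoneme_ids: canonical phoneme id for each aligned segment.
    alignment: list of (start_frame, end_frame) pairs, one per segment."""
    scores = []
    for pid, (start, end) in zip(target_phoneme_ids, alignment):
        seg = posteriors[start:end, :]
        # Log-posterior of the canonical phoneme relative to the best
        # competing phoneme, averaged over the segment's frames.
        ratio = np.log(seg[:, pid] + 1e-8) - np.log(seg.max(axis=1) + 1e-8)
        scores.append(float(ratio.mean()))
    return scores  # lower score -> stronger accent on that phoneme
```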

* 5 pages, 3 figures. Submitted to ICASSP 2023. arXiv admin note: text overlap with arXiv:2209.10804 

FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis

Oct 27, 2022
Yifan Hu, Rui Liu, Guanglai Gao, Haizhou Li

Conversational text-to-speech (TTS) aims to synthesize an utterance with the right linguistic and affective prosody in a conversational context. The correlation between the current utterance and the dialogue history at the utterance level has been used to improve the expressiveness of synthesized speech. However, fine-grained, word-level information in the dialogue history also has an important impact on the prosodic expression of an utterance, and it has not been well studied in prior work. Therefore, we propose a novel expressive conversational TTS model, termed FCTalker, that learns fine- and coarse-grained context dependencies simultaneously during speech generation. Specifically, FCTalker includes fine- and coarse-grained encoders to exploit word- and utterance-level context dependencies. To model the word-level dependencies between an utterance and its dialogue history, the fine-grained dialogue encoder is built on top of a dialogue BERT model. Experimental results show that the proposed method outperforms all baselines and generates more expressive speech that is contextually appropriate. We release the source code at: https://github.com/walker-hyf/FCTalker.
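A minimal sketch of fusing word-level and utterance-level dialogue context into one conditioning vector for the TTS decoder; the module names, dimensions, and fusion strategy are illustrative assumptions rather than FCTalker's actual layers:

```python
# Sketch of combining fine-grained (word-level) and coarse-grained
# (utterance-level) context representations.
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    def __init__(self, word_dim=768, utt_dim=256, out_dim=256):
        super().__init__()
        self.fine_enc = nn.GRU(word_dim, out_dim, batch_first=True)
        self.coarse_proj = nn.Linear(utt_dim, out_dim)

    def forward(self, word_embeds, utt_embeds):
        # word_embeds: (batch, words, word_dim), e.g. from a dialogue BERT model
        # utt_embeds:  (batch, utt_dim) summary of the dialogue history
        _, h = self.fine_enc(word_embeds)
        fine = h[-1]                        # (batch, out_dim)
        coarse = self.coarse_proj(utt_embeds)
        # Concatenated context vector that conditions the TTS decoder.
        return torch.cat([fine, coarse], dim=-1)
```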

* 5 pages, 4 figures, 1 table. Submitted to ICASSP 2023. We release the source code at: https://github.com/walker-hyf/FCTalker 

Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities

Oct 27, 2022
Haolin Zuo, Rui Liu, Jinming Zhao, Guanglai Gao, Haizhou Li

Multimodal emotion recognition leverages complementary information across modalities to improve performance. However, we cannot guarantee that data from all modalities are always available in practice. In studies that predict missing data across modalities, the inherent difference between heterogeneous modalities, namely the modality gap, presents a challenge. To address this, we propose an invariant-feature-based missing modality imagination network (IF-MMIN), which includes two novel mechanisms: 1) an invariant feature learning strategy based on the central moment discrepancy (CMD) distance under the full-modality scenario; 2) an invariant-feature-based imagination module (IF-IM) that alleviates the modality gap during missing-modality prediction, thus improving the robustness of the multimodal joint representation. Comprehensive experiments on the benchmark IEMOCAP dataset demonstrate that the proposed model outperforms all baselines and consistently improves overall emotion recognition performance under uncertain missing-modality conditions. We release the code at: https://github.com/ZhuoYulang/IF-MMIN.
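For reference, the central moment discrepancy (CMD) distance that the invariant feature learning strategy is based on can be sketched as below; the number of moments and the normalization bounds are illustrative choices, not the paper's settings:

```python
# Sketch of the central moment discrepancy (CMD) between two feature batches.
import torch

def cmd(x, y, k_moments=5, a=0.0, b=1.0):
    """x, y: (batch, dim) feature matrices from two modalities."""
    mx, my = x.mean(dim=0), y.mean(dim=0)
    # First term: distance between the means.
    dist = torch.norm(mx - my, p=2) / (b - a)
    cx, cy = x - mx, y - my
    # Higher-order terms: distance between central moments of order 2..K.
    for k in range(2, k_moments + 1):
        mom_x = (cx ** k).mean(dim=0)
        mom_y = (cy ** k).mean(dim=0)
        dist = dist + torch.norm(mom_x - mom_y, p=2) / (b - a) ** k
    return dist
```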

* 5 pages, 3 figures, 1 table. Submitted to ICASSP 2023. We release the code at: https://github.com/ZhuoYulang/IF-MMIN 

A Deep Investigation of RNN and Self-attention for the Cyrillic-Traditional Mongolian Bidirectional Conversion

Sep 24, 2022
Muhan Na, Rui Liu, Feilong, Guanglai Gao

Cyrillic and Traditional Mongolian are the two main members of the Mongolian writing system. The Cyrillic-Traditional Mongolian Bidirectional Conversion (CTMBC) task comprises two conversion processes: Cyrillic Mongolian to Traditional Mongolian (C2T) and Traditional Mongolian to Cyrillic Mongolian (T2C). Previous researchers adopted the traditional joint sequence model, since the CTMBC task is a natural sequence-to-sequence (Seq2Seq) modeling problem. Recent studies have shown that Recurrent Neural Network (RNN) and Self-attention (or Transformer) based encoder-decoder models achieve significant improvements in machine translation between major languages such as Mandarin, English, and French. However, it remains an open question whether CTMBC quality can be improved by utilizing RNN and Transformer models. To answer this question, this paper investigates the utility of these two powerful techniques for the CTMBC task, taking the agglutinative characteristics of the Mongolian language into account. We build encoder-decoder based CTMBC models on RNN and Transformer respectively and compare the different network configurations in depth. The experimental results show that both the RNN and Transformer models outperform the traditional joint sequence model, with the Transformer achieving the best performance. Compared with the joint sequence baseline, the Transformer reduces the word error rate (WER) by 5.72\% and 5.06\% for C2T and T2C, respectively.
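A minimal sketch of a character-level Transformer encoder-decoder for the conversion task, using PyTorch's built-in nn.Transformer; vocabulary sizes, depths, and head counts are illustrative and not the configurations compared in the paper:

```python
# Sketch of a character-level Transformer for script conversion.
import torch
import torch.nn as nn

class CharTransformer(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # src_ids: Cyrillic character ids; tgt_ids: Traditional Mongolian ids
        # (or the reverse for the T2C direction).
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        h = self.transformer(self.src_emb(src_ids), self.tgt_emb(tgt_ids),
                             tgt_mask=tgt_mask)
        return self.out(h)  # per-position logits over the target alphabet
```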

* Accepted at The 29th International Conference on Neural Information Processing (ICONIP 2022) 

MnTTS: An Open-Source Mongolian Text-to-Speech Synthesis Dataset and Accompanied Baseline

Sep 22, 2022
Yifan Hu, Pengkai Yin, Rui Liu, Feilong Bao, Guanglai Gao

This paper introduces a high-quality open-source text-to-speech (TTS) synthesis dataset for Mongolian, a low-resource language spoken by over 10 million people worldwide. The dataset, named MnTTS, consists of about 8 hours of transcribed audio recordings spoken by a 22-year-old professional female Mongolian announcer. It is the first publicly available dataset developed to promote Mongolian TTS applications in both academia and industry. In this paper, we share our experience by describing the dataset development procedures and the challenges we faced. To demonstrate the reliability of our dataset, we built a strong non-autoregressive baseline system based on the FastSpeech2 model and the HiFi-GAN vocoder, and evaluated it using the subjective mean opinion score (MOS) and real-time factor (RTF) metrics. Evaluation results show that the baseline system trained on our dataset achieves a MOS above 4 and an RTF of about $3.30\times10^{-1}$, which makes it applicable for practical use. The dataset, training recipe, and pretrained TTS models are freely available \footnote{\label{github}\url{https://github.com/walker-hyf/MnTTS}}.
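The real-time factor reported above is simply the synthesis wall-clock time divided by the duration of the generated audio; a minimal sketch follows, where `synthesize` is a placeholder for the FastSpeech2 + HiFi-GAN pipeline and the sample rate is an assumed value:

```python
# Sketch of a real-time factor (RTF) measurement for a TTS pipeline.
import time

def real_time_factor(synthesize, text, sample_rate=22050):
    start = time.perf_counter()
    waveform = synthesize(text)        # placeholder for the TTS + vocoder call
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate
    return elapsed / audio_seconds     # RTF < 1 means faster than real time
```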

* Accepted at the 2022 International Conference on Asian Language Processing (IALP2022) 

Controllable Accented Text-to-Speech Synthesis

Sep 22, 2022
Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li

Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). Accented TTS synthesis is challenging because L2 differs from L1 in both phonetic rendering and prosody pattern. Furthermore, there is no easy solution for controlling the accent intensity of an utterance. In this work, we propose a neural TTS architecture that allows us to control the accent and its intensity during inference. This is achieved through three novel mechanisms: 1) an accent variance adaptor to model the complex accent variance with three prosody controlling factors, namely pitch, energy, and duration; 2) an accent intensity modeling strategy to quantify the accent intensity; 3) a consistency constraint module to encourage the TTS system to render the expected accent intensity at a fine level. Experiments show that the proposed system attains superior performance to the baseline models in terms of accent rendering and intensity control. To the best of our knowledge, this is the first study of accented TTS synthesis with explicit intensity control.
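A minimal sketch of a variance adaptor that conditions pitch, energy, and duration prediction on a scalar accent-intensity value; the conditioning scheme and layer sizes are assumptions for illustration, not the paper's design:

```python
# Sketch of an accent variance adaptor conditioned on accent intensity.
import torch
import torch.nn as nn

class AccentVarianceAdaptor(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.predictors = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(hidden + 1, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))
            for name in ("pitch", "energy", "duration")
        })

    def forward(self, phoneme_hidden, intensity):
        # phoneme_hidden: (batch, phones, hidden); intensity: (batch, 1) in [0, 1]
        cond = intensity.unsqueeze(1).expand(-1, phoneme_hidden.size(1), -1)
        x = torch.cat([phoneme_hidden, cond], dim=-1)
        # One prediction per phoneme for each prosody controlling factor.
        return {name: pred(x).squeeze(-1) for name, pred in self.predictors.items()}
```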

* To be submitted for possible journal publication 

Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning

Jun 15, 2022
Rui Liu, Berrak Sisman, Björn Schuller, Guanglai Gao, Haizhou Li

Emotion classification of speech and assessment of emotion strength are required in applications such as emotional text-to-speech and voice conversion. An emotion attribute ranking function based on a Support Vector Machine (SVM) was previously proposed to predict emotion strength for an emotional speech corpus. However, the trained ranking function does not generalize to new domains, which limits the scope of applications, especially for out-of-domain or unseen speech. In this paper, we propose a data-driven deep learning model, StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech. This is achieved by fusing emotional data from various domains. We follow a multi-task learning network architecture that includes an acoustic encoder, a strength predictor, and an auxiliary emotion predictor. Experiments show that the emotion strength predicted by the proposed StrengthNet is highly correlated with ground-truth scores for both seen and unseen speech. We release the source code at: https://github.com/ttslr/StrengthNet.
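A minimal sketch of the multi-task layout described above, with a shared acoustic encoder feeding a strength regressor and an auxiliary emotion classifier; the encoder type, layer sizes, and number of emotion classes are illustrative assumptions:

```python
# Sketch of a shared-encoder, two-head network for strength and emotion.
import torch
import torch.nn as nn

class StrengthNetSketch(nn.Module):
    def __init__(self, n_mels=80, hidden=128, n_emotions=5):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        self.strength_head = nn.Linear(hidden, 1)       # strength in [0, 1]
        self.emotion_head = nn.Linear(hidden, n_emotions)

    def forward(self, mel):
        # mel: (batch, frames, n_mels) mel-spectrogram of the utterance
        _, h = self.encoder(mel)
        h = h[-1]
        strength = torch.sigmoid(self.strength_head(h)).squeeze(-1)
        return strength, self.emotion_head(h)  # regression + auxiliary logits
```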

* To appear in INTERSPEECH 2022. 5 pages, 4 figures. Substantial text overlap with arXiv:2110.03156 

Guided Training: A Simple Method for Single-channel Speaker Separation

Mar 26, 2021
Hao Li, Xueliang Zhang, Guanglai Gao

Deep learning has shown great potential for speech separation, especially for separating speech from non-speech. However, it encounters the permutation problem in multi-speaker separation, where both the target and the interference are speech. Permutation Invariant Training (PIT) was proposed to solve this problem by permuting the order of the multiple speakers. Another approach is to use anchor speech, a short recording of the target speaker, to model the speaker identity. In this paper, we propose a simple strategy for training a long short-term memory (LSTM) model that solves the permutation problem in speaker separation. Specifically, we insert a short recording of the target speaker at the beginning of a mixture as guide information, so the first speaker to appear is defined as the target. Thanks to its powerful sequence modeling capability, the LSTM can use its memory cells to track and separate the target speech from the interfering speech. Experimental results show that the proposed training strategy is effective for speaker separation.
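A minimal sketch of the guided input construction: a short anchor clip of the target speaker is prepended to the mixture so that the first speaker heard defines the separation target; the array handling and label construction are illustrative:

```python
# Sketch of building a guided (anchor + mixture) training pair.
import numpy as np

def build_guided_input(anchor, mixture, target):
    """anchor: short clean clip of the target speaker;
    mixture: target + interference waveform;
    target: clean target reference aligned with the mixture."""
    guided_mix = np.concatenate([anchor, mixture])
    # The reference keeps the anchor region as-is (it is already clean target
    # speech), followed by the clean target for the mixed region.
    guided_ref = np.concatenate([anchor, target])
    return guided_mix, guided_ref
```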

* 5 pages 

Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

Aug 11, 2020
Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both the Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
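A minimal sketch of the multi-task objective described above, combining a Mel-spectrum reconstruction loss with a phrase-break prediction loss; the loss form and the weighting factor are illustrative assumptions, not the paper's exact values:

```python
# Sketch of a joint Mel-spectrum + phrase-break training loss.
import torch
import torch.nn.functional as F

def multitask_loss(mel_pred, mel_target, break_logits, break_labels, lam=0.5):
    # mel_*: (batch, frames, n_mels); break_logits: (batch, tokens, n_classes);
    # break_labels: (batch, tokens) integer break classes per input token.
    mel_loss = F.l1_loss(mel_pred, mel_target)
    break_loss = F.cross_entropy(break_logits.transpose(1, 2), break_labels)
    return mel_loss + lam * break_loss
```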

* To appear in IEEE Signal Processing Letters (SPL) 