Jiaen Liang

Rhythm-controllable Attention with High Robustness for Long Sentence Speech Synthesis

Jun 05, 2023

Dengfeng Ke, Yayue Deng, Yukang Jia, Jinlong Xue, Qi Luo, Ya Li, Jianqing Sun, Jiaen Liang, Binghuai Lin

Abstract: Regressive Text-to-Speech (TTS) systems use an attention mechanism to generate the alignment between the text and the acoustic feature sequence. Alignment determines synthesis robustness (e.g., the occurrence of skipping, repeating, and collapse) and rhythm via duration control. However, current attention algorithms used in speech synthesis cannot control rhythm using external duration information to generate natural speech while ensuring robustness. In this study, we propose Rhythm-controllable Attention (RC-Attention) based on Tacotron2, which improves robustness and naturalness simultaneously. The proposed attention adopts a trainable scalar learned from four kinds of information to achieve rhythm control, which makes rhythm control more robust and natural, even when the synthesized sentences are much longer than those in the training corpus. We use word error counting and an AB preference test to measure the robustness of the proposed method and the naturalness of the synthesized speech, respectively. Results show that RC-Attention has the lowest word error rate, nearly 0.6%, compared with 11.8% for the baseline system. Moreover, nearly 60% of subjects prefer the speech synthesized with RC-Attention to that synthesized with Forward Attention, because the former has a more natural rhythm.

* 5 pages, 3 figures, Published in: 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP)
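
The robustness figures quoted in the abstract above are word error rates. As a rough illustration only (the paper's own counting procedure is not described on this page), the sketch below computes a word error rate as the word-level edit distance between an input text and a transcript of the synthesized audio, divided by the number of input words. The function name `word_error_rate` and the example strings are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper: a skipped word counts as a deletion, a repeated
    word as an insertion, and a wrong word as a substitution.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing (skipping)
            insertion = curr[j - 1] + 1  # extra hypothesis word (repeating)
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a b d")` counts one skipped word out of four, and `word_error_rate("a b c d", "a b b c d")` counts one repeated word out of four.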

M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis

May 03, 2023

Jinlong Xue, Yayue Deng, Fengping Wang, Ya Li, Yingming Gao, Jianhua Tao, Jianqing Sun, Jiaen Liang

Abstract: Conversational text-to-speech (TTS) aims to synthesize a reply with proper prosody based on the historical conversation. However, comprehensively modeling the conversation is still a challenge, and the majority of conversational TTS systems focus only on extracting global information and omit local prosody features, which contain important fine-grained information such as keywords and emphasis. Moreover, considering only textual features is insufficient, because acoustic features also contain various prosody information. Hence, we propose M2-CTTS, an end-to-end multi-scale multi-modal conversational text-to-speech system, aiming to comprehensively utilize the historical conversation and enhance prosodic expression. More specifically, we design a textual context module and an acoustic context module with both coarse-grained and fine-grained modeling. Experimental results demonstrate that our model, which incorporates fine-grained context information and additionally considers acoustic features, achieves better prosody performance and naturalness in CMOS tests.

* 5 pages, 1 figure, 2 tables. Accepted by ICASSP 2023

ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

Mar 26, 2022

Jinlong Xue, Yayue Deng, Yichen Han, Ya Li, Jianqing Sun, Jiaen Liang

Abstract: In recent years, neural network based methods for multi-speaker text-to-speech synthesis (TTS) have made significant progress. However, the speaker encoder models currently used in these methods still cannot capture enough speaker information. In this paper, we focus on accurate speaker encoder modeling and propose an end-to-end method that can generate high-quality speech with better similarity for both seen and unseen speakers. The proposed architecture consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN model, which is derived from the speaker verification task, a FastSpeech2 based synthesizer, and a HiFi-GAN vocoder. A comparison among different speaker encoder models shows that our proposed method can achieve better naturalness and similarity. To efficiently evaluate our synthesized speech, we are the first to adopt deep learning based automatic MOS evaluation methods to assess our results, and these methods show great potential in automatic speech quality assessment.

* 5 pages, 2 figures, submitted to Interspeech 2022

Selective Pseudo-labeling and Class-wise Discriminative Fusion for Sound Event Detection

Mar 04, 2022

Yunhao Liang, Yanhua Long, Yijie Li, Jiaen Liang

Abstract: In recent years, exploring effective sound separation (SSep) techniques to improve overlapping sound event detection (SED) has attracted increasing attention. Creating accurate separation signals to avoid catastrophic error accumulation during SED model training is both important and challenging. In this study, we first propose a novel selective pseudo-labeling approach, termed SPL, to produce high-confidence separated target events from blind sound separation outputs. These target events are then used to fine-tune the original SED model, which was pre-trained on the sound mixtures, in a multi-objective learning style. Then, to further leverage the SSep outputs, a class-wise discriminative fusion is proposed to improve the final SED performance by combining multiple frame-level event predictions of both the sound mixtures and their separated signals. All experiments are performed on the public DCASE 2021 Task 4 dataset, and results show that our approaches significantly outperform the official baseline: the collar-based F1, PSDS1, and PSDS2 scores are improved from 44.3%, 37.3%, and 54.9% to 46.5%, 44.5%, and 75.4%, respectively.

* This article was submitted to Interspeech 2022

CNN-based Discriminative Training for Domain Compensation in Acoustic Event Detection with Frame-wise Classifier

Mar 26, 2021

Tiantian Tang, Xinyuan Zhou, Yanhua Long, Yijie Li, Jiaen Liang

Abstract: Domain mismatch is a noteworthy issue in acoustic event detection tasks, as target domain data is difficult to access in most real applications. In this study, we propose a novel CNN-based discriminative training framework as a domain compensation method to handle this issue. It uses a parallel CNN-based discriminator to learn a pair of high-level intermediate acoustic representations. Together with a binary discriminative loss, the discriminators are forced to maximally exploit the discrimination of heterogeneous acoustic information in each audio clip with target events, which results in robust paired representations that can discriminate the target events and the background/domain variations separately. Moreover, to better learn the transient characteristics of target events, a frame-wise classifier is designed to perform the final classification. In addition, a two-stage training procedure with CNN-based discriminator initialization is proposed to enhance the system training. All experiments are performed on the DCASE 2018 Task 3 datasets. Results show that our proposal significantly outperforms the official baseline on cross-domain conditions, with a relative AUC improvement of 1.8-12.1%, without any performance degradation on in-domain evaluation conditions.

Joint Weakly Supervised AT and AED Using Deep Feature Distillation and Adaptive Focal Loss

Mar 23, 2021

Yunhao Liang, Yanhua Long, Yijie Li, Jiaen Liang

Abstract: A good joint training framework is very helpful for improving the performance of weakly supervised audio tagging (AT) and acoustic event detection (AED) simultaneously. In this study, we propose three methods to improve the best teacher-student framework of DCASE 2019 Task 4 for both AT and AED tasks. First, a frame-level, target-event based deep feature distillation is proposed; it aims to leverage the potential of limited strongly labeled data in a weakly supervised framework to learn better intermediate feature maps. Then we propose an adaptive focal loss and a two-stage training strategy to enable effective and more accurate model training, in which the contributions of difficult-to-classify and easy-to-classify acoustic events to the total cost function are adjusted automatically. Furthermore, an event-specific post-processing step is designed to improve the prediction of target event time-stamps. Our experiments are performed on the public DCASE 2019 Task 4 dataset, and results show that our approach achieves competitive performance in both AT (49.8% F1-score) and AED (81.2% F1-score) tasks.
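
The adaptive focal loss named in the abstract above builds on the general idea of focal loss, which scales the cross-entropy term so that easy examples contribute less to the total cost. The abstract does not give the adaptive formulation, so the sketch below shows only the standard, non-adaptive binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The function name and the default values of `gamma` and `alpha` are assumptions for illustration, not values from the paper.

```python
import math


def binary_focal_loss(prob: float, target: int,
                      gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Standard binary focal loss for a single prediction.

    prob   : predicted probability of the positive class, in [0, 1]
    target : ground-truth label, 0 or 1
    gamma  : focusing parameter; gamma = 0 gives weighted cross-entropy
    alpha  : weight of the positive class (1 - alpha for the negative class)
    """
    # p_t is the probability the model assigns to the true class.
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # Clamp away from zero so that log() stays finite.
    p_t = min(max(p_t, 1e-12), 1.0)
    # (1 - p_t) ** gamma shrinks the loss of confident, correct predictions.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=2.0`, a confidently correct prediction (`p_t = 0.9`) has its cross-entropy term scaled by about 0.01, while a badly wrong one (`p_t = 0.1`) is scaled by 0.81, so hard examples dominate the total loss.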
