Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ning Cheng

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

Aug 21, 2023

Yimin Deng, Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Abstract:Voice conversion as the style transfer task applied to speech, refers to converting one person's speech into a new speech that sounds like another person's. Up to now, there has been a lot of research devoted to better implementation of VC tasks. However, a good voice conversion model should not only match the timbre information of the target speaker, but also expressive information such as prosody, pace, pause, etc. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is important but challenging, especially without text transcriptions. In this paper, we firstly propose a novel voice conversion framework named 'PMVC', which effectively separates and models the content, timbre, and prosodic information from the speech without text transcriptions. Specially, we introduce a new speech augmentation algorithm for robust prosody extraction. And building upon this, mask and predict mechanism is applied in the disentanglement of prosody and content information. The experimental results on the AIShell-3 corpus supports our improvement of naturalness and similarity of converted speech.

* Accepted by the 31st ACM International Conference on Multimedia (MM2023)

Via

Access Paper or Ask Questions

Prompt Guided Copy Mechanism for Conversational Question Answering

Aug 07, 2023

Yong Zhang, Zhitao Li, Jianzong Wang, Yiming Gao, Ning Cheng, Fengying Yu, Jing Xiao

Figure 1 for Prompt Guided Copy Mechanism for Conversational Question Answering

Figure 2 for Prompt Guided Copy Mechanism for Conversational Question Answering

Figure 3 for Prompt Guided Copy Mechanism for Conversational Question Answering

Figure 4 for Prompt Guided Copy Mechanism for Conversational Question Answering

Abstract:Conversational Question Answering (CQA) is a challenging task that aims to generate natural answers for conversational flow questions. In this paper, we propose a pluggable approach for extractive methods that introduces a novel prompt-guided copy mechanism to improve the fluency and appropriateness of the extracted answers. Our approach uses prompts to link questions to answers and employs attention to guide the copy mechanism to verify the naturalness of extracted answers, making necessary edits to ensure that the answers are fluent and appropriate. The three prompts, including a question-rationale relationship prompt, a question description prompt, and a conversation history prompt, enhance the copy mechanism's performance. Our experiments demonstrate that this approach effectively promotes the generation of natural answers and achieves good results in the CoQA challenge.

* Accepted by 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)

Via

Access Paper or Ask Questions

Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism

Aug 07, 2023

Jiaxin Fan, Yong Zhang, Hanzhang Li, Jianzong Wang, Zhitao Li, Sheng Ouyang, Ning Cheng, Jing Xiao

Abstract:Chinese Automatic Speech Recognition (ASR) error correction presents significant challenges due to the Chinese language's unique features, including a large character set and borderless, morpheme-based structure. Current mainstream models often struggle with effectively utilizing word-level features and phonetic information. This paper introduces a novel approach that incorporates a dynamic error scaling mechanism to detect and correct phonetically erroneous text generated by ASR output. This mechanism operates by dynamically fusing word-level features and phonetic information, thereby enriching the model with additional semantic data. Furthermore, our method implements unique error reduction and amplification strategies to address the issues of matching wrong words caused by incorrect characters. Experimental results indicate substantial improvements in ASR error correction, demonstrating the effectiveness of our proposed method and yielding promising results on established datasets.

* Accepted by 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)

Via

Access Paper or Ask Questions

CollabKG: A Learnable Human-Machine-Cooperative Information Extraction Toolkit for (Event) Knowledge Graph Construction

Jul 03, 2023

Xiang Wei, Yufeng Chen, Ning Cheng, Xingyu Cui, Jinan Xu, Wenjuan Han

Figure 1 for CollabKG: A Learnable Human-Machine-Cooperative Information Extraction Toolkit for (Event) Knowledge Graph Construction

Figure 2 for CollabKG: A Learnable Human-Machine-Cooperative Information Extraction Toolkit for (Event) Knowledge Graph Construction

Figure 3 for CollabKG: A Learnable Human-Machine-Cooperative Information Extraction Toolkit for (Event) Knowledge Graph Construction

Figure 4 for CollabKG: A Learnable Human-Machine-Cooperative Information Extraction Toolkit for (Event) Knowledge Graph Construction

Abstract:In order to construct or extend entity-centric and event-centric knowledge graphs (KG and EKG), the information extraction (IE) annotation toolkit is essential. However, existing IE toolkits have several non-trivial problems, such as not supporting multi-tasks, not supporting automatic updates. In this work, we present CollabKG, a learnable human-machine-cooperative IE toolkit for KG and EKG construction. Specifically, for the multi-task issue, CollabKG unifies different IE subtasks, including named entity recognition (NER), entity-relation triple extraction (RE), and event extraction (EE), and supports both KG and EKG. Then, combining advanced prompting-based IE technology, the human-machine-cooperation mechanism with LLMs as the assistant machine is presented which can provide a lower cost as well as a higher performance. Lastly, owing to the two-way interaction between the human and machine, CollabKG with learning ability allows self-renewal. Besides, CollabKG has several appealing features (e.g., customization, training-free, propagation, etc.) that make the system powerful, easy-to-use, and high-productivity. We holistically compare our toolkit with other existing tools on these features. Human evaluation quantitatively illustrates that CollabKG significantly improves annotation quality, efficiency, and stability simultaneously.

Via

Access Paper or Ask Questions

EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Jun 01, 2023

Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Figure 1 for EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Figure 2 for EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Figure 3 for EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Figure 4 for EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis

Abstract:There has been significant progress in emotional Text-To-Speech (TTS) synthesis technology in recent years. However, existing methods primarily focus on the synthesis of a limited number of emotion types and have achieved unsatisfactory performance in intensity control. To address these limitations, we propose EmoMix, which can generate emotional speech with specified intensity or a mixture of emotions. Specifically, EmoMix is a controllable emotional TTS model based on a diffusion probabilistic model and a pre-trained speech emotion recognition (SER) model used to extract emotion embedding. Mixed emotion synthesis is achieved by combining the noises predicted by diffusion model conditioned on different emotions during only one sampling process at the run-time. We further apply the Neutral and specific primary emotion mixed in varying degrees to control intensity. Experimental results validate the effectiveness of EmoMix for synthesizing mixed emotion and intensity control.

* Accepted by 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)

Via

Access Paper or Ask Questions

SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

Apr 23, 2023

Jianzong Wang, Xulong Zhang, Haobin Tang, Aolan Sun, Ning Cheng, Jing Xiao

Figure 1 for SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

Figure 2 for SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

Figure 3 for SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

Figure 4 for SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model

Abstract:In recent Text-to-Speech (TTS) systems, a neural vocoder often generates speech samples by solely conditioning on acoustic features predicted from an acoustic model. However, there are always distortions existing in the predicted acoustic features, compared to those of the groundtruth, especially in the common case of poor acoustic modeling due to low-quality training data. To overcome such limits, we propose a Self-supervised learning framework to learn an Anti-distortion acoustic Representation (SAR) to replace human-crafted acoustic features by introducing distortion prior to an auto-encoder pre-training process. The learned acoustic representation from the proposed framework is proved anti-distortion compared to the most commonly used mel-spectrogram through both objective and subjective evaluation.

* Accepted by IJCNN2023. 2023 International Joint Conference on Neural Networks (IJCNN2023)

Via

Access Paper or Ask Questions

Efficient Uncertainty Estimation with Gaussian Process for Reliable Dialog Response Retrieval

Mar 15, 2023

Tong Ye, Zhitao Li, Jianzong Wang, Ning Cheng, Jing Xiao

Figure 1 for Efficient Uncertainty Estimation with Gaussian Process for Reliable Dialog Response Retrieval

Figure 2 for Efficient Uncertainty Estimation with Gaussian Process for Reliable Dialog Response Retrieval

Figure 3 for Efficient Uncertainty Estimation with Gaussian Process for Reliable Dialog Response Retrieval

Figure 4 for Efficient Uncertainty Estimation with Gaussian Process for Reliable Dialog Response Retrieval

Abstract:Deep neural networks have achieved remarkable performance in retrieval-based dialogue systems, but they are shown to be ill calibrated. Though basic calibration methods like Monte Carlo Dropout and Ensemble can calibrate well, these methods are time-consuming in the training or inference stages. To tackle these challenges, we propose an efficient uncertainty calibration framework GPF-BERT for BERT-based conversational search, which employs a Gaussian Process layer and the focal loss on top of the BERT architecture to achieve a high-quality neural ranker. Extensive experiments are conducted to verify the effectiveness of our method. In comparison with basic calibration methods, GPF-BERT achieves the lowest empirical calibration error (ECE) in three in-domain datasets and the distributional shift tasks, while yielding the highest $R_{10}@1$ and MAP performance on most cases. In terms of time consumption, our GPF-BERT has an 8$\times$ speedup.

* Accepted by ICASSP 2023

Via

Access Paper or Ask Questions

On the Calibration and Uncertainty with Pólya-Gamma Augmentation for Dialog Retrieval Models

Mar 15, 2023

Tong Ye, Shijing Si, Jianzong Wang, Ning Cheng, Zhitao Li, Jing Xiao

Figure 1 for On the Calibration and Uncertainty with Pólya-Gamma Augmentation for Dialog Retrieval Models

Figure 2 for On the Calibration and Uncertainty with Pólya-Gamma Augmentation for Dialog Retrieval Models

Figure 3 for On the Calibration and Uncertainty with Pólya-Gamma Augmentation for Dialog Retrieval Models

Figure 4 for On the Calibration and Uncertainty with Pólya-Gamma Augmentation for Dialog Retrieval Models

Abstract:Deep neural retrieval models have amply demonstrated their power but estimating the reliability of their predictions remains challenging. Most dialog response retrieval models output a single score for a response on how relevant it is to a given question. However, the bad calibration of deep neural network results in various uncertainty for the single score such that the unreliable predictions always misinform user decisions. To investigate these issues, we present an efficient calibration and uncertainty estimation framework PG-DRR for dialog response retrieval models which adds a Gaussian Process layer to a deterministic deep neural network and recovers conjugacy for tractable posterior inference by P\'{o}lya-Gamma augmentation. Finally, PG-DRR achieves the lowest empirical calibration error (ECE) in the in-domain datasets and the distributional shift task while keeping $R_{10}@1$ and MAP performance.

* Accepted by AAAI 2023

Via

Access Paper or Ask Questions

Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy

Mar 14, 2023

Xulong Zhang, Haobin Tang, Jianzong Wang, Ning Cheng, Jian Luo, Jing Xiao

Figure 1 for Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy

Figure 2 for Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy

Figure 3 for Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy

Figure 4 for Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy

Abstract:Because of predicting all the target tokens in parallel, the non-autoregressive models greatly improve the decoding efficiency of speech recognition compared with traditional autoregressive models. In this work, we present dynamic alignment Mask CTC, introducing two methods: (1) Aligned Cross Entropy (AXE), finding the monotonic alignment that minimizes the cross-entropy loss through dynamic programming, (2) Dynamic Rectification, creating new training samples by replacing some masks with model predicted tokens. The AXE ignores the absolute position alignment between prediction and ground truth sentence and focuses on tokens matching in relative order. The dynamic rectification method makes the model capable of simulating the non-mask but possible wrong tokens, even if they have high confidence. Our experiments on WSJ dataset demonstrated that not only AXE loss but also the rectification method could improve the WER performance of Mask CTC.

* Accepted by ICASSP 2023

Via

Access Paper or Ask Questions

QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis

Mar 14, 2023

Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Figure 1 for QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis

Figure 2 for QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis

Figure 3 for QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis

Figure 4 for QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis

Abstract:Recent expressive text to speech (TTS) models focus on synthesizing emotional speech, but some fine-grained styles such as intonation are neglected. In this paper, we propose QI-TTS which aims to better transfer and control intonation to further deliver the speaker's questioning intention while transferring emotion from reference speech. We propose a multi-style extractor to extract style embedding from two different levels. While the sentence level represents emotion, the final syllable level represents intonation. For fine-grained intonation control, we use relative attributes to represent intonation intensity at the syllable level.Experiments have validated the effectiveness of QI-TTS for improving intonation expressiveness in emotional speech synthesis.

* Accepted by ICASSP 2023

Via

Access Paper or Ask Questions