Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xulong Zhang

CycleFlow: Leveraging Cycle Consistency in Flow Matching for Speaker Style Adaptation

Jan 03, 2025

Ziqi Liang, Xulong Zhang, Chang Liu, Xiaoyang Qu, Weifeng Zhao, Jianzong Wang

Abstract:Voice Conversion (VC) aims to convert the style of a source speaker, such as timbre and pitch, to the style of any target speaker while preserving the linguistic content. However, the ground truth of the converted speech does not exist in a non-parallel VC scenario, which induces the train-inference mismatch problem. Moreover, existing methods still have an inaccurate pitch and low speaker adaptation quality, there is a significant disparity in pitch between the source and target speaker style domains. As a result, the models tend to generate speech with hoarseness, posing challenges in achieving high-quality voice conversion. In this study, we propose CycleFlow, a novel VC approach that leverages cycle consistency in conditional flow matching (CFM) for speaker timbre adaptation training on non-parallel data. Furthermore, we design a Dual-CFM based on VoiceCFM and PitchCFM to generate speech and improve speaker pitch adaptation quality. Experiments show that our method can significantly improve speaker similarity, generating natural and higher-quality speech.

* Accepted by 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP2025)

Via

Access Paper or Ask Questions

ESARM: 3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations

Nov 20, 2024

Xulong Zhang, Xiaoyang Qu, Haoxiang Shi, Chunguang Xiao, Jianzong Wang

Abstract:This paper proposes a novel 3D speech-to-animation (STA) generation framework designed to address the shortcomings of existing models in producing diverse and emotionally resonant animations. Current STA models often generate animations that lack emotional depth and variety, failing to align with human expectations. To overcome these limitations, we introduce a novel STA model coupled with a reward model. This combination enables the decoupling of emotion and content under audio conditions through a cross-coupling training approach. Additionally, we develop a training methodology that leverages automatic quality evaluation of generated facial animations to guide the reinforcement learning process. This methodology encourages the STA model to explore a broader range of possibilities, resulting in the generation of diverse and emotionally expressive facial animations of superior quality. We conduct extensive empirical experiments on a benchmark dataset, and the results validate the effectiveness of our proposed framework in generating high-quality, emotionally rich 3D animations that are better aligned with human preferences.

* Accepted by the 26th IEEE International Conference on High Performance Computing and Communications (HPCC2024)

Via

Access Paper or Ask Questions

Semi-Supervised Self-Learning Enhanced Music Emotion Recognition

Oct 29, 2024

Yifu Sun, Xulong Zhang, Monan Zhou, Wei Li

Figure 1 for Semi-Supervised Self-Learning Enhanced Music Emotion Recognition

Figure 2 for Semi-Supervised Self-Learning Enhanced Music Emotion Recognition

Abstract:Music emotion recognition (MER) aims to identify the emotions conveyed in a given musical piece. But currently in the field of MER, the available public datasets have limited sample sizes. Recently, segment-based methods for emotion-related tasks have been proposed, which train backbone networks on shorter segments instead of entire audio clips, thereby naturally augmenting training samples without requiring additional resources. Then, the predicted segment-level results are aggregated to obtain the entire song prediction. The most commonly used method is that segment inherits the label of the clip containing it, but music emotion is not constant during the whole clip. Doing so will introduce label noise and make the training overfit easily. To handle the noisy label issue, we propose a semi-supervised self-learning (SSSL) method, which can differentiate between samples with correct and incorrect labels in a self-learning manner, thus effectively utilizing the augmented segment-level data. Experiments on three public emotional datasets demonstrate that the proposed method can achieve better or comparable performance.

Via

Access Paper or Ask Questions

IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding

Sep 29, 2024

Pengcheng Li, Xulong Zhang, Jing Xiao, Jianzong Wang

Figure 1 for IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding

Figure 2 for IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding

Figure 3 for IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding

Figure 4 for IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding

Abstract:The audio watermarking technique embeds messages into audio and accurately extracts messages from the watermarked audio. Traditional methods develop algorithms based on expert experience to embed watermarks into the time-domain or transform-domain of signals. With the development of deep neural networks, deep learning-based neural audio watermarking has emerged. Compared to traditional algorithms, neural audio watermarking achieves better robustness by considering various attacks during training. However, current neural watermarking methods suffer from low capacity and unsatisfactory imperceptibility. Additionally, the issue of watermark locating, which is extremely important and even more pronounced in neural audio watermarking, has not been adequately studied. In this paper, we design a dual-embedding watermarking model for efficient locating. We also consider the impact of the attack layer on the invertible neural network in robustness training, improving the model to enhance both its reasonableness and stability. Experiments show that the proposed model, IDEAW, can withstand various attacks with higher capacity and more efficient locating ability compared to existing methods.

* Accepted by the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024)

Via

Access Paper or Ask Questions

Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

May 28, 2024

Haoxiang Shi, Xulong Zhang, Ning Cheng, Yong Zhang, Jun Yu, Jing Xiao, Jianzong Wang

Figure 1 for Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

Figure 2 for Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

Figure 3 for Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

Figure 4 for Enhancing Emotion Recognition in Conversation through Emotional Cross-Modal Fusion and Inter-class Contrastive Learning

Abstract:The purpose of emotion recognition in conversation (ERC) is to identify the emotion category of an utterance based on contextual information. Previous ERC methods relied on simple connections for cross-modal fusion and ignored the information differences between modalities, resulting in the model being unable to focus on modality-specific emotional information. At the same time, the shared information between modalities was not processed to generate emotions. Information redundancy problem. To overcome these limitations, we propose a cross-modal fusion emotion prediction network based on vector connections. The network mainly includes two stages: the multi-modal feature fusion stage based on connection vectors and the emotion classification stage based on fused features. Furthermore, we design a supervised inter-class contrastive learning module based on emotion labels. Experimental results confirm the effectiveness of the proposed method, demonstrating excellent performance on the IEMOCAP and MELD datasets.

* Accepted by the 20th International Conference on Intelligent Computing (ICIC 2024)

Via

Access Paper or Ask Questions

RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval

May 28, 2024

Jianzong Wang, Haoxiang Shi, Kaiyi Luo, Xulong Zhang, Ning Cheng, Jing Xiao

Figure 1 for RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval

Figure 2 for RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval

Figure 3 for RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval

Figure 4 for RREH: Reconstruction Relations Embedded Hashing for Semi-Paired Cross-Modal Retrieval

Abstract:Known for efficient computation and easy storage, hashing has been extensively explored in cross-modal retrieval. The majority of current hashing models are predicated on the premise of a direct one-to-one mapping between data points. However, in real practice, data correspondence across modalities may be partially provided. In this research, we introduce an innovative unsupervised hashing technique designed for semi-paired cross-modal retrieval tasks, named Reconstruction Relations Embedded Hashing (RREH). RREH assumes that multi-modal data share a common subspace. For paired data, RREH explores the latent consistent information of heterogeneous modalities by seeking a shared representation. For unpaired data, to effectively capture the latent discriminative features, the high-order relationships between unpaired data and anchors are embedded into the latent subspace, which are computed by efficient linear reconstruction. The anchors are sampled from paired data, which improves the efficiency of hash learning. The RREH trains the underlying features and the binary encodings in a unified framework with high-order reconstruction relations preserved. With the well devised objective function and discrete optimization algorithm, RREH is designed to be scalable, making it suitable for large-scale datasets and facilitating efficient cross-modal retrieval. In the evaluation process, the proposed is tested with partially paired data to establish its superiority over several existing methods.

* Accepted by the 20th International Conference on Intelligent Computing (ICIC 2024)

Via

Access Paper or Ask Questions

RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

May 27, 2024

Haoxiang Shi, Jianzong Wang, Xulong Zhang, Ning Cheng, Jun Yu, Jing Xiao

Figure 1 for RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

Figure 2 for RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

Figure 3 for RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

Figure 4 for RSET: Remapping-based Sorting Method for Emotion Transfer Speech Synthesis

Abstract:Although current Text-To-Speech (TTS) models are able to generate high-quality speech samples, there are still challenges in developing emotion intensity controllable TTS. Most existing TTS models achieve emotion intensity control by extracting intensity information from reference speeches. Unfortunately, limited by the lack of modeling for intra-class emotion intensity and the model's information decoupling capability, the generated speech cannot achieve fine-grained emotion intensity control and suffers from information leakage issues. In this paper, we propose an emotion transfer TTS model, which defines a remapping-based sorting method to model intra-class relative intensity information, combined with Mutual Information (MI) to decouple speaker and emotion information, and synthesizes expressive speeches with perceptible intensity differences. Experiments show that our model achieves fine-grained emotion control while preserving speaker information.

* Accepted by the 8th APWeb-WAIM International Joint Conference on Web and Big Data

Via

Access Paper or Ask Questions

MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

May 02, 2024

Pengcheng Li, Jianzong Wang, Xulong Zhang, Yong Zhang, Jing Xiao, Ning Cheng

Figure 1 for MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

Figure 2 for MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

Figure 3 for MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

Figure 4 for MAIN-VC: Lightweight Speech Representation Disentanglement for One-shot Voice Conversion

Abstract:One-shot voice conversion aims to change the timbre of any source speech to match that of the unseen target speaker with only one speech sample. Existing methods face difficulties in satisfactory speech representation disentanglement and suffer from sizable networks as some of them leverage numerous complex modules for disentanglement. In this paper, we propose a model named MAIN-VC to effectively disentangle via a concise neural network. The proposed model utilizes Siamese encoders to learn clean representations, further enhanced by the designed mutual information estimator. The Siamese structure and the newly designed convolution module contribute to the lightweight of our model while ensuring performance in diverse voice conversion tasks. The experimental results show that the proposed model achieves comparable subjective scores and exhibits improvements in objective metrics compared to existing methods in a one-shot voice conversion scenario.

* Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

Via

Access Paper or Ask Questions

Learning Expressive Disentangled Speech Representations with Soft Speech Units and Adversarial Style Augmentation

May 01, 2024

Yimin Deng, Jianzong Wang, Xulong Zhang, Ning Cheng, Jing Xiao

Abstract:Voice conversion is the task to transform voice characteristics of source speech while preserving content information. Nowadays, self-supervised representation learning models are increasingly utilized in content extraction. However, in these representations, a lot of hidden speaker information leads to timbre leakage while the prosodic information of hidden units lacks use. To address these issues, we propose a novel framework for expressive voice conversion called "SAVC" based on soft speech units from HuBert-soft. Taking soft speech units as input, we design an attribute encoder to extract content and prosody features respectively. Specifically, we first introduce statistic perturbation imposed by adversarial style augmentation to eliminate speaker information. Then the prosody is implicitly modeled on soft speech units with knowledge distillation. Experiment results show that the intelligibility and naturalness of converted speech outperform previous work.

* Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

Via

Access Paper or Ask Questions

EfficientASR: Speech Recognition Network Compression via Attention Redundancy and Chunk-Level FFN Optimization

Apr 30, 2024

Jianzong Wang, Ziqi Liang, Xulong Zhang, Ning Cheng, Jing Xiao

Abstract:In recent years, Transformer networks have shown remarkable performance in speech recognition tasks. However, their deployment poses challenges due to high computational and storage resource requirements. To address this issue, a lightweight model called EfficientASR is proposed in this paper, aiming to enhance the versatility of Transformer models. EfficientASR employs two primary modules: Shared Residual Multi-Head Attention (SRMHA) and Chunk-Level Feedforward Networks (CFFN). The SRMHA module effectively reduces redundant computations in the network, while the CFFN module captures spatial knowledge and reduces the number of parameters. The effectiveness of the EfficientASR model is validated on two public datasets, namely Aishell-1 and HKUST. Experimental results demonstrate a 36% reduction in parameters compared to the baseline Transformer network, along with improvements of 0.3% and 0.2% in Character Error Rate (CER) on the Aishell-1 and HKUST datasets, respectively.

* Accepted by the 2024 International Joint Conference on Neural Networks (IJCNN 2024)

Via

Access Paper or Ask Questions