Jianzong Wang


Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

Sep 16, 2023
Kaiyi Luo, Xulong Zhang, Jianzong Wang, Huaxiong Li, Ning Cheng, Jing Xiao

Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, poses a great challenge due to the difficulty of uncovering discriminative features from audio clips and texts. Existing studies are restricted in the following two ways: 1) Most researchers utilize contrastive learning to construct a common subspace where similarities among data can be measured. However, they consider only cross-modal transformation, neglecting intra-modal separability. Besides, the temperature parameter is not adaptively adjusted along with semantic guidance, which degrades performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text oriented CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by taking intra-modal separability into account and adopting an adaptive temperature control strategy. Moreover, latent representation reconstruction modules are embedded into the CMR framework, which improves modal interaction. Experiments comparing CLSR with several state-of-the-art methods on two audio-text datasets validate the superiority of CLSR.

* Accepted by the 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023)
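
To make the three ingredients named in the abstract concrete (cross-modal contrast, intra-modal separability, latent reconstruction), here is a minimal PyTorch sketch of what such an objective could look like. The loss weighting, the intra-modal term, and the reconstruction decoders are illustrative assumptions rather than the paper's exact formulation; the learnable log-temperature stands in for the adaptive temperature control.

    import torch
    import torch.nn.functional as F

    def clsr_style_loss(audio_emb, text_emb, log_temp, recon_audio, recon_text, lambda_rec=1.0):
        # audio_emb, text_emb: (B, D) paired latents; log_temp: learnable scalar (e.g. nn.Parameter).
        a = F.normalize(audio_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        temp = log_temp.exp()                          # adaptive temperature (here: simply learnable)
        labels = torch.arange(a.size(0), device=a.device)

        # Cross-modal contrast: matched audio-text pairs should score highest.
        cross = F.cross_entropy(a @ t.T / temp, labels) + F.cross_entropy(t @ a.T / temp, labels)

        # Intra-modal separability: discourage high similarity between different items
        # of the same modality (off-diagonal entries only).
        eye = torch.eye(a.size(0), device=a.device, dtype=torch.bool)
        intra = (a @ a.T).masked_fill(eye, 0).mean() + (t @ t.T).masked_fill(eye, 0).mean()

        # Latent reconstruction: decoder outputs should stay close to the original latents.
        rec = F.mse_loss(recon_audio, audio_emb) + F.mse_loss(recon_text, text_emb)

        return cross + intra + lambda_rec * rec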

AOSR-Net: All-in-One Sandstorm Removal Network

Sep 16, 2023
Yazhong Si, Xulong Zhang, Fan Yang, Jianzong Wang, Ning Cheng, Jing Xiao

Most existing sandstorm image enhancement methods are based on traditional theory and prior knowledge, which often restricts their applicability in real-world scenarios. In addition, these approaches often adopt a strategy of color correction followed by dust removal, which makes the algorithm structure overly complex. To address these issues, we introduce a novel image restoration model, named the all-in-one sandstorm removal network (AOSR-Net). This model is developed based on a re-formulated sandstorm scattering model, which directly establishes the image mapping relationship by integrating intermediate parameters. This integration scheme effectively addresses the problems of over-enhancement and weak generalization in the field of sand dust image enhancement. Experimental results on synthetic and real-world sandstorm images demonstrate the superiority of the proposed AOSR-Net over state-of-the-art (SOTA) algorithms.

* Accepted by the 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023)
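
The "integration of intermediate parameters" can be read against the classical scattering model commonly used for haze and sandstorm degradation. A hedged sketch, in the spirit of AOD-Net-style re-formulations (the paper's exact parameterization may differ): folding the transmission t(x) and the ambient light A into a single map K(x) lets one network regress K(x) and recover the clean image end to end.

    I(x) = J(x)\,t(x) + A\bigl(1 - t(x)\bigr)
    \;\;\Longrightarrow\;\;
    J(x) = K(x)\,I(x) - K(x) + b,
    \qquad
    K(x) = \frac{\frac{1}{t(x)}\bigl(I(x) - A\bigr) + (A - b)}{I(x) - 1}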

FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

Sep 16, 2023
Jianzong Wang, Xulong Zhang, Aolan Sun, Ning Cheng, Jing Xiao

This paper integrates graph-to-sequence modelling into an end-to-end text-to-speech framework for syntax-aware synthesis that exploits the syntactic information of the input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract syntactic hidden information, which is concatenated with the phoneme embedding and fed to the alignment and flow-based decoding modules to generate the raw audio waveform. The model is evaluated on two languages, English and Mandarin, using single-speaker, few-shot target-speaker, and multi-speaker datasets, respectively. Experimental results show better prosodic consistency between the input text and the generated audio, higher scores in subjective prosodic evaluation, and the ability to perform voice conversion. In addition, the efficiency of the model is largely boosted through the design of an AI chip operator that delivers a 5x acceleration.

* Accepted by the 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023)
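
As a rough illustration of the fusion step described above (graph-encoder syntactic states concatenated with phoneme embeddings before alignment and decoding), the sketch below broadcasts per-word syntactic states onto phonemes and projects the concatenation back to the decoder width. The dimensions, the word-to-phoneme mapping, and the projection are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class SyntaxAwareEncoder(nn.Module):
        def __init__(self, n_phonemes=100, phone_dim=192, syn_dim=64):
            super().__init__()
            self.phone_emb = nn.Embedding(n_phonemes, phone_dim)
            self.proj = nn.Linear(phone_dim + syn_dim, phone_dim)   # fuse back to decoder width

        def forward(self, phoneme_ids, syntax_hidden, phone2word):
            # phoneme_ids:   (B, T_phone) phoneme indices
            # syntax_hidden: (B, T_word, syn_dim) per-word states from the graph encoder
            # phone2word:    (B, T_phone) long tensor; index of the word each phoneme belongs to
            ph = self.phone_emb(phoneme_ids)                         # (B, T_phone, phone_dim)
            syn = torch.gather(                                      # broadcast word -> phoneme
                syntax_hidden, 1,
                phone2word.unsqueeze(-1).expand(-1, -1, syntax_hidden.size(-1)))
            return self.proj(torch.cat([ph, syn], dim=-1))           # (B, T_phone, phone_dim)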

DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

Sep 14, 2023
Zipeng Qi, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang

Generating realistic talking faces is a complex and widely studied task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenges of directly applying diffusion models, which are traditionally trained on text-image pairs, to audio control. DiffTalker consists of two agent networks: a transformer-based landmark completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a pivotal role in establishing a seamless connection between the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This approach efficiently produces articulate talking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, all without the need for additional alignment between audio and image features.

* Submitted to ICASSP 2024
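
A minimal sketch of the two-stage flow the abstract implies: a transformer maps audio features to landmarks, and a separate diffusion-based generator (shown here only as a placeholder callable) renders the face conditioned on those landmarks. Module sizes, the 68-point landmark layout, and the interfaces are assumptions for illustration.

    import torch
    import torch.nn as nn

    class LandmarkCompletion(nn.Module):
        # Stand-in for the transformer-based landmark completion net (audio -> 2D landmarks).
        def __init__(self, audio_dim=80, n_landmarks=68, d_model=256):
            super().__init__()
            self.n_landmarks = n_landmarks
            self.proj_in = nn.Linear(audio_dim, d_model)
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
            self.proj_out = nn.Linear(d_model, n_landmarks * 2)

        def forward(self, audio_feats):                  # (B, T, audio_dim), e.g. mel frames
            h = self.encoder(self.proj_in(audio_feats))
            return self.proj_out(h).view(audio_feats.size(0), -1, self.n_landmarks, 2)

    def generate_talking_face(audio_feats, ref_image, landmark_net, diffusion_net):
        # Two-stage flow: geometry from audio, then texture from a landmark-conditioned diffusion model.
        landmarks = landmark_net(audio_feats)
        return diffusion_net(ref_image, landmarks)       # placeholder for the face-generation model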

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Sep 08, 2023
Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, Jing Xiao

In the realm of Large Language Models, the balance between instruction data quality and quantity has become a focal point. Recognizing this, we introduce a self-guided methodology for LLMs to autonomously discern and select cherry samples from vast open-source datasets, effectively minimizing manual curation and the potential cost of instruction-tuning an LLM. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal tool to identify discrepancies between a model's expected responses and its autonomous generation prowess. Through the adept application of IFD, cherry samples are pinpointed, leading to a marked uptick in model training efficiency. Empirical validations on renowned datasets like Alpaca and WizardLM underpin our findings; with a mere 10% of the conventional data input, our strategy showcases improved results. This synthesis of self-guided cherry-picking and the IFD metric signifies a transformative leap in the optimization of LLMs, promising both efficiency and resource-conscious advancements.
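
The IFD metric can be read as a ratio of two language-model losses on the same answer: the loss when the instruction is given as context versus the loss on the answer alone. The sketch below follows that reading using Hugging Face transformers; the prompt template and any normalization details in the paper may differ.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def answer_loss(model, tok, prompt, answer):
        # Average cross-entropy over the answer tokens, optionally conditioned on a prompt.
        answer_ids = tok(answer, return_tensors="pt").input_ids
        if prompt:
            prompt_ids = tok(prompt, return_tensors="pt").input_ids
            input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
            labels = input_ids.clone()
            labels[:, : prompt_ids.shape[1]] = -100      # score only the answer tokens
        else:
            input_ids, labels = answer_ids, answer_ids.clone()
        with torch.no_grad():
            return model(input_ids, labels=labels).loss.item()

    def ifd_score(model, tok, instruction, answer):
        # Higher ratio = the instruction helps less, i.e. the sample is harder to follow.
        return answer_loss(model, tok, instruction, answer) / answer_loss(model, tok, "", answer)

    # Example with any causal LM checkpoint:
    # tok = AutoTokenizer.from_pretrained("gpt2"); model = AutoModelForCausalLM.from_pretrained("gpt2")
    # print(ifd_score(model, tok, "Translate to French: cat", " chat"))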

Machine Unlearning Methodology base on Stochastic Teacher Network

Aug 28, 2023
Xulong Zhang, Jianzong Wang, Ning Cheng, Yifu Sun, Chuanyao Zhang, Jing Xiao

The rise of the phenomenon of the "right to be forgotten" has prompted research on machine unlearning, which grants data owners the right to actively withdraw data that has been used for model training, and requires the elimination of the contribution of that data to the model. A simple method to achieve this is to use the remaining data to retrain the model, but this is not acceptable for other data owners who continue to participate in training. Existing machine unlearning methods have been found to be ineffective in quickly removing knowledge from deep learning models. This paper proposes using a stochastic network as a teacher to expedite the mitigation of the influence caused by forgotten data on the model. We performed experiments on three datasets, and the findings demonstrate that our approach can efficiently mitigate the influence of target data on the model within a single epoch. This allows for one-time erasure and reconstruction of the model, and the reconstruction model achieves the same performance as the retrained model.

* Accepted by the 19th International Conference on Advanced Data Mining and Applications (ADMA 2023)
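
A compact sketch of the idea as described: distill the model's behaviour on the forget set toward a freshly re-randomised ("stochastic") teacher, which carries no knowledge of the data, while anchoring predictions on retained data to a frozen copy of the original model. The loss weighting, the choice of initialiser, and the single-pass schedule are illustrative assumptions.

    import copy
    import torch
    import torch.nn.functional as F

    def unlearn_with_stochastic_teacher(model, forget_loader, retain_loader, lr=1e-4):
        stochastic_teacher = copy.deepcopy(model)
        for p in stochastic_teacher.parameters():        # re-randomise: this teacher knows nothing
            torch.nn.init.normal_(p, std=0.02)
        original_teacher = copy.deepcopy(model).eval()   # frozen reference for retained data

        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for (x_f, _), (x_r, _) in zip(forget_loader, retain_loader):   # one pass (single epoch)
            with torch.no_grad():
                target_f = F.softmax(stochastic_teacher(x_f), dim=-1)
                target_r = F.softmax(original_teacher(x_r), dim=-1)
            # Push forget-set predictions toward the uninformative teacher,
            # while keeping retained-data predictions close to the original model.
            loss = F.kl_div(F.log_softmax(model(x_f), dim=-1), target_f, reduction="batchmean") \
                 + F.kl_div(F.log_softmax(model(x_r), dim=-1), target_r, reduction="batchmean")
            opt.zero_grad(); loss.backward(); opt.step()
        return model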

Voice Conversion with Denoising Diffusion Probabilistic GAN Models

Aug 28, 2023
Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Voice conversion is a method that allows for the transformation of speaking style while maintaining the integrity of linguistic information. Many researchers have used deep generative models for voice conversion tasks. Generative Adversarial Networks (GANs) can quickly generate high-quality samples, but the generated samples lack diversity. Samples generated by Denoising Diffusion Probabilistic Models (DDPMs) are better than those of GANs in terms of mode coverage and sample diversity, but DDPMs have high computational costs and slower inference than GANs. To make GANs and DDPMs more practical, we propose DiffGAN-VC, a variant that combines the two, to achieve non-parallel many-to-many voice conversion (VC). We use large steps to achieve denoising and introduce a multimodal conditional GAN to model the denoising diffusion generative adversarial network. According to both objective and subjective evaluation experiments, DiffGAN-VC achieves high voice quality on non-parallel datasets. Compared with the CycleGAN-VC method, DiffGAN-VC achieves higher speaker similarity, naturalness, and sound quality.

* Accepted by the 19th International Conference on Advanced Data Mining and Applications (ADMA 2023)
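
To make the "large denoising steps plus a conditional GAN" idea concrete, here is an illustrative training step: the generator reconstructs clean acoustic features from a noised version in one shot at a coarse timestep, and a discriminator judges the result against the real features. Only a handful of steps (T=4) replaces the usual ~1000; the conditioning, the discriminator interface, and the losses are assumptions rather than DiffGAN-VC's exact design.

    import torch
    import torch.nn.functional as F

    T = 4                                               # few large denoising steps
    betas = torch.linspace(1e-4, 0.5, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    def q_sample(x0, t, noise):
        # Forward diffusion: noise clean features x0 (B, n_mels, frames) to level t.
        ab = alpha_bar[t].view(-1, 1, 1)
        return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

    def diffgan_step(G, D, x0, cond, opt_g, opt_d):
        # G(x_t, t, cond) -> one-shot x0 estimate; D(x, x_t, t) -> (B, 1) realness logit.
        B = x0.size(0)
        t = torch.randint(0, T, (B,))
        x_t = q_sample(x0, t, torch.randn_like(x0))

        # Discriminator: real x0 vs. the generator's reconstruction.
        x0_fake = G(x_t, t, cond).detach()
        d_loss = F.binary_cross_entropy_with_logits(D(x0, x_t, t), torch.ones(B, 1)) + \
                 F.binary_cross_entropy_with_logits(D(x0_fake, x_t, t), torch.zeros(B, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator: fool the discriminator and stay close to the target features.
        x0_fake = G(x_t, t, cond)
        g_loss = F.binary_cross_entropy_with_logits(D(x0_fake, x_t, t), torch.ones(B, 1)) + \
                 F.l1_loss(x0_fake, x0)
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()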

Symbolic & Acoustic: Multi-domain Music Emotion Modeling for Instrumental Music

Aug 28, 2023
Kexin Zhu, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Music Emotion Recognition (MER) involves the automatic identification of emotional elements within music tracks, and it has garnered significant attention due to its broad applicability in the field of Music Information Retrieval. It can also serve as the upstream task for many other human-related tasks such as emotional music generation and music recommendation. According to existing psychology research, music emotion is determined by multiple factors such as the Timbre, Velocity, and Structure of the music. Incorporating multiple factors in MER helps achieve more interpretable and finer-grained methods. However, most prior works were uni-domain and showed weak consistency between arousal modeling performance and valence modeling performance. Against this background, we designed a multi-domain emotion modeling method for instrumental music that combines symbolic analysis and acoustic analysis. At the same time, because of the scarcity of music data and the difficulty of labeling, our multi-domain approach can make full use of limited data. Our approach was implemented and assessed using the publicly available piano dataset EMOPIA, resulting in a notable improvement over our baseline model with a 2.4% increase in overall accuracy, establishing state-of-the-art performance.

* Accepted by the 19th International Conference on Advanced Data Mining and Applications (ADMA 2023)
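
A minimal two-branch sketch of the multi-domain idea: one branch consumes symbolic (score/MIDI-derived) features, the other acoustic features, and their fused representation feeds an emotion classifier (four classes here, matching EMOPIA's arousal/valence quadrants). The feature dimensions and the late-fusion scheme are assumptions for illustration.

    import torch
    import torch.nn as nn

    class MultiDomainMER(nn.Module):
        def __init__(self, sym_dim=128, ac_dim=128, hidden=256, n_classes=4):
            super().__init__()
            self.symbolic_branch = nn.Sequential(nn.Linear(sym_dim, hidden), nn.ReLU())
            self.acoustic_branch = nn.Sequential(nn.Linear(ac_dim, hidden), nn.ReLU())
            self.classifier = nn.Linear(2 * hidden, n_classes)   # e.g. arousal/valence quadrants

        def forward(self, symbolic_feats, acoustic_feats):
            # Late fusion: encode each domain separately, then concatenate and classify.
            h = torch.cat([self.symbolic_branch(symbolic_feats),
                           self.acoustic_branch(acoustic_feats)], dim=-1)
            return self.classifier(h)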

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

Aug 21, 2023
Yimin Deng, Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Voice conversion, as the style transfer task applied to speech, refers to converting one person's speech into new speech that sounds like another person's. To date, much research has been devoted to better implementation of VC tasks. However, a good voice conversion model should not only match the timbre of the target speaker, but also expressive information such as prosody, pace, and pauses. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is important but challenging, especially without text transcriptions. In this paper, we propose a novel voice conversion framework named 'PMVC', which effectively separates and models the content, timbre, and prosodic information of speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction, and building upon this, a mask-and-predict mechanism is applied to disentangle prosody from content information. The experimental results on the AIShell-3 corpus support the improvement in naturalness and similarity of the converted speech.

* Accepted by the 31st ACM International Conference on Multimedia (MM2023) 
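
The mask-and-predict mechanism mentioned above can be sketched as follows: hide random frames of the content representation and train a small predictor to fill them in from the unmasked context, discouraging the content branch from leaking frame-level prosodic detail. The masking ratio, where the loss attaches, and the predictor module are assumptions, not the paper's specification.

    import torch
    import torch.nn.functional as F

    def mask_and_predict_loss(content_seq, predictor, mask_ratio=0.3):
        # content_seq: (B, T, D) content representation; predictor: any (B, T, D) -> (B, T, D) module.
        B, T, D = content_seq.shape
        mask = torch.rand(B, T, device=content_seq.device) < mask_ratio   # frames to hide
        masked = content_seq.masked_fill(mask.unsqueeze(-1), 0.0)
        predicted = predictor(masked)
        # Reconstruct only the hidden frames from the surrounding context.
        return F.mse_loss(predicted[mask], content_seq[mask])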