Jing Xiao

Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

Sep 16, 2023

Kaiyi Luo, Xulong Zhang, Jianzong Wang, Huaxiong Li, Ning Cheng, Jing Xiao

Abstract:Cross-modal retrieval (CMR) has been extensively applied in various domains, such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval, a less explored domain, poses a great challenge because discriminative features are difficult to uncover from audio clips and texts. Existing studies are restricted in the following two ways: 1) Most researchers utilize contrastive learning to construct a common subspace where similarities among data can be measured. However, they consider only cross-modal transformation, neglecting intra-modal separability. Besides, the temperature parameter is not adaptively adjusted along with semantic guidance, which degrades performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text oriented CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by taking intra-modal separability into account and adopting an adaptive temperature control strategy. Moreover, latent representation reconstruction modules are embedded into the CMR framework, which improves modal interaction. Experiments in comparison with some state-of-the-art methods on two audio-text datasets have validated the superiority of CLSR.

* Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence. (ICTAI 2023)
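
The contrastive objective mentioned in the abstract above can be illustrated with a generic symmetric InfoNCE loss over paired audio and text embeddings. This is a minimal sketch, not the CLSR loss: it uses a fixed temperature (0.07 is an assumed default), it omits the intra-modal separability term and the reconstruction modules, and the function names and toy embeddings are hypothetical.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    """Average of audio-to-text and text-to-audio InfoNCE losses.

    Row i of audio_emb is assumed to be paired with row i of text_emb.
    The fixed temperature is an assumption; the paper adapts it.
    """
    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    n = len(a)
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Audio-to-text: each row should put its mass on the diagonal entry.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio: same idea over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)

# Toy embeddings (hypothetical): three audio clips and their captions.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched_text = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
mismatched_text = [matched_text[1], matched_text[2], matched_text[0]]

matched_loss = symmetric_info_nce(audio, matched_text)
mismatched_loss = symmetric_info_nce(audio, mismatched_text)
```

Lowering the temperature sharpens the softmax and penalizes hard negatives more strongly, which is why a schedule for it can matter.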

FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

Sep 16, 2023

Jianzong Wang, Xulong Zhang, Aolan Sun, Ning Cheng, Jing Xiao

Abstract:This paper integrates graph-to-sequence into an end-to-end text-to-speech framework for syntax-aware modelling with syntactic information of input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract the syntactic hidden information, which is concatenated with phoneme embedding and input to the alignment and flow-based decoding modules to generate the raw audio waveform. The model is evaluated on two languages, English and Mandarin, using single-speaker, few-sample target-speaker, and multi-speaker datasets, respectively. Experimental results show better prosodic consistency between input text and generated audio, higher scores in the subjective prosodic evaluation, and the ability to perform voice conversion. Besides, the efficiency of the model is largely boosted through the design of the AI chip operator with 5x acceleration.

* Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence. (ICTAI 2023)
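
The first step described in the abstract above, turning a dependency parse into a syntactic graph, can be sketched as follows. This is only an illustration under assumptions: the parse is supplied by hand as a list of head indices rather than produced by a parser, the function name is hypothetical, and the undirected edges and self-loops are common graph-encoder conventions, not details confirmed by the paper.

```python
def dependency_adjacency(heads, bidirectional=True, self_loops=True):
    """Build an adjacency matrix from a dependency parse.

    heads[i] is the index of token i's syntactic head, or -1 for the root.
    """
    n = len(heads)
    adj = [[0] * n for _ in range(n)]
    for child, head in enumerate(heads):
        if self_loops:
            adj[child][child] = 1
        if head >= 0:
            adj[head][child] = 1      # edge from head to dependent
            if bidirectional:
                adj[child][head] = 1  # reverse edge so information flows both ways
    return adj

# Hand-written parse (assumption): "The" -> "cat" -> "sleeps" (root).
tokens = ["The", "cat", "sleeps"]
heads = [1, 2, -1]
adjacency = dependency_adjacency(heads)
```

A graph encoder would then consume this matrix together with per-token features; that part is beyond this sketch.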

DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks

Sep 14, 2023

Zipeng Qi, Xulong Zhang, Ning Cheng, Jing Xiao, Jianzong Wang

Abstract:Generating realistic talking faces is a complex and widely discussed task with numerous applications. In this paper, we present DiffTalker, a novel model designed to generate lifelike talking faces through audio and landmark co-driving. DiffTalker addresses the challenges associated with directly applying diffusion models to audio control, which are traditionally trained on text-image pairs. DiffTalker consists of two agent networks: a transformer-based landmarks completion network for geometric accuracy and a diffusion-based face generation network for texture details. Landmarks play a pivotal role in establishing a seamless connection between the audio and image domains, facilitating the incorporation of knowledge from pre-trained diffusion models. This innovative approach efficiently produces articulate-speaking faces. Experimental results showcase DiffTalker's superior performance in producing clear and geometrically accurate talking faces, all without the need for additional alignment between audio and image features.

* Submitted to ICASSP 2024

From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning

Sep 08, 2023

Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, Jing Xiao

Abstract:In the realm of Large Language Models, the balance between instruction data quality and quantity has become a focal point. Recognizing this, we introduce a self-guided methodology for LLMs to autonomously discern and select cherry samples from vast open-source datasets, effectively minimizing manual curation and potential cost for instruction tuning an LLM. Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal tool to identify discrepancies between a model's expected responses and its autonomous generation prowess. Through the adept application of IFD, cherry samples are pinpointed, leading to a marked uptick in model training efficiency. Empirical validations on renowned datasets like Alpaca and WizardLM underpin our findings; with a mere 10% of conventional data input, our strategy showcases improved results. This synthesis of self-guided cherry-picking and the IFD metric signifies a transformative leap in the optimization of LLMs, promising both efficiency and resource-conscious advancements.
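
The abstract does not spell out how the IFD metric is computed. One plausible reading, assumed here and not confirmed by the text above, is a ratio between the model's average loss on a response when the instruction is given and its average loss on the same response alone. The sketch below takes per-token log-probabilities as plain lists; the function names, the toy numbers, and the choice to keep the highest-scoring fraction are all hypothetical.

```python
def mean_nll(token_logprobs):
    # Average negative log-likelihood over the response tokens.
    return -sum(token_logprobs) / len(token_logprobs)

def difficulty_ratio(logprobs_with_instruction, logprobs_without_instruction):
    """Assumed score: loss on the response given the instruction divided by
    loss on the response alone. Values near 1 mean the instruction helps little."""
    return mean_nll(logprobs_with_instruction) / mean_nll(logprobs_without_instruction)

def select_top_fraction(scores, fraction=0.1):
    """Return the indices of the highest-scoring samples (assumed selection rule)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy log-probabilities (hypothetical) for two instruction-response pairs.
easy = difficulty_ratio([-0.1, -0.2], [-1.0, -2.0])
hard = difficulty_ratio([-0.9, -1.8], [-1.0, -2.0])
kept = select_top_fraction([0.1, 0.9, 0.5, 0.3], fraction=0.5)
```

In practice the log-probabilities would come from a forward pass of the language model being tuned, once with and once without the instruction in the context.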

Voice Conversion with Denoising Diffusion Probabilistic GAN Models

Aug 28, 2023

Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Abstract:Voice conversion is a method that allows for the transformation of speaking style while maintaining the integrity of linguistic information. Many researchers use deep generative models for voice conversion tasks. Generative Adversarial Networks (GANs) can quickly generate high-quality samples, but the generated samples lack diversity. The samples generated by Denoising Diffusion Probabilistic Models (DDPMs) are better than those of GANs in terms of mode coverage and sample diversity. However, DDPMs have high computational costs, and their inference is slower than that of GANs. To make GANs and DDPMs more practical, we propose DiffGAN-VC, a variant of GANs and DDPMs, to achieve non-parallel many-to-many voice conversion (VC). We use large steps to achieve denoising, and also introduce a multimodal conditional GAN to model the denoising diffusion generative adversarial network. According to both objective and subjective evaluation experiments, DiffGAN-VC has been shown to achieve high voice quality on non-parallel data sets. Compared with the CycleGAN-VC method, DiffGAN-VC achieves speaker similarity, naturalness and higher sound quality.

* Accepted by 19th International Conference on Advanced Data Mining and Applications. (ADMA 2023)

Symbolic & Acoustic: Multi-domain Music Emotion Modeling for Instrumental Music

Aug 28, 2023

Kexin Zhu, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Abstract:Music Emotion Recognition (MER) involves the automatic identification of emotional elements within music tracks, and it has garnered significant attention due to its broad applicability in the field of Music Information Retrieval. It can also be used as the upstream task of many other human-related tasks such as emotional music generation and music recommendation. According to existing psychology research, music emotion is determined by multiple factors such as the Timbre, Velocity, and Structure of the music. Incorporating multiple factors in MER helps achieve more interpretable and finer-grained methods. However, most prior works were uni-domain and showed weak consistency between arousal modeling performance and valence modeling performance. Based on this background, we designed a multi-domain emotion modeling method for instrumental music that combines symbolic analysis and acoustic analysis. At the same time, because of the rarity of music data and the difficulty of labeling, our multi-domain approach can make full use of limited data. Our approach was implemented and assessed using the publicly available piano dataset EMOPIA, resulting in a notable improvement over our baseline model with a 2.4% increase in overall accuracy, establishing its state-of-the-art performance.

* Accepted by 19th International Conference on Advanced Data Mining and Applications. (ADMA 2023)

Machine Unlearning Methodology base on Stochastic Teacher Network

Aug 28, 2023

Xulong Zhang, Jianzong Wang, Ning Cheng, Yifu Sun, Chuanyao Zhang, Jing Xiao

Abstract:The rise of the phenomenon of the "right to be forgotten" has prompted research on machine unlearning, which grants data owners the right to actively withdraw data that has been used for model training, and requires the elimination of the contribution of that data to the model. A simple method to achieve this is to use the remaining data to retrain the model, but this is not acceptable for other data owners who continue to participate in training. Existing machine unlearning methods have been found to be ineffective in quickly removing knowledge from deep learning models. This paper proposes using a stochastic network as a teacher to expedite the mitigation of the influence caused by forgotten data on the model. We performed experiments on three datasets, and the findings demonstrate that our approach can efficiently mitigate the influence of target data on the model within a single epoch. This allows for one-time erasure and reconstruction of the model, and the reconstruction model achieves the same performance as the retrained model.

* Accepted by 19th International Conference on Advanced Data Mining and Applications. (ADMA 2023)

Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment

Aug 23, 2023

Kangmin Xu, Liang Liao, Jing Xiao, Chaofeng Chen, Haoning Wu, Qiong Yan, Weisi Lin

Abstract:Image Quality Assessment (IQA) constitutes a fundamental task within the field of computer vision, yet it remains an unresolved challenge, owing to the intricate distortion conditions, diverse image contents, and limited availability of data. Recently, the community has witnessed the emergence of numerous large-scale pretrained foundation models, which greatly benefit from dramatically increased data and parameter capacities. However, it remains an open problem whether the scaling law in high-level tasks is also applicable to the IQA task, which is closely related to low-level clues. In this paper, we demonstrate that with proper injection of local distortion features, a larger pretrained and fixed foundation model performs better in IQA tasks. Specifically, to compensate for the vision transformer's (ViT) lack of local distortion structure and inductive bias, alongside the large-scale pretrained ViT we use another pretrained convolutional neural network (CNN), which is well known for capturing local structure, to extract multi-scale image features. Further, we propose a local distortion extractor to obtain local distortion features from the pretrained CNN and a local distortion injector to inject the local distortion features into ViT. By only training the extractor and injector, our method can benefit from the rich knowledge in the powerful foundation models and achieve state-of-the-art performance on popular IQA datasets, indicating that IQA is not only a low-level problem but also benefits from stronger high-level features drawn from large-scale pretrained models.

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

Aug 21, 2023

Yimin Deng, Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Abstract:Voice conversion (VC), as the style transfer task applied to speech, refers to converting one person's speech into new speech that sounds like another person's. Up to now, a lot of research has been devoted to better implementation of VC tasks. However, a good voice conversion model should match not only the timbre information of the target speaker, but also expressive information such as prosody, pace, and pause. In this context, prosody modeling is crucial for achieving expressive voice conversion that sounds natural and convincing. Unfortunately, prosody modeling is challenging, especially without text transcriptions. In this paper, we first propose a novel voice conversion framework named 'PMVC', which effectively separates and models the content, timbre, and prosodic information from the speech without text transcriptions. Specifically, we introduce a new speech augmentation algorithm for robust prosody extraction. Building upon this, a mask-and-predict mechanism is applied to disentangle prosody and content information. The experimental results on the AIShell-3 corpus support our improvement of naturalness and similarity of converted speech.

* Accepted by the 31st ACM International Conference on Multimedia (MM2023)

EdgeMA: Model Adaptation System for Real-Time Video Analytics on Edge Devices

Aug 17, 2023

Liang Wang, Nan Zhang, Xiaoyang Qu, Jianzong Wang, Jiguang Wan, Guokuan Li, Kaiyu Hu, Guilin Jiang, Jing Xiao

Abstract:Real-time video analytics on edge devices for changing scenes remains a difficult task. As edge devices are usually resource-constrained, edge deep neural networks (DNNs) have fewer weights and shallower architectures than general DNNs. As a result, they only perform well in limited scenarios and are sensitive to data drift. In this paper, we introduce EdgeMA, a practical and efficient video analytics system designed to adapt models to shifts in real-world video streams over time, addressing the data drift problem. EdgeMA extracts statistical texture features based on the gray-level co-occurrence matrix and uses a Random Forest classifier to detect domain shift. Moreover, we have incorporated a method of model adaptation based on importance weighting, specifically designed to update models to cope with label distribution shift. Through rigorous evaluation of EdgeMA on a real-world dataset, our results illustrate that EdgeMA significantly improves inference accuracy.

* Accepted by 30th International Conference on Neural Information Processing (ICONIP 2023)
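
The gray-level co-occurrence matrix (GLCM) named in the abstract above is a standard texture descriptor, and a minimal version can be sketched in plain Python. This is not EdgeMA's implementation: the single horizontal offset, the choice of contrast and energy as features, and the toy images are assumptions, and the Random Forest stage that would consume the features is omitted.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for one pixel offset (dx, dy).

    image is a 2D list of integer gray levels in [0, levels).
    The image must contain at least one valid pixel pair for the offset.
    """
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(image), len(image[0])
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                counts[image[y][x]][image[ny][nx]] += 1
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]

def contrast(p):
    # Weighted sum of squared gray-level differences; 0 for a flat image.
    return sum(p[i][j] * (i - j) ** 2 for i in range(len(p)) for j in range(len(p)))

def energy(p):
    # Sum of squared probabilities; 1 when a single pair dominates entirely.
    return sum(v * v for row in p for v in row)

# Toy frames (hypothetical): a flat patch and a high-contrast striped patch.
flat = [[1, 1, 1], [1, 1, 1]]
striped = [[0, 3, 0, 3], [0, 3, 0, 3]]

flat_features = (contrast(glcm(flat, 4)), energy(glcm(flat, 4)))
striped_features = (contrast(glcm(striped, 4)), energy(glcm(striped, 4)))
```

Feature vectors like these, computed per frame, are the kind of input a classifier could use to flag that the scene statistics have shifted.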
