Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jing Xiao

DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation

Nov 29, 2023

Jiangzong Wang, Pengcheng Li, Xulong Zhang, Ning Cheng, Jing Xiao

Figure 1 for DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation

Figure 2 for DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation

Figure 3 for DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation

Figure 4 for DQR-TTS: Semi-supervised Text-to-speech Synthesis with Dynamic Quantized Representation

Abstract:Most existing neural-based text-to-speech methods rely on extensive datasets and face challenges under low-resource condition. In this paper, we introduce a novel semi-supervised text-to-speech synthesis model that learns from both paired and unpaired data to address this challenge. The key component of the proposed model is a dynamic quantized representation module, which is integrated into a sequential autoencoder. When given paired data, the module incorporates a trainable codebook that learns quantized representations under the supervision of the paired data. However, due to the limited paired data in low-resource scenario, these paired data are difficult to cover all phonemes. Then unpaired data is fed to expand the dynamic codebook by adding quantized representation vectors that are sufficiently distant from the existing ones during training. Experiments show that with less than 120 minutes of paired data, the proposed method outperforms existing methods in both subjective and objective metrics.

* Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)

Via

Access Paper or Ask Questions

Towards Accurate Loop Closure Detection in Semantic SLAM with 3D Semantic Covisibility Graphs

Nov 21, 2023

Zhentian Qian, Jie Fu, Jing Xiao

Figure 1 for Towards Accurate Loop Closure Detection in Semantic SLAM with 3D Semantic Covisibility Graphs

Figure 2 for Towards Accurate Loop Closure Detection in Semantic SLAM with 3D Semantic Covisibility Graphs

Figure 3 for Towards Accurate Loop Closure Detection in Semantic SLAM with 3D Semantic Covisibility Graphs

Figure 4 for Towards Accurate Loop Closure Detection in Semantic SLAM with 3D Semantic Covisibility Graphs

Abstract:Loop closure is necessary for correcting errors accumulated in simultaneous localization and mapping (SLAM) in unknown environments. However, conventional loop closure methods based on low-level geometric or image features may cause high ambiguity by not distinguishing similar scenarios. Thus, incorrect loop closures can occur. Though semantic 2D image information is considered in some literature to detect loop closures, there is little work that compares 3D scenes as an integral part of a semantic SLAM system. This paper introduces an approach, called SmSLAM+LCD, integrated into a semantic SLAM system to combine high-level 3D semantic information and low-level feature information to conduct accurate loop closure detection and effective drift reduction. The effectiveness of our approach is demonstrated in testing results.

Via

Access Paper or Ask Questions

GAIA: Delving into Gradient-based Attribution Abnormality for Out-of-distribution Detection

Nov 16, 2023

Jinggang Chen, Junjie Li, Xiaoyang Qu, Jianzong Wang, Jiguang Wan, Jing Xiao

Abstract:Detecting out-of-distribution (OOD) examples is crucial to guarantee the reliability and safety of deep neural networks in real-world settings. In this paper, we offer an innovative perspective on quantifying the disparities between in-distribution (ID) and OOD data -- analyzing the uncertainty that arises when models attempt to explain their predictive decisions. This perspective is motivated by our observation that gradient-based attribution methods encounter challenges in assigning feature importance to OOD data, thereby yielding divergent explanation patterns. Consequently, we investigate how attribution gradients lead to uncertain explanation outcomes and introduce two forms of abnormalities for OOD detection: the zero-deflation abnormality and the channel-wise average abnormality. We then propose GAIA, a simple and effective approach that incorporates Gradient Abnormality Inspection and Aggregation. The effectiveness of GAIA is validated on both commonly utilized (CIFAR) and large-scale (ImageNet-1k) benchmarks. Specifically, GAIA reduces the average FPR95 by 23.10% on CIFAR10 and by 45.41% on CIFAR100 compared to advanced post-hoc methods.

* Accepted by NeurIPS2023

Via

Access Paper or Ask Questions

CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation

Nov 15, 2023

Yimin Deng, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Figure 1 for CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation

Figure 2 for CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation

Figure 3 for CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation

Figure 4 for CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation

Abstract:Better disentanglement of speech representation is essential to improve the quality of voice conversion. Recently contrastive learning is applied to voice conversion successfully based on speaker labels. However, the performance of model will reduce in conversion between similar speakers. Hence, we propose an augmented negative sample selection to address the issue. Specifically, we create hard negative samples based on the proposed speaker fusion module to improve learning ability of speaker encoder. Furthermore, considering the fine-grain modeling of speaker style, we employ a reference encoder to extract fine-grained style and conduct the augmented contrastive learning on global style. The experimental results show that the proposed method outperforms previous work in voice conversion tasks.

* Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)

Via

Access Paper or Ask Questions

CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding

Nov 15, 2023

Jianzong Wang, Yimin Deng, Ziqi Liang, Xulong Zhang, Ning Cheng, Jing Xiao

Figure 1 for CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding

Figure 2 for CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding

Figure 3 for CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding

Figure 4 for CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding

Abstract:This paper proposes a talking face generation method named "CP-EB" that takes an audio signal as input and a person image as reference, to synthesize a photo-realistic people talking video with head poses controlled by a short video clip and proper eye blinking embedding. It's noted that not only the head pose but also eye blinking are both important aspects for deep fake detection. The implicit control of poses by video has already achieved by the state-of-art work. According to recent research, eye blinking has weak correlation with input audio which means eye blinks extraction from audio and generation are possible. Hence, we propose a GAN-based architecture to extract eye blink feature from input audio and reference video respectively and employ contrastive training between them, then embed it into the concatenated features of identity and poses to generate talking face images. Experimental results show that the proposed method can generate photo-realistic talking face with synchronous lips motions, natural head poses and blinking eyes.

* Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)

Via

Access Paper or Ask Questions

PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter

Oct 23, 2023

Haoyan Yang, Zhitao Li, Yong Zhang, Jianzong Wang, Ning Cheng, Ming Li, Jing Xiao

Figure 1 for PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter

Figure 2 for PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter

Figure 3 for PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter

Figure 4 for PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter

Abstract:The Retrieval Question Answering (ReQA) task employs the retrieval-augmented framework, composed of a retriever and generator. The generator formulates the answer based on the documents retrieved by the retriever. Incorporating Large Language Models (LLMs) as generators is beneficial due to their advanced QA capabilities, but they are typically too large to be fine-tuned with budget constraints while some of them are only accessible via APIs. To tackle this issue and further improve ReQA performance, we propose a trainable Pluggable Reward-Driven Contextual Adapter (PRCA), keeping the generator as a black box. Positioned between the retriever and generator in a Pluggable manner, PRCA refines the retrieved information by operating in a token-autoregressive strategy via maximizing rewards of the reinforcement learning phase. Our experiments validate PRCA's effectiveness in enhancing ReQA performance on three datasets by up to 20% improvement to fit black-box LLMs into existing frameworks, demonstrating its considerable potential in the LLMs era.

* Accepted by the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. (EMNLP2023)

Via

Access Paper or Ask Questions

VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model

Oct 07, 2023

Yayun He, Zuheng Kang, Jianzong Wang, Junqing Peng, Jing Xiao

Figure 1 for VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model

Figure 2 for VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model

Figure 3 for VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model

Figure 4 for VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model

Abstract:Speaker verification (SV) performance deteriorates as utterances become shorter. To this end, we propose a new architecture called VoiceExtender which provides a promising solution for improving SV performance when handling short-duration speech signals. We use two guided diffusion models, the built-in and the external speaker embedding (SE) guided diffusion model, both of which utilize a diffusion model-based sample generator that leverages SE guidance to augment the speech features based on a short utterance. Extensive experimental results on the VoxCeleb1 dataset show that our method outperforms the baseline, with relative improvements in equal error rate (EER) of 46.1%, 35.7%, 10.4%, and 5.7% for the short utterance conditions of 0.5, 1.0, 1.5, and 2.0 seconds, respectively.

* Accepted by the 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2023)

Via

Access Paper or Ask Questions

Prior Bilinear Based Models for Knowledge Graph Completion

Sep 25, 2023

Jiayi Li, Ruilin Luo, Jiaqi Sun, Jing Xiao, Yujiu Yang

Abstract:Bilinear based models are powerful and widely used approaches for Knowledge Graphs Completion (KGC). Although bilinear based models have achieved significant advances, these studies mainly concentrate on posterior properties (based on evidence, e.g. symmetry pattern) while neglecting the prior properties. In this paper, we find a prior property named "the law of identity" that cannot be captured by bilinear based models, which hinders them from comprehensively modeling the characteristics of KGs. To address this issue, we introduce a solution called Unit Ball Bilinear Model (UniBi). This model not only achieves theoretical superiority but also offers enhanced interpretability and performance by minimizing ineffective learning through minimal constraints. Experiments demonstrate that UniBi models the prior property and verify its interpretability and performance.

Via

Access Paper or Ask Questions

Active perception network for non-myopic online exploration and visual surface coverage

Sep 21, 2023

David Vutetakis, Jing Xiao

Abstract:This work addresses the problem of online exploration and visual sensor coverage of unknown environments. We introduce a novel perception roadmap we refer to as the Active Perception Network (APN) that serves as a hierarchical topological graph describing how to traverse and perceive an incrementally built spatial map of the environment. The APN state is incrementally updated to expand a connected configuration space that extends throughout as much of the known space as possible, using efficient difference-awareness techniques that track the discrete changes of the spatial map to inform the updates. A frontier-guided approach is presented for efficient evaluation of information gain and covisible information, which guides view sampling and refinement to ensure maximum coverage of the unmapped space is maintained within the APN. The updated roadmap is hierarchically decomposed into subgraph regions which we use to facilitate a non-myopic global view sequence planner. A comparative analysis to several state-of-the-art approaches was conducted, showing significant performance improvements in terms of total exploration time and surface coverage, and demonstrating high computational efficiency that is scalable to large and complex environments.

* 19 pages, 10 figures

Via

Access Paper or Ask Questions

AOSR-Net: All-in-One Sandstorm Removal Network

Sep 16, 2023

Yazhong Si, Xulong Zhang, Fan Yang, Jianzong Wang, Ning Cheng, Jing Xiao

Abstract:Most existing sandstorm image enhancement methods are based on traditional theory and prior knowledge, which often restrict their applicability in real-world scenarios. In addition, these approaches often adopt a strategy of color correction followed by dust removal, which makes the algorithm structure too complex. To solve the issue, we introduce a novel image restoration model, named all-in-one sandstorm removal network (AOSR-Net). This model is developed based on a re-formulated sandstorm scattering model, which directly establishes the image mapping relationship by integrating intermediate parameters. Such integration scheme effectively addresses the problems of over-enhancement and weak generalization in the field of sand dust image enhancement. Experimental results on synthetic and real-world sandstorm images demonstrate the superiority of the proposed AOSR-Net over state-of-the-art (SOTA) algorithms.

* Accepted by The 35th IEEE International Conference on Tools with Artificial Intelligence. (ICTAI 2023)

Via

Access Paper or Ask Questions