Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xu Li

Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics-Huazhong University of Science and Technology, China

CLAPSep: Leveraging Contrastive Pre-trained Models for Multi-Modal Query-Conditioned Target Sound Extraction

Feb 27, 2024

Hao Ma, Zhiyuan Peng, Mingjie Shao, Ju Liu, Xu Li, Xixin Wu

Figure 1 for CLAPSep: Leveraging Contrastive Pre-trained Models for Multi-Modal Query-Conditioned Target Sound Extraction

Figure 2 for CLAPSep: Leveraging Contrastive Pre-trained Models for Multi-Modal Query-Conditioned Target Sound Extraction

Figure 3 for CLAPSep: Leveraging Contrastive Pre-trained Models for Multi-Modal Query-Conditioned Target Sound Extraction

Figure 4 for CLAPSep: Leveraging Contrastive Pre-trained Models for Multi-Modal Query-Conditioned Target Sound Extraction

Abstract:Universal sound separation (USS) aims to extract arbitrary types of sounds from real-world sound recordings. Language-queried target sound extraction (TSE) is an effective approach to achieving USS. Such systems consist of two components: a query network that converts user queries into conditional embeddings, and a separation network that extracts the target sound based on conditional embeddings. Existing methods mainly suffer from two issues: firstly, they require training a randomly initialized model from scratch, lacking the utilization of pre-trained models, and substantial data and computational resources are needed to ensure model convergence; secondly, existing methods need to jointly train a query network and a separation network, which tends to lead to overfitting. To address these issues, we build the CLAPSep model based on contrastive language-audio pre-trained model (CLAP). We achieve this by using a pre-trained text encoder of CLAP as the query network and introducing pre-trained audio encoder weights of CLAP into the separation network to fully utilize the prior knowledge embedded in the pre-trained model to assist in target sound extraction tasks. Extensive experimental results demonstrate that the proposed method saves training resources while ensuring the model's performance and generalizability. Additionally, we explore the model's ability to comprehensively utilize language/audio multi-modal and positive/negative multi-valent user queries, enhancing system performance while providing diversified application modes.

Via

Access Paper or Ask Questions

DFML: Decentralized Federated Mutual Learning

Feb 02, 2024

Yasser H. Khalil, Amir H. Estiri, Mahdi Beitollahi, Nader Asadi, Sobhan Hemati, Xu Li, Guojun Zhang, Xi Chen

Abstract:In the realm of real-world devices, centralized servers in Federated Learning (FL) present challenges including communication bottlenecks and susceptibility to a single point of failure. Additionally, contemporary devices inherently exhibit model and data heterogeneity. Existing work lacks a Decentralized FL (DFL) framework capable of accommodating such heterogeneity without imposing architectural restrictions or assuming the availability of public data. To address these issues, we propose a Decentralized Federated Mutual Learning (DFML) framework that is serverless, supports nonrestrictive heterogeneous models, and avoids reliance on public data. DFML effectively handles model and data heterogeneity through mutual learning, which distills knowledge between clients, and cyclically varying the amount of supervision and distillation signals. Extensive experimental results demonstrate consistent effectiveness of DFML in both convergence speed and global accuracy, outperforming prevalent baselines under various conditions. For example, with the CIFAR-100 dataset and 50 clients, DFML achieves a substantial increase of +17.20% and +19.95% in global accuracy under Independent and Identically Distributed (IID) and non-IID data shifts, respectively.

Via

Access Paper or Ask Questions

Parametric Feature Transfer: One-shot Federated Learning with Foundation Models

Feb 02, 2024

Mahdi Beitollahi, Alex Bie, Sobhan Hemati, Leo Maxime Brunswic, Xu Li, Xi Chen, Guojun Zhang

Abstract:In one-shot federated learning (FL), clients collaboratively train a global model in a single round of communication. Existing approaches for one-shot FL enhance communication efficiency at the expense of diminished accuracy. This paper introduces FedPFT (Federated Learning with Parametric Feature Transfer), a methodology that harnesses the transferability of foundation models to enhance both accuracy and communication efficiency in one-shot FL. The approach involves transferring per-client parametric models (specifically, Gaussian mixtures) of features extracted from foundation models. Subsequently, each parametric model is employed to generate synthetic features for training a classifier head. Experimental results on eight datasets demonstrate that FedPFT enhances the communication-accuracy frontier in both centralized and decentralized FL scenarios, as well as across diverse data-heterogeneity settings such as covariate shift and task shift, with improvements of up to 20.6%. Additionally, FedPFT adheres to the data minimization principle of FL, as clients do not send real features. We demonstrate that sending real features is vulnerable to potent reconstruction attacks. Moreover, we show that FedPFT is amenable to formal privacy guarantees via differential privacy, demonstrating favourable privacy-accuracy tradeoffs.

* 20 pages, 12 figures

Via

Access Paper or Ask Questions

neural concatenative singing voice conversion: rethinking concatenation-based approach for one-shot singing voice conversion

Dec 08, 2023

Binzhu Sha, Xu Li, Zhiyong Wu, Ying Shan, Helen Meng

Abstract:Any-to-any singing voice conversion is confronted with a significant challenge of ``timbre leakage'' issue caused by inadequate disentanglement between the content and the speaker timbre. To address this issue, this study introduces a novel neural concatenative singing voice conversion (NeuCoSVC) framework. The NeuCoSVC framework comprises a self-supervised learning (SSL) representation extractor, a neural harmonic signal generator, and a waveform synthesizer. Specifically, the SSL extractor condenses the audio into a sequence of fixed-dimensional SSL features. The harmonic signal generator produces both raw and filtered harmonic signals as the pitch information by leveraging a linear time-varying (LTV) filter. Finally, the audio generator reconstructs the audio waveform based on the SSL features, as well as the harmonic signals and the loudness information. During inference, the system performs voice conversion by substituting source SSL features with their nearest counterparts from a matching pool, which comprises SSL representations extracted from the target audio, while the raw harmonic signals and the loudness are extracted from the source audio and are kept unchanged. Since the utilized SSL features in the conversion stage are directly from the target audio, the proposed framework has great potential to address the ``timbre leakage'' issue caused by previous disentanglement-based approaches. Experimental results confirm that the proposed system delivers much better performance than the speaker embedding approach (disentanglement-based) in the context of one-shot SVC across intra-language, cross-language, and cross-domain evaluations.

Via

Access Paper or Ask Questions

Recovering Linear Causal Models with Latent Variables via Cholesky Factorization of Covariance Matrix

Nov 01, 2023

Yunfeng Cai, Xu Li, Minging Sun, Ping Li

Figure 1 for Recovering Linear Causal Models with Latent Variables via Cholesky Factorization of Covariance Matrix

Figure 2 for Recovering Linear Causal Models with Latent Variables via Cholesky Factorization of Covariance Matrix

Figure 3 for Recovering Linear Causal Models with Latent Variables via Cholesky Factorization of Covariance Matrix

Figure 4 for Recovering Linear Causal Models with Latent Variables via Cholesky Factorization of Covariance Matrix

Abstract:Discovering the causal relationship via recovering the directed acyclic graph (DAG) structure from the observed data is a well-known challenging combinatorial problem. When there are latent variables, the problem becomes even more difficult. In this paper, we first propose a DAG structure recovering algorithm, which is based on the Cholesky factorization of the covariance matrix of the observed data. The algorithm is fast and easy to implement and has theoretical grantees for exact recovery. On synthetic and real-world datasets, the algorithm is significantly faster than previous methods and achieves the state-of-the-art performance. Furthermore, under the equal error variances assumption, we incorporate an optimization procedure into the Cholesky factorization based algorithm to handle the DAG recovering problem with latent variables. Numerical simulations show that the modified "Cholesky + optimization" algorithm is able to recover the ground truth graph in most cases and outperforms existing algorithms.

Via

Access Paper or Ask Questions

HumTrans: A Novel Open-Source Dataset for Humming Melody Transcription and Beyond

Sep 18, 2023

Shansong Liu, Xu Li, Dian Li, Ying Shan

Figure 1 for HumTrans: A Novel Open-Source Dataset for Humming Melody Transcription and Beyond

Figure 2 for HumTrans: A Novel Open-Source Dataset for Humming Melody Transcription and Beyond

Figure 3 for HumTrans: A Novel Open-Source Dataset for Humming Melody Transcription and Beyond

Figure 4 for HumTrans: A Novel Open-Source Dataset for Humming Melody Transcription and Beyond

Abstract:This paper introduces the HumTrans dataset, which is publicly available and primarily designed for humming melody transcription. The dataset can also serve as a foundation for downstream tasks such as humming melody based music generation. It consists of 500 musical compositions of different genres and languages, with each composition divided into multiple segments. In total, the dataset comprises 1000 music segments. To collect this humming dataset, we employed 10 college students, all of whom are either music majors or proficient in playing at least one musical instrument. Each of them hummed every segment twice using the web recording interface provided by our designed website. The humming recordings were sampled at a frequency of 44,100 Hz. During the humming session, the main interface provides a musical score for students to reference, with the melody audio playing simultaneously to aid in capturing both melody and rhythm. The dataset encompasses approximately 56.22 hours of audio, making it the largest known humming dataset to date. The dataset will be released on Hugging Face, and we will provide a GitHub repository containing baseline results and evaluation codes.

Via

Access Paper or Ask Questions

Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training

Sep 01, 2023

Shaohuan Zhou, Xu Li, Zhiyong Wu, Ying Shan, Helen Meng

Abstract:The single-speaker singing voice synthesis (SVS) usually underperforms at pitch values that are out of the singer's vocal range or associated with limited training samples. Based on our previous work, this work proposes a melody-unsupervised multi-speaker pre-training method conducted on a multi-singer dataset to enhance the vocal range of the single-speaker, while not degrading the timbre similarity. This pre-training method can be deployed to a large-scale multi-singer dataset, which only contains audio-and-lyrics pairs without phonemic timing information and pitch annotation. Specifically, in the pre-training step, we design a phoneme predictor to produce the frame-level phoneme probability vectors as the phonemic timing information and a speaker encoder to model the timbre variations of different singers, and directly estimate the frame-level f0 values from the audio to provide the pitch information. These pre-trained model parameters are delivered into the fine-tuning step as prior knowledge to enhance the single speaker's vocal range. Moreover, this work also contributes to improving the sound quality and rhythm naturalness of the synthesized singing voices. It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice, and a bi-directional flow model to improve the sound quality. Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.

Via

Access Paper or Ask Questions

AI-Assisted Slicing-Based Resource Management for Two-Tier Radio Access Networks

Aug 21, 2023

Conghao Zhou, Jie Gao, Mushu Li, Xuemin Shen, Weihua Zhuang, Xu Li, Weisen Shi

Figure 1 for AI-Assisted Slicing-Based Resource Management for Two-Tier Radio Access Networks

Figure 2 for AI-Assisted Slicing-Based Resource Management for Two-Tier Radio Access Networks

Figure 3 for AI-Assisted Slicing-Based Resource Management for Two-Tier Radio Access Networks

Figure 4 for AI-Assisted Slicing-Based Resource Management for Two-Tier Radio Access Networks

Abstract:While network slicing has become a prevalent approach to service differentiation, radio access network (RAN) slicing remains challenging due to the need of substantial adaptivity and flexibility to cope with the highly dynamic network environment in RANs. In this paper, we develop a slicing-based resource management framework for a two-tier RAN to support multiple services with different quality of service (QoS) requirements. The developed framework focuses on base station (BS) service coverage (SC) and interference management for multiple slices, each of which corresponds to a service. New designs are introduced in the spatial, temporal, and slice dimensions to cope with spatiotemporal variations in data traffic, balance adaptivity and overhead of resource management, and enhance flexibility in service differentiation. Based on the proposed framework, an energy efficiency maximization problem is formulated, and an artificial intelligence (AI)-assisted approach is proposed to solve the problem. Specifically, a deep unsupervised learning-assisted algorithm is proposed for searching the optimal SC of the BSs, and an optimization-based analytical solution is found for managing interference among BSs. Simulation results under different data traffic distributions demonstrate that our proposed slicing-based resource management framework, empowered by the AI-assisted approach, outperforms the benchmark frameworks and achieves a close-to-optimal performance in energy efficiency.

* Accepted by IEEE Transactions on Cognitive Communications and Networking

Via

Access Paper or Ask Questions

Inferring Gene Regulatory Neural Networks for Bacterial Decision Making in Biofilms

Jan 10, 2023

Samitha Somathilaka, Daniel P. Martins, Xu Li, Yusong Li, Sasitharan Balasubramaniam

Figure 1 for Inferring Gene Regulatory Neural Networks for Bacterial Decision Making in Biofilms

Figure 2 for Inferring Gene Regulatory Neural Networks for Bacterial Decision Making in Biofilms

Figure 3 for Inferring Gene Regulatory Neural Networks for Bacterial Decision Making in Biofilms

Figure 4 for Inferring Gene Regulatory Neural Networks for Bacterial Decision Making in Biofilms

Abstract:Bacterial cells are sensitive to a range of external signals used to learn the environment. These incoming external signals are then processed using a Gene Regulatory Network (GRN), exhibiting similarities to modern computing algorithms. An in-depth analysis of gene expression dynamics suggests an inherited Gene Regulatory Neural Network (GRNN) behavior within the GRN that enables the cellular decision-making based on received signals from the environment and neighbor cells. In this study, we extract a sub-network of \textit{Pseudomonas aeruginosa} GRN that is associated with one virulence factor: pyocyanin production as a use case to investigate the GRNN behaviors. Further, using Graph Neural Network (GNN) architecture, we model a single species biofilm to reveal the role of GRNN dynamics on ecosystem-wide decision-making. Varying environmental conditions, we prove that the extracted GRNN computes input signals similar to natural decision-making process of the cell. Identifying of neural network behaviors in GRNs may lead to more accurate bacterial cell activity predictive models for many applications, including human health-related problems and agricultural applications. Further, this model can produce data on causal relationships throughout the network, enabling the possibility of designing tailor-made infection-controlling mechanisms. More interestingly, these GRNNs can perform computational tasks for bio-hybrid computing systems.

Via

Access Paper or Ask Questions

Covariance Regularization for Probabilistic Linear Discriminant Analysis

Dec 06, 2022

Zhiyuan Peng, Mingjie Shao, Xuanji He, Xu Li, Tan Lee, Ke Ding, Guanglu Wan

Abstract:Probabilistic linear discriminant analysis (PLDA) is commonly used in speaker verification systems to score the similarity of speaker embeddings. Recent studies improved the performance of PLDA in domain-matched conditions by diagonalizing its covariance. We suspect such brutal pruning approach could eliminate its capacity in modeling dimension correlation of speaker embeddings, leading to inadequate performance with domain adaptation. This paper explores two alternative covariance regularization approaches, namely, interpolated PLDA and sparse PLDA, to tackle the problem. The interpolated PLDA incorporates the prior knowledge from cosine scoring to interpolate the covariance of PLDA. The sparse PLDA introduces a sparsity penalty to update the covariance. Experimental results demonstrate that both approaches outperform diagonal regularization noticeably with domain adaptation. In addition, in-domain data can be significantly reduced when training sparse PLDA for domain adaptation.

Via

Access Paper or Ask Questions