Runwu Shi

Unsupervised Single-Channel Audio Separation with Diffusion Source Priors

Dec 23, 2025

Runwu Shi, Chang Li, Jiang Wang, Rui Zhang, Nabeela Khan, Benjamin Yen, Takeshi Ashizawa, Kazuhiro Nakadai

Abstract: Single-channel audio separation aims to separate individual sources from a single-channel mixture. Most existing methods rely on supervised learning with synthetically generated paired data. However, obtaining high-quality paired data in real-world scenarios is often difficult. This data scarcity can degrade model performance under unseen conditions and limit generalization ability. To address this, we approach the problem from an unsupervised perspective, framing it as a probabilistic inverse problem. Our method requires only diffusion priors trained on individual sources. Separation is then achieved by iteratively guiding an initial state toward the solution through reconstruction guidance. Importantly, we introduce an advanced inverse problem solver specifically designed for separation, which mitigates gradient conflicts caused by interference between the diffusion prior and reconstruction guidance during inverse denoising. This design ensures high-quality and balanced separation performance across individual sources. Additionally, we find that initializing the denoising process with an augmented mixture instead of pure Gaussian noise provides an informative starting point that significantly improves the final performance. To further enhance audio prior modeling, we design a novel time-frequency attention-based network architecture that demonstrates strong audio modeling capability. Collectively, these improvements lead to significant performance gains, as validated across speech-sound event, sound event, and speech separation tasks.
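
The idea of combining per-source priors with reconstruction guidance can be illustrated with a deliberately tiny toy. This is not the paper's solver: analytic Gaussian priors stand in for trained diffusion priors, the sources are scalars, and the function names, guidance weight, step size, and even-split initialization are all assumptions made for this sketch. Each update adds the gradient of the source's log-prior to a guidance term that pulls the sum of the estimates toward the observed mixture.

```python
# Toy illustration only: analytic Gaussian priors stand in for trained
# diffusion priors, and every name and constant here is hypothetical.

def prior_score(x, mean, std):
    """Gradient of log N(x; mean, std^2) with respect to x."""
    return -(x - mean) / (std ** 2)

def separate(mixture, priors, guidance_weight=100.0, step=0.004, n_steps=10000):
    """Gradient ascent on sum_i log p_i(x_i) - (w / 2) * (mixture - sum_i x_i)^2."""
    # Start from an even split of the mixture (an assumed initialization).
    estimates = [mixture / len(priors)] * len(priors)
    for _ in range(n_steps):
        # Reconstruction residual: how far the estimates are from explaining the mixture.
        residual = mixture - sum(estimates)
        estimates = [
            x + step * (prior_score(x, mean, std) + guidance_weight * residual)
            for x, (mean, std) in zip(estimates, priors)
        ]
    return estimates

# Two scalar "sources" with priors N(0, 1^2) and N(5, 2^2), observed mixture 7.0.
priors = [(0.0, 1.0), (5.0, 2.0)]
x1, x2 = separate(7.0, priors)
print(x1, x2, x1 + x2)
```

In this toy, the estimates sum to approximately the mixture, and the source with the broader prior absorbs more of the mismatch between the mixture and the prior means.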

Single-Channel Target Speech Extraction Utilizing Distance and Room Clues

May 20, 2025

Runwu Shi, Zirui Lin, Benjamin Yen, Jiang Wang, Ragib Amin Nihal, Kazuhiro Nakadai

Abstract: This paper aims to achieve single-channel target speech extraction (TSE) in enclosures utilizing distance clues and room information. Recent works have verified the feasibility of distance clues for the TSE task: they imply the sound source's direct-to-reverberation ratio (DRR) and can thus be utilized in speech separation and TSE systems. However, such a distance clue is significantly influenced by the room's acoustic characteristics, such as dimensions and reverberation time, making it challenging for TSE systems that rely solely on distance clues to generalize across different rooms. To solve this, we suggest providing room environmental information (room dimensions and reverberation time) to distance-based TSE for better generalization. Specifically, we propose a distance and environment-based TSE model in the time-frequency (TF) domain with learnable distance and room embeddings. Results on both simulated and real-world datasets demonstrate its feasibility. Demonstration materials are available at https://runwushi.github.io/distance-room-demo-page/.

* 5 pages, 3 figures, accepted by EUSIPCO 2025
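
To make the link between distance and DRR in the abstract above concrete, the following minimal sketch estimates the DRR of a room impulse response (RIR) as the energy in a short window around the direct-path peak divided by the energy everywhere else. It is not taken from the paper: the 2.5 ms window, the 343 m/s speed of sound, and the synthetic `toy_rir` (a 1/distance direct path plus a fixed decaying tail) are assumptions chosen only for illustration.

```python
import math

def drr_db(rir, sample_rate, direct_window_ms=2.5):
    """Direct-to-reverberant ratio in dB for an RIR given as a list of floats."""
    # Treat the largest-magnitude sample as the direct-path arrival.
    peak = max(range(len(rir)), key=lambda i: abs(rir[i]))
    half = int(sample_rate * direct_window_ms / 1000)
    start, end = max(0, peak - half), min(len(rir), peak + half + 1)
    direct = sum(x * x for x in rir[start:end])
    reverb = sum(x * x for x in rir[:start]) + sum(x * x for x in rir[end:])
    if reverb == 0:
        return math.inf
    return 10 * math.log10(direct / reverb)

def toy_rir(distance_m, sample_rate=16000, length=4000):
    """Hypothetical RIR: 1/distance direct path plus a distance-independent tail."""
    rir = [0.0] * length
    # Direct path arrives after distance / speed of sound, with 1/distance amplitude.
    rir[round(distance_m / 343.0 * sample_rate)] = 1.0 / distance_m
    # Exponentially decaying, sign-alternating tail starting at 50 ms.
    for n in range(800, length):
        rir[n] = 0.05 * math.exp(-(n - 800) / 1000.0) * (1 if n % 2 == 0 else -1)
    return rir

drr_near = drr_db(toy_rir(1.0), 16000)
drr_far = drr_db(toy_rir(4.0), 16000)
print(drr_near, drr_far)
```

In this toy, only the direct path depends on distance, so moving the source from 1 m to 4 m lowers the DRR by 20·log10(4) dB. In a real room the tail depends on dimensions and reverberation time, which is why the same distance can produce different DRR values across rooms.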

Distance Based Single-Channel Target Speech Extraction

Dec 28, 2024

Runwu Shi, Benjamin Yen, Kazuhiro Nakadai

Abstract: This paper aims to achieve single-channel target speech extraction (TSE) in enclosures by utilizing only distance information. This is the first work that utilizes only distance cues, without speaker physiological information, for single-channel TSE. Inspired by recent single-channel distance-based separation and extraction methods, we introduce a novel model that efficiently fuses distance information with time-frequency (TF) bins for TSE. Experimental results in both single-room and multi-room scenarios demonstrate the feasibility and effectiveness of our approach. This method can also be employed to estimate the distances of different speakers in mixed speech. Online demos are available at https://runwushi.github.io/distance-demo-page.

* 5 pages, 3 figures, accepted by ICASSP 2025

Bird Vocalization Embedding Extraction Using Self-Supervised Disentangled Representation Learning

Dec 28, 2024

Runwu Shi, Katsutoshi Itoyama, Kazuhiro Nakadai

Abstract: This paper addresses the extraction of bird vocalization embeddings at the whole-song level using disentangled representation learning (DRL). Bird vocalization embeddings are necessary for large-scale bioacoustic tasks, and self-supervised methods such as the Variational Autoencoder (VAE) have proven effective at extracting such low-dimensional embeddings from vocalization segments at the note or syllable level. To extend the processing level to the entire song instead of cutting it into segments, this paper regards each vocalization as consisting of a generalized part and a discriminative part and uses two encoders to learn these two parts. The proposed method is evaluated on the Great Tits dataset according to clustering performance, and it outperforms the compared pre-trained models and a vanilla VAE. Finally, this paper analyzes the informative part of the embedding, further compresses its dimension, and explains the disentangled performance of bird vocalizations.

* Presented at Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR 2024), https://vihar-2024.vihar.org/assets/VIHAR_2024_proceedings.pdf
