Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kohei Saijo

FlexIO: Flexible Single- and Multi-Channel Speech Separation and Enhancement

Oct 24, 2025

Yoshiki Masuyama, Kohei Saijo, Francesco Paissan, Jiangyu Han, Marc Delcroix, Ryo Aihara, François G. Germain, Gordon Wichern, Jonathan Le Roux

Abstract:Speech separation and enhancement (SSE) has advanced remarkably and achieved promising results in controlled settings, such as a fixed number of speakers and a fixed array configuration. Towards a universal SSE system, single-channel systems have been extended to deal with a variable number of speakers (i.e., outputs). Meanwhile, multi-channel systems accommodating various array configurations (i.e., inputs) have been developed. However, these attempts have been pursued separately. In this paper, we propose a flexible input and output SSE system, named FlexIO. It performs conditional separation using prompt vectors, one per speaker as a condition, allowing separation of an arbitrary number of speakers. Multi-channel mixtures are processed together with the prompt vectors via an array-agnostic channel communication mechanism. Our experiments demonstrate that FlexIO successfully covers diverse conditions with one to five microphones and one to three speakers. We also confirm the robustness of FlexIO on CHiME-4 real data.

* Submitted to ICASSP 2026

Via

Access Paper or Ask Questions

FasTUSS: Faster Task-Aware Unified Source Separation

Jul 15, 2025

Francesco Paissan, Gordon Wichern, Yoshiki Masuyama, Ryo Aihara, François G. Germain, Kohei Saijo, Jonathan Le Roux

Abstract:Time-Frequency (TF) dual-path models are currently among the best performing audio source separation network architectures, achieving state-of-the-art performance in speech enhancement, music source separation, and cinematic audio source separation. While they are characterized by a relatively low parameter count, they still require a considerable number of operations, implying a higher execution time. This problem is exacerbated by the trend towards bigger models trained on large amounts of data to solve more general tasks, such as the recently introduced task-aware unified source separation (TUSS) model. TUSS, which aims to solve audio source separation tasks using a single, conditional model, is built upon TF-Locoformer, a TF dual-path model combining convolution and attention layers. The task definition comes in the form of a sequence of prompts that specify the number and type of sources to be extracted. In this paper, we analyze the design choices of TUSS with the goal of optimizing its performance-complexity trade-off. We derive two more efficient models, FasTUSS-8.3G and FasTUSS-11.7G that reduce the original model's operations by 81\% and 73\% with minor performance drops of 1.2~dB and 0.4~dB averaged over all benchmarks, respectively. Additionally, we investigate the impact of prompt conditioning to derive a causal TUSS model.

* Accepted to WASPAA 2025

Via

Access Paper or Ask Questions

Interspeech 2025 URGENT Speech Enhancement Challenge

May 29, 2025

Kohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Yihui Fu, Wei Wang(+2 more)

Abstract:There has been a growing effort to develop universal speech enhancement (SE) to handle inputs with various speech distortions and recording conditions. The URGENT Challenge series aims to foster such universal SE by embracing a broad range of distortion types, increasing data diversity, and incorporating extensive evaluation metrics. This work introduces the Interspeech 2025 URGENT Challenge, the second edition of the series, to explore several aspects that have received limited attention so far: language dependency, universality for more distortion types, data scalability, and the effectiveness of using noisy training data. We received 32 submissions, where the best system uses a discriminative model, while most other competitive ones are hybrid methods. Analysis reveals some key findings: (i) some generative or hybrid approaches are preferred in subjective evaluations over the top discriminative model, and (ii) purely generative SE models can exhibit language dependency.

* Accepted to Interspeech 2025

Via

Access Paper or Ask Questions

Is MixIT Really Unsuitable for Correlated Sources? Exploring MixIT for Unsupervised Pre-training in Music Source Separation

May 12, 2025

Kohei Saijo, Yoshiaki Bando

Abstract:In music source separation (MSS), obtaining isolated sources or stems is highly costly, making pre-training on unlabeled data a promising approach. Although source-agnostic unsupervised learning like mixture-invariant training (MixIT) has been explored in general sound separation, they have been largely overlooked in MSS due to its implicit assumption of source independence. We hypothesize, however, that the difficulty of applying MixIT to MSS arises from the ill-posed nature of MSS itself, where stem definitions are application-dependent and models lack explicit knowledge of what should or should not be separated, rather than from high inter-source correlation. While MixIT does not assume any source model and struggles with such ambiguities, our preliminary experiments show that it can still separate instruments to some extent, suggesting its potential for unsupervised pre-training. Motivated by these insights, this study investigates MixIT-based pre-training for MSS. We first pre-train a model on in-the-wild, unlabeled data from the Free Music Archive using MixIT, and then fine-tune it on MUSDB18 with supervision. Using the band-split TF-Locoformer, one of the state-of-the-art MSS models, we demonstrate that MixIT-based pre-training improves the performance over training from scratch.

* 5 pages, 1 figure, 3 tables

Via

Access Paper or Ask Questions

A Comparative Study on Positional Encoding for Time-frequency Domain Dual-path Transformer-based Source Separation Models

Apr 28, 2025

Kohei Saijo, Tetsuji Ogawa

Abstract:In this study, we investigate the impact of positional encoding (PE) on source separation performance and the generalization ability to long sequences (length extrapolation) in Transformer-based time-frequency (TF) domain dual-path models. The length extrapolation capability in TF-domain dual-path models is a crucial factor, as it affects not only their performance on long-duration inputs but also their generalizability to signals with unseen sampling rates. While PE is known to significantly impact length extrapolation, there has been limited research that explores the choice of PEs for TF-domain dual-path models from this perspective. To address this gap, we compare various PE methods using a recent state-of-the-art model, TF-Locoformer, as the base architecture. Our analysis yields the following key findings: (i) When handling sequences that are the same length as or shorter than those seen during training, models with PEs achieve better performance. (ii) However, models without PE exhibit superior length extrapolation. This trend is particularly pronounced when the model contains convolutional layers.

* 5 pages, 3 tables, 2 figures

Via

Access Paper or Ask Questions

Task-Aware Unified Source Separation

Oct 31, 2024

Kohei Saijo, Janek Ebbers, François G. Germain, Gordon Wichern, Jonathan Le Roux

Abstract:Several attempts have been made to handle multiple source separation tasks such as speech enhancement, speech separation, sound event separation, music source separation (MSS), or cinematic audio source separation (CASS) with a single model. These models are trained on large-scale data including speech, instruments, or sound events and can often successfully separate a wide range of sources. However, it is still challenging for such models to cover all separation tasks because some of them are contradictory (e.g., musical instruments are separated in MSS while they have to be grouped in CASS). To overcome this issue and support all the major separation tasks, we propose a task-aware unified source separation (TUSS) model. The model uses a variable number of learnable prompts to specify which source to separate, and changes its behavior depending on the given prompts, enabling it to handle all the major separation tasks including contradictory ones. Experimental results demonstrate that the proposed TUSS model successfully handles the five major separation tasks mentioned earlier. We also provide some audio examples, including both synthetic mixtures and real recordings, to demonstrate how flexibly the TUSS model changes its behavior at inference depending on the prompts.

* Submitted to ICASSP 2025

Via

Access Paper or Ask Questions

Enhanced Reverberation as Supervision for Unsupervised Speech Separation

Aug 06, 2024

Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux

Figure 1 for Enhanced Reverberation as Supervision for Unsupervised Speech Separation

Figure 2 for Enhanced Reverberation as Supervision for Unsupervised Speech Separation

Figure 3 for Enhanced Reverberation as Supervision for Unsupervised Speech Separation

Figure 4 for Enhanced Reverberation as Supervision for Unsupervised Speech Separation

Abstract:Reverberation as supervision (RAS) is a framework that allows for training monaural speech separation models from multi-channel mixtures in an unsupervised manner. In RAS, models are trained so that sources predicted from a mixture at an input channel can be mapped to reconstruct a mixture at a target channel. However, stable unsupervised training has so far only been achieved in over-determined source-channel conditions, leaving the key determined case unsolved. This work proposes enhanced RAS (ERAS) for solving this problem. Through qualitative analysis, we found that stable training can be achieved by leveraging the loss term to alleviate the frequency-permutation problem. Separation performance is also boosted by adding a novel loss term where separated signals mapped back to their own input mixture are used as pseudo-targets for the signals separated from other channels and mapped to the same channel. Experimental results demonstrate high stability and performance of ERAS.

* Accepted to Interspeech 2024

Via

Access Paper or Ask Questions

TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Aug 06, 2024

Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux

Figure 1 for TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Figure 2 for TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Figure 3 for TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Figure 4 for TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Abstract:Time-frequency (TF) domain dual-path models achieve high-fidelity speech separation. While some previous state-of-the-art (SoTA) models rely on RNNs, this reliance means they lack the parallelizability, scalability, and versatility of Transformer blocks. Given the wide-ranging success of pure Transformer-based architectures in other fields, in this work we focus on removing the RNN from TF-domain dual-path models, while maintaining SoTA performance. This work presents TF-Locoformer, a Transformer-based model with LOcal-modeling by COnvolution. The model uses feed-forward networks (FFNs) with convolution layers, instead of linear layers, to capture local information, letting the self-attention focus on capturing global patterns. We place two such FFNs before and after self-attention to enhance the local-modeling capability. We also introduce a novel normalization for TF-domain dual-path models. Experiments on separation and enhancement datasets show that the proposed model meets or exceeds SoTA in multiple benchmarks with an RNN-free architecture.

* Accepted to IWAENC 2024

Via

Access Paper or Ask Questions

URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Jun 07, 2024

Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe(+2 more)

Figure 1 for URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Figure 2 for URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Figure 3 for URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Figure 4 for URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Abstract:The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generalizability of SE. We aim to extend the SE definition to cover different sub-tasks to explore the limits of SE models, starting from denoising, dereverberation, bandwidth extension, and declipping. A novel framework is proposed to unify all these sub-tasks in a single model, allowing the use of all existing SE approaches. We collected public speech and noise data from different domains to construct diverse evaluation data. Finally, we discuss the insights gained from our preliminary baseline experiments based on both generative and discriminative SE methods with 12 curated metrics.

* 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix

Via

Access Paper or Ask Questions

Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

Jun 06, 2024

Wangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe, Yanmin Qian

Figure 1 for Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

Figure 2 for Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

Figure 3 for Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

Figure 4 for Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

Abstract:Deep learning-based speech enhancement (SE) models have achieved impressive performance in the past decade. Numerous advanced architectures have been designed to deliver state-of-the-art performance; however, their scalability potential remains unrevealed. Meanwhile, the majority of research focuses on small-sized datasets with restricted diversity, leading to a plateau in performance improvement. In this paper, we aim to provide new insights for addressing the above issues by exploring the scalability of SE models in terms of architectures, model sizes, compute budgets, and dataset sizes. Our investigation involves several popular SE architectures and speech data from different domains. Experiments reveal both similarities and distinctions between the scaling effects in SE and other tasks such as speech recognition. These findings further provide insights into the under-explored SE directions, e.g., larger-scale multi-domain corpora and efficiently scalable architectures.

* 5 pages, 3 figures, 4 tables, Accepted by Interspeech 2024

Via

Access Paper or Ask Questions