Gordon Wichern

The Sound Demixing Challenge 2023 – Cinematic Demixing Track

Aug 14, 2023
Stefan Uhlich, Giorgio Fabbro, Masato Hirano, Shusuke Takahashi, Gordon Wichern, Jonathan Le Roux, Dipam Chakraborty, Sharada Mohanty, Kai Li, Yi Luo, Jianwei Yu, Rongzhi Gu, Roman Solovyev, Alexander Stempkovskiy, Tatiana Habruseva, Mikhail Sukhovei, Yuki Mitsufuji

This paper summarizes the cinematic demixing (CDX) track of the Sound Demixing Challenge 2023 (SDX'23). We provide a comprehensive summary of the challenge setup, detailing the structure of the competition and the datasets used. In particular, we describe CDXDB23, a new hidden dataset constructed from real movies that was used to rank the submissions. The paper also offers insights into the most successful approaches employed by participants. Compared to the cocktail-fork baseline, the best-performing system trained exclusively on the simulated Divide and Remaster (DnR) dataset achieved an improvement of 1.8 dB in SDR, whereas the top-performing system on the open leaderboard, where any data could be used for training, saw a significant improvement of 5.7 dB.

* under review 
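
For readers unfamiliar with the ranking metric, the snippet below is a minimal sketch of a plain (non-scale-invariant) signal-to-distortion ratio between a reference stem and its estimate, computed with NumPy. The `sdr` helper and the toy signals are illustrative assumptions; the exact metric definition and averaging used on the CDX leaderboards are those specified by the challenge organizers.

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Plain signal-to-distortion ratio in dB between a reference stem and its estimate."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return float(10.0 * np.log10((num + eps) / (den + eps)))

# Toy example: a clean stem and an estimate with a small residual error.
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)               # 1 s of audio at 16 kHz
est = ref + 0.1 * rng.standard_normal(16000)   # imperfect separation estimate
print(f"SDR: {sdr(ref, est):.1f} dB")          # roughly 20 dB for this toy case
```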

Pac-HuBERT: Self-Supervised Music Source Separation via Primitive Auditory Clustering and Hidden-Unit BERT

Apr 04, 2023
Ke Chen, Gordon Wichern, François G. Germain, Jonathan Le Roux

In spite of the progress in music source separation research, the small amount of publicly available clean source data remains a constant limiting factor for performance. Thus, recent advances in self-supervised learning present a largely unexplored opportunity for improving separation models by leveraging unlabelled music data. In this paper, we propose a self-supervised learning framework for music source separation inspired by the HuBERT speech representation model. We first investigate the potential impact of the original HuBERT model by inserting an adapted version of it into the well-known Demucs V2 time-domain separation model architecture. We then propose a time-frequency-domain self-supervised model, Pac-HuBERT (for primitive auditory clustering HuBERT), that we later use in combination with a Res-U-Net decoder for source separation. Pac-HuBERT uses primitive auditory features of music as unsupervised clustering labels to initialize the self-supervised pretraining process using the Free Music Archive (FMA) dataset. The resulting framework achieves better source-to-distortion ratio (SDR) performance on the MusDB18 test set than the original Demucs V2 and Res-U-Net models. We further demonstrate that it can boost performance with small amounts of supervised data. Ultimately, our proposed framework is an effective solution to the challenge of limited clean source data for music source separation.

* 5 pages, 2 figures, 3 tables 
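
As a rough illustration of how clustering-derived pseudo-labels can drive HuBERT-style masked prediction pretraining, the sketch below clusters per-frame log-magnitude STFT features with k-means. The actual primitive auditory features used by Pac-HuBERT are defined in the paper, so the `frame_pseudo_labels` helper and its parameters are stand-in assumptions.

```python
import numpy as np
from scipy.signal import stft
from sklearn.cluster import KMeans

def frame_pseudo_labels(audio: np.ndarray, sr: int = 16000,
                        n_clusters: int = 64, seed: int = 0) -> np.ndarray:
    """Cluster per-frame spectral features into discrete pseudo-labels.

    The labels stand in for clustering-based targets: a HuBERT-style model is
    then pretrained to predict the label of masked frames from the unmasked context.
    """
    _, _, Z = stft(audio, fs=sr, nperseg=1024, noverlap=768)
    logmag = np.log1p(np.abs(Z)).T                   # (frames, freq bins)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(logmag)                    # one integer label per frame

# Toy usage on noise; real pretraining would run over a large unlabelled corpus such as FMA.
labels = frame_pseudo_labels(np.random.default_rng(0).standard_normal(4 * 16000))
print(labels[:10], labels.max() + 1)
```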

TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings

Mar 08, 2023
Christoph Boeddeker, Aswin Shanmugam Subramanian, Gordon Wichern, Reinhold Haeb-Umbach, Jonathan Le Roux

Since diarization and source separation of meeting data are closely related tasks, we propose an approach that performs the two jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at a time-frequency resolution. These estimates act as masks for source extraction, either via masking or via beamforming. The technique can be applied to both single-channel and multi-channel input and, in both cases, achieves a new state-of-the-art word error rate (WER) on the LibriCSS meeting data recognition task. We further compute speaker-aware and speaker-agnostic WERs to isolate the contribution of diarization errors to the overall WER performance.

* Submitted to IEEE/ACM TASLP 
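
The sketch below illustrates the core idea of reusing time-frequency-resolution speaker activities as extraction masks. The function name, tensor shapes, and the simple normalization are assumptions for illustration; the paper's actual masking and beamforming formulations may differ.

```python
import numpy as np

def extract_by_tf_activity(mix_stft: np.ndarray, tf_activity: np.ndarray,
                           eps: float = 1e-8) -> np.ndarray:
    """Use per-speaker time-frequency activity estimates as extraction masks.

    mix_stft:    (freq, frames) complex mixture STFT
    tf_activity: (speakers, freq, frames) nonnegative activity estimates
    Returns per-speaker STFTs obtained by masking; a beamformer could use the
    same activities as spatial statistics in the multi-channel case.
    """
    masks = tf_activity / (tf_activity.sum(axis=0, keepdims=True) + eps)
    return masks * mix_stft[None]                    # broadcast over the speaker axis

# Toy shapes only.
rng = np.random.default_rng(0)
mix = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
act = rng.random((3, 257, 100))
print(extract_by_tf_activity(mix, act).shape)        # (3, 257, 100)
```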

Tackling the Cocktail Fork Problem for Separation and Transcription of Real-World Soundtracks

Dec 14, 2022
Darius Petermann, Gordon Wichern, Aswin Shanmugam Subramanian, Zhong-Qiu Wang, Jonathan Le Roux

Emulating the human ability to solve the cocktail party problem, i.e., to focus on a source of interest in a complex acoustic scene, is a long-standing goal of audio source separation research. Much of this research investigates separating speech from noise, speech from speech, musical instruments from each other, or sound events from each other. In this paper, we focus on the cocktail fork problem, which takes a three-pronged approach to source separation by separating an audio mixture such as a movie soundtrack or podcast into the three broad categories of speech, music, and sound effects (SFX, understood to include ambient noise and natural sound events). We benchmark the performance of several deep learning-based source separation models on this task and evaluate them with respect to simple objective measures such as signal-to-distortion ratio (SDR) as well as objective metrics that better correlate with human perception. Furthermore, we thoroughly evaluate how source separation can influence downstream transcription tasks. First, we investigate the task of activity detection on the three sources as a way to both further improve source separation and perform transcription. We formulate the transcription tasks as speech recognition for speech and audio tagging for music and SFX. We observe that, while the use of source separation estimates improves transcription performance in comparison to the original soundtrack, performance is still sub-optimal due to artifacts introduced by the separation process. Therefore, we thoroughly investigate how remixing the three separated source stems at various relative levels can reduce artifacts and consequently improve transcription performance. We find that remixing music and SFX interference at a target SNR of 17.5 dB reduces the speech recognition word error rate, and a similar benefit from remixing is observed for tagging music and SFX content.

* Submitted to IEEE TASLP (In review), 13 pages, 6 figures 
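
As a sketch of the remixing idea, the following helper rescales a separated background (music + SFX) estimate so that it sits a target SNR below the separated speech before adding it back. The `remix_at_target_snr` helper, its power-based gain, and the toy signals are assumptions; the paper's exact remixing procedure may differ in detail.

```python
import numpy as np

def remix_at_target_snr(speech_est: np.ndarray, background_est: np.ndarray,
                        target_snr_db: float = 17.5, eps: float = 1e-8) -> np.ndarray:
    """Remix a separated speech estimate with an attenuated background estimate.

    Scales the (music + SFX) estimate so that its level sits target_snr_db below
    the speech estimate, then adds it back, mirroring the idea of reducing
    separation artifacts by partially restoring the interference.
    """
    speech_pow = np.mean(speech_est ** 2)
    bg_pow = np.mean(background_est ** 2) + eps
    gain = np.sqrt(speech_pow / (bg_pow * 10 ** (target_snr_db / 10)))
    return speech_est + gain * background_est

rng = np.random.default_rng(0)
speech, background = rng.standard_normal(16000), rng.standard_normal(16000)
remixed = remix_at_target_snr(speech, background)    # input to a downstream ASR system
```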

Hyperbolic Audio Source Separation

Dec 09, 2022
Darius Petermann, Gordon Wichern, Aswin Subramanian, Jonathan Le Roux

We introduce a framework for audio source separation using embeddings on a hyperbolic manifold that compactly represent the hierarchical relationship between sound sources and time-frequency features. Inspired by recent successes in modeling hierarchical relationships in text and images with hyperbolic embeddings, our algorithm obtains a hyperbolic embedding for each time-frequency bin of a mixture signal and estimates masks using hyperbolic softmax layers. On a synthetic dataset containing mixtures of multiple people talking and musical instruments playing, our hyperbolic model performed comparably to a Euclidean baseline in terms of source-to-distortion ratio, with stronger performance at low embedding dimensions. Furthermore, we find that time-frequency regions containing multiple overlapping sources are embedded towards the center (i.e., the most uncertain region) of the hyperbolic space, and we can use this certainty estimate to efficiently trade off between artifact introduction and interference reduction when isolating individual sounds.

* Submitted to ICASSP 2023, Demo page: https://darius522.github.io/hyperbolic-audio-sep/ 
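
To make the certainty interpretation concrete, the snippet below computes the standard Poincaré-ball distance of each embedding from the origin (curvature -1 convention); embeddings near the center of the ball are the "uncertain" ones. The helper name and the two toy embeddings are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def poincare_norm(emb: np.ndarray) -> np.ndarray:
    """Hyperbolic distance of each embedding from the origin of the Poincaré ball.

    emb: (..., dim) points with Euclidean norm < 1.
    Small distances correspond to the uncertain center of the ball, large
    distances to confident, well-separated time-frequency bins.
    """
    r = np.clip(np.linalg.norm(emb, axis=-1), 0.0, 1.0 - 1e-7)
    return 2.0 * np.arctanh(r)

# Toy example: one TF-bin embedding near the center, one near the boundary.
emb = np.array([[0.05, 0.0], [0.0, 0.95]])
print(poincare_norm(emb))    # the second bin is far more "certain"
```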

Latent Iterative Refinement for Modular Source Separation

Nov 22, 2022
Dimitrios Bralios, Efthymios Tzinis, Gordon Wichern, Paris Smaragdis, Jonathan Le Roux

Traditional source separation approaches train deep neural network models end-to-end with all the data available at once by minimizing the empirical risk on the whole training set. On the inference side, after training, the user fetches a static computation graph and runs the full model on some specified observed mixture signal to obtain the estimated source signals. Additionally, many of those models consist of several basic processing blocks that are applied sequentially. We argue that we can significantly increase resource efficiency during both training and inference by reformulating a model's training and inference procedures as iterative mappings of latent signal representations. First, we can apply the same processing block more than once to its own output to refine the signal and consequently improve parameter efficiency. During training, we can follow a block-wise procedure, which enables a reduction in memory requirements. Thus, one can train a very complicated network structure using significantly less computation compared to end-to-end training. During inference, we can dynamically adjust how many processing blocks and iterations of a specific block an input signal needs using a gating module.
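
A minimal PyTorch sketch of the idea, under the assumption of a generic residual block and a scalar gate (both placeholders rather than the paper's architecture): one block is applied repeatedly to its own output, and a learned gate decides at inference time when further refinement is no longer worth the extra compute.

```python
import torch
import torch.nn as nn

class IterativeRefiner(nn.Module):
    """Reuse one processing block several times on a latent representation."""

    def __init__(self, dim: int = 128, max_iters: int = 4):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate = nn.Linear(dim, 1)
        self.max_iters = max_iters

    def forward(self, latent: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        for _ in range(self.max_iters):
            latent = latent + self.block(latent)            # residual re-application
            keep_going = torch.sigmoid(self.gate(latent)).mean()
            if keep_going < threshold:                      # gate says "good enough"
                break
        return latent

x = torch.randn(2, 100, 128)     # (batch, frames, latent dim)
print(IterativeRefiner()(x).shape)
```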

Reverberation as Supervision for Speech Separation

Nov 15, 2022
Rohith Aralikatti, Christoph Boeddeker, Gordon Wichern, Aswin Shanmugam Subramanian, Jonathan Le Roux

This paper proposes reverberation as supervision (RAS), a novel unsupervised loss function for single-channel reverberant speech separation. Prior methods for unsupervised separation required the synthesis of mixtures of mixtures or assumed the existence of a teacher model, making them difficult to consider as potential methods explaining the emergence of separation abilities in an animal's auditory system. We assume the availability of two-channel mixtures at training time, and train a neural network to separate the sources given one of the channels as input such that the other channel may be predicted from the separated sources. As the relationship between the room impulse responses (RIRs) of each channel depends on the locations of the sources, which are unknown to the network, the network cannot rely on learning that relationship. Instead, our proposed loss function fits each of the separated sources to the mixture in the target channel via Wiener filtering, and compares the resulting mixture to the ground-truth one. We show that minimizing the scale-invariant signal-to-distortion ratio (SI-SDR) of the predicted right-channel mixture with respect to the ground truth implicitly guides the network towards separating the left-channel sources. On a semi-supervised reverberant speech separation task based on the WHAMR! dataset, using training data where just 5% (resp., 10%) of the mixtures are labeled with associated isolated sources, we achieve 70% (resp., 78%) of the SI-SDR improvement obtained when training with supervision on the full training set, while a model trained only on the labeled data obtains 43% (resp., 45%).

* 5 pages, 2 figures, 4 tables. Submitted to ICASSP 2023 
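
For reference, the snippet below is a generic SI-SDR implementation of the kind used as a training criterion. In RAS it would be applied to the remixed prediction of the second channel (the sum of Wiener-filtered source estimates) against the observed mixture, so the toy signals here only stand in for those quantities; the helper name and zero-mean convention are assumptions.

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    ref = reference - reference.mean()
    est = estimate - estimate.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)     # optimal scaling of the reference
    target = alpha * ref
    noise = est - target
    return float(10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps)))

# Toy stand-ins for the true and predicted second-channel mixtures.
rng = np.random.default_rng(0)
right_true = rng.standard_normal(16000)
right_pred = right_true + 0.2 * rng.standard_normal(16000)
print(f"{si_sdr(right_true, right_pred):.1f} dB")
```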

Meta-Learning of Neural State-Space Models Using Data From Similar Systems

Nov 14, 2022
Ankush Chakrabarty, Gordon Wichern, Christopher R. Laughman

Deep neural state-space models (SSMs) provide a powerful tool for modeling dynamical systems solely from operational data. Typically, neural SSMs are trained using data collected from the actual system under consideration, despite the likely existence of operational data from similar systems that have previously been deployed in the field. In this paper, we propose the use of model-agnostic meta-learning (MAML) for constructing deep encoder network-based SSMs, by leveraging a combination of archived data from similar systems (used to meta-train offline) and limited data from the actual system (used for rapid online adaptation). We demonstrate using a numerical example that meta-learning can result in more accurate neural SSMs than supervised or transfer learning, despite few adaptation steps and limited online data. Additionally, we show that by carefully partitioning and adapting the encoder layers while fixing the state-transition operator, we can achieve comparable performance to MAML while reducing online adaptation complexity.

* Submitted for conference publication 
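
The sketch below shows a first-order MAML meta-update on a toy next-state predictor: each task's copy of the model is adapted on support data, and the averaged post-adaptation query gradients update the shared initialization. The `ToySSM` model, task construction, and hyperparameters are placeholders; the paper's encoder-based SSM and its partial adaptation of encoder layers are not reproduced here.

```python
import copy
import torch
import torch.nn as nn

class ToySSM(nn.Module):
    """Toy neural state-space model: predicts the next state from state and input."""
    def __init__(self, state_dim=4, input_dim=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + input_dim, hidden),
                                 nn.Tanh(), nn.Linear(hidden, state_dim))
    def forward(self, x, u):
        return self.net(torch.cat([x, u], dim=-1))

def fomaml_step(model, tasks, inner_lr=1e-2, inner_steps=3, outer_lr=1e-3):
    """One first-order MAML meta-update over a batch of (support, query) tasks."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    loss_fn = nn.MSELoss()
    for (xs, us, ys), (xq, uq, yq) in tasks:
        adapted = copy.deepcopy(model)                       # per-task copy of the shared init
        opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                         # rapid task adaptation
            opt.zero_grad(); loss_fn(adapted(xs, us), ys).backward(); opt.step()
        adapted.zero_grad()
        loss_fn(adapted(xq, uq), yq).backward()              # query loss after adaptation
        for g, p in zip(meta_grads, adapted.parameters()):
            g += p.grad / len(tasks)
    with torch.no_grad():                                    # outer update of the shared init
        for p, g in zip(model.parameters(), meta_grads):
            p -= outer_lr * g

def toy_task(n=64):
    """Random data standing in for trajectories from one 'similar system'."""
    return (torch.randn(n, 4), torch.randn(n, 2), torch.randn(n, 4))

model = ToySSM()
fomaml_step(model, [(toy_task(), toy_task()) for _ in range(4)])
```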

Optimal Condition Training for Target Source Separation

Nov 11, 2022
Efthymios Tzinis, Gordon Wichern, Paris Smaragdis, Jonathan Le Roux

Recent research has shown remarkable performance in leveraging multiple extraneous, non-mutually exclusive conditional semantic concepts for sound source separation, allowing the flexibility to extract a given target source based on multiple different queries. In this work, we propose a new optimal condition training (OCT) method for single-channel target source separation, based on greedy parameter updates using the highest-performing condition among equivalent conditions associated with a given target source. Our experiments show that the complementary information carried by the diverse semantic concepts significantly helps to disentangle and isolate sources of interest much more efficiently than single-conditioned models. Moreover, we propose a variation of OCT with condition refinement, in which an initial conditional vector is adapted to the given mixture and transformed into a more amenable representation for target source extraction. We showcase the effectiveness of OCT on diverse source separation experiments, where it improves upon permutation-invariant models with oracle assignment and obtains state-of-the-art performance in the more challenging task of text-based source separation, outperforming even dedicated text-only conditioned models.

* Submitted to ICASSP 2023 
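
A minimal sketch of the greedy update rule, assuming a toy conditioned extractor: the loss is evaluated under every equivalent condition for a target, and only the best-performing condition per example contributes to the gradient. All module and tensor names here are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConditionedSeparator(nn.Module):
    """Toy conditioned extractor: mixture features + condition embedding -> target estimate."""
    def __init__(self, n_feats=257, cond_dim=16, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_feats + cond_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_feats))
    def forward(self, mix, cond):
        cond = cond.unsqueeze(1).expand(-1, mix.shape[1], -1)    # broadcast over frames
        return self.net(torch.cat([mix, cond], dim=-1))

def oct_loss(model, mix, target, conditions):
    """Optimal-condition-style loss: keep only the best equivalent condition per example."""
    losses = torch.stack([((model(mix, c) - target) ** 2).mean(dim=(1, 2))
                          for c in conditions])                  # (n_conditions, batch)
    best, _ = losses.min(dim=0)                                  # greedy choice per example
    return best.mean()

model = ConditionedSeparator()
mix, target = torch.randn(8, 100, 257), torch.randn(8, 100, 257)
conditions = [torch.randn(8, 16) for _ in range(3)]              # e.g. different query types
oct_loss(model, mix, target, conditions).backward()
```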

Cold Diffusion for Speech Enhancement

Nov 04, 2022
Hao Yen, François G. Germain, Gordon Wichern, Jonathan Le Roux

Diffusion models have recently shown promising results for difficult enhancement tasks such as the conditional and unconditional restoration of natural images and audio signals. In this work, we explore the possibility of leveraging a recently proposed advanced iterative diffusion model, namely cold diffusion, to recover clean speech signals from noisy signals. The unique mathematical properties of the cold diffusion sampling process can be utilized to restore high-quality samples from arbitrary degradations. Based on these properties, we propose an improved training algorithm and objective to help the model generalize better during the sampling process. We verify our proposed framework by investigating two model architectures. Experimental results on the benchmark speech enhancement dataset VoiceBank-DEMAND demonstrate the strong performance of the proposed approach compared to representative discriminative models and diffusion-based enhancement models.

* 5 pages, 1 figure, 1 table, 3 algorithms. Submitted to ICASSP 2023 
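
The sketch below illustrates the improved cold-diffusion sampling rule on a toy denoising setup, where the degradation linearly interpolates between clean and noisy speech and `restore_fn` stands in for the trained restoration network. The degradation operator, schedule, and oracle restoration used here are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def degrade(x0: np.ndarray, noisy: np.ndarray, alpha_t: float) -> np.ndarray:
    """Deterministic 'cold' degradation: interpolate clean speech towards the noisy signal."""
    return (1.0 - alpha_t) * x0 + alpha_t * noisy

def cold_diffusion_enhance(noisy: np.ndarray, restore_fn, alphas) -> np.ndarray:
    """Improved cold-diffusion sampling: x_{t-1} = x_t - D(x0_hat, t) + D(x0_hat, t-1)."""
    x = noisy.copy()                                     # start from the fully degraded signal
    for t in range(len(alphas) - 1, 0, -1):
        x0_hat = restore_fn(x, t)                        # network's clean-speech estimate
        x = x - degrade(x0_hat, noisy, alphas[t]) + degrade(x0_hat, noisy, alphas[t - 1])
    return x

# Toy run with an oracle restoration function standing in for the trained model.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noisy = clean + 0.3 * rng.standard_normal(16000)
alphas = np.linspace(0.0, 1.0, 11)                       # alpha_0 = 0 (clean) ... alpha_T = 1 (noisy)
enhanced = cold_diffusion_enhance(noisy, lambda x, t: clean, alphas)
print(np.mean((enhanced - clean) ** 2))                  # near zero with the oracle restorer
```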