
Publications by Gordon Wichern

Enhanced Reverberation as Supervision for Unsupervised Speech Separation

Aug 06, 2024

TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement

Aug 06, 2024

Sound Event Bounding Boxes

Jun 06, 2024

SMITIN: Self-Monitored Inference-Time INtervention for Generative Music Transformers

Apr 02, 2024

Why does music source separation benefit from cacophony?

Feb 28, 2024

NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization

Feb 27, 2024

NeuroHeed+: Improving Neuro-steered Speaker Extraction with Joint Auditory Attention Detection

Dec 12, 2023

Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction

Oct 30, 2023

Generation or Replication: Auscultating Audio Latent Diffusion Models

Oct 16, 2023

Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation

Sep 29, 2023