Joan Serrà

GASS: Generalizing Audio Source Separation with Large-scale Data

Sep 29, 2023
Jordi Pons, Xiaoyu Liu, Santiago Pascual, Joan Serrà

Universal source separation aims at separating the audio sources of an arbitrary mix, removing the constraint to operate on a specific domain like speech or music. Yet, the potential of universal source separation remains limited: most existing works focus on mixes with predominantly sound events, and small training datasets also constrain supervised learning. Here, we study a single general audio source separation (GASS) model trained to separate speech, music, and sound events in a supervised fashion with a large-scale dataset. We assess GASS models on a diverse set of tasks. Our strong in-distribution results show the feasibility of GASS models, and the competitive out-of-distribution performance in sound event and speech separation shows their generalization abilities. Yet, GASS models find it challenging to generalize when separating out-of-distribution cinematic and music content. We also fine-tune GASS models on each dataset and consistently outperform the ones without pre-training. All fine-tuned models (except the music separation one) obtain state-of-the-art results in their respective benchmarks.
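As a rough illustration of the supervised setup described above (not the actual GASS architecture, loss, or data pipeline, which the paper details), a separation model can be trained to regress the ground-truth stems of each mixture with an SNR-style loss; the model, shapes, and data below are placeholders.

```python
# Minimal sketch of a supervised source-separation training step (illustrative
# only; the real GASS model and recipe are described in the paper, not here).
import torch

def snr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative signal-to-noise ratio, averaged over batch and sources."""
    noise = estimate - target
    snr = 10 * torch.log10(
        (target.pow(2).sum(dim=-1) + eps) / (noise.pow(2).sum(dim=-1) + eps)
    )
    return -snr.mean()

# Hypothetical shapes: a batch of mono mixtures and their ground-truth stems.
batch, n_sources, n_samples = 4, 3, 1024
mixture = torch.randn(batch, n_samples)
targets = torch.randn(batch, n_sources, n_samples)

# Placeholder "model": any network mapping (batch, samples) -> (batch, sources, samples).
model = torch.nn.Linear(n_samples, n_sources * n_samples)
estimates = model(mixture).view(batch, n_sources, n_samples)

loss = snr_loss(estimates, targets)
loss.backward()
```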

Mono-to-stereo through parametric stereo generation

Jun 26, 2023
Joan Serrà, Davide Scaini, Santiago Pascual, Daniel Arteaga, Jordi Pons, Jeroen Breebaart, Giulio Cengarle

Generating a stereophonic presentation from a monophonic audio signal is a challenging open task, especially if the goal is to obtain a realistic spatial imaging with a specific panning of sound elements. In this work, we propose to convert mono to stereo by means of predicting parametric stereo (PS) parameters using both nearest neighbor and deep network approaches. In combination with PS, we also propose to model the task with generative approaches, which allows synthesizing multiple, equally plausible stereo renditions from the same mono signal. To achieve this, we consider both autoregressive and masked token modelling approaches. We provide evidence that the proposed PS-based models outperform a competitive classical decorrelation baseline and that, within a PS prediction framework, modern generative models outshine equivalent non-generative counterparts. Overall, our work positions both PS and generative modelling as strong and appealing methodologies for mono-to-stereo upmixing. A discussion of the limitations of these approaches is also provided.
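As a minimal sketch of the parametric-stereo idea referenced above, a decoder can pan a band-passed mono signal and mix in a decorrelated component according to a small set of parameters. The crude delay-based decorrelator and the parameter ranges below are assumptions for illustration, not the paper's PS formulation or its predictors; a mono-to-stereo model in this framework would predict such parameters per band and time frame rather than the waveform itself.

```python
import numpy as np

def render_band(mono_band: np.ndarray, pan: float, decorrelation: float) -> np.ndarray:
    """Render one frequency band of a mono signal to stereo.
    pan in [0, 1] (0 = fully left, 1 = fully right); decorrelation in [0, 1]."""
    # Stand-in decorrelator: a short delay; real PS decoders use all-pass filters.
    side = decorrelation * np.roll(mono_band, 8)
    left = np.sqrt(1.0 - pan) * mono_band + 0.5 * side
    right = np.sqrt(pan) * mono_band - 0.5 * side
    return np.stack([left, right])  # shape (2, n_samples)

# Example: a slightly left-panned, lightly decorrelated band of noise.
stereo = render_band(np.random.randn(48000), pan=0.3, decorrelation=0.2)
```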

* 7 pages, 1 figure; accepted for ISMIR23 

CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models

Jun 16, 2023
Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, Gautam Bhattacharya, Santiago Pascual, Joan Serrà, Taylor Berg-Kirkpatrick, Julian McAuley

Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled videos and pretrained language-vision models. We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model. At test time, we first explore performing a zero-shot modality transfer and condition the diffusion model with a CLIP-encoded text query. However, we observe a noticeable performance drop with respect to image queries. To close this gap, we further adopt a pretrained diffusion prior model to generate a CLIP image embedding given a CLIP text embedding. Our results show the effectiveness of the proposed method, and that the pretrained diffusion prior can reduce the modality transfer gap. While we focus on text-to-audio synthesis, the proposed model can also generate audio from image queries, and it shows competitive performance against a state-of-the-art image-to-audio synthesis model in a subjective listening test. This study offers a new direction for approaching text-to-audio synthesis that leverages the naturally occurring audio-visual correspondence in videos and the power of pretrained language-vision models.
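The two test-time conditioning paths described above can be sketched as follows; `clip_text_encoder`, `diffusion_prior`, and `audio_diffusion` are hypothetical stand-ins (simple placeholder modules), not the actual CLIPSonic components.

```python
# Sketch of the test-time conditioning flow: text -> CLIP text embedding,
# optionally translated by a diffusion prior into an image-like embedding,
# then used to condition an audio diffusion sampler. All modules are dummies.
import torch

emb_dim = 512
clip_text_encoder = torch.nn.Linear(300, emb_dim)    # stand-in for CLIP's text tower
diffusion_prior = torch.nn.Linear(emb_dim, emb_dim)  # maps text emb -> image-like emb
audio_diffusion = lambda cond: torch.randn(1, 64, 64) * cond.norm()  # stand-in sampler

text_features = torch.randn(1, 300)                  # pretend encoded text query
text_emb = clip_text_encoder(text_features)

# Zero-shot variant: condition directly on the CLIP text embedding.
spec_zero_shot = audio_diffusion(text_emb)

# Prior variant: first translate the text embedding into a CLIP image embedding,
# which the abstract reports reduces the modality transfer gap.
image_like_emb = diffusion_prior(text_emb)
spec_with_prior = audio_diffusion(image_like_emb)
```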

Full-band General Audio Synthesis with Score-based Diffusion

Oct 26, 2022
Santiago Pascual, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, Joan Serrà

Recent works have shown the capability of deep generative models to tackle general audio synthesis from a single label, producing a variety of impulsive, tonal, and environmental sounds. Such models operate on band-limited signals and, as a result of an autoregressive approach, they typically comprise pre-trained latent encoders and/or several cascaded modules. In this work, we propose a diffusion-based generative model for general audio synthesis, named DAG, which deals with full-band signals end-to-end in the waveform domain. Results show the superiority of DAG over existing label-conditioned generators in terms of both quality and diversity. More specifically, when compared to the state of the art, the band-limited and full-band versions of DAG achieve relative improvements of up to 40% and 65%, respectively. We believe DAG is flexible enough to accommodate different conditioning schemas while providing good quality synthesis.
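For context on what "score-based diffusion" entails, below is a generic annealed-noise sampler in the style of score-based models; the noise schedule, the placeholder score function, and the conditioning are assumptions for illustration, not DAG's actual design.

```python
# Generic Euler-Maruyama-style sampler for a score-based model: start from
# high-variance noise and denoise along a decreasing sigma schedule.
import math
import torch

def sample(score_fn, cond, n_samples: int, length: int, steps: int = 50,
           sigma_min: float = 0.01, sigma_max: float = 10.0) -> torch.Tensor:
    sigmas = torch.exp(torch.linspace(math.log(sigma_max), math.log(sigma_min), steps))
    x = sigma_max * torch.randn(n_samples, length)
    for i in range(steps - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        score = score_fn(x, sigma, cond)
        x = x + (sigma**2 - sigma_next**2) * score                            # drift
        x = x + torch.sqrt(sigma**2 - sigma_next**2) * torch.randn_like(x)    # noise
    return x

# Placeholder score network conditioned on a class label (here ignored).
score_fn = lambda x, sigma, cond: -x / (sigma**2 + 1.0)
waveform = sample(score_fn, cond=torch.tensor([3]), n_samples=1, length=16000)
```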

Quantitative Evidence on Overlooked Aspects of Enrollment Speaker Embeddings for Target Speaker Separation

Oct 26, 2022
Xiaoyu Liu, Xu Li, Joan Serrà

Single channel target speaker separation (TSS) aims at extracting a speaker's voice from a mixture of multiple talkers given an enrollment utterance of that speaker. A typical deep learning TSS framework consists of an upstream model that obtains enrollment speaker embeddings and a downstream model that performs the separation conditioned on the embeddings. In this paper, we look into several important but overlooked aspects of the enrollment embeddings, including the suitability of the widely used speaker identification embeddings, the introduction of the log-mel filterbank and self-supervised embeddings, and the embeddings' cross-dataset generalization capability. Our results show that the speaker identification embeddings could lose relevant information due to a sub-optimal metric, training objective, or common pre-processing. In contrast, both the filterbank and the self-supervised embeddings preserve the integrity of the speaker information, but the former consistently outperforms the latter in a cross-dataset evaluation. The competitive separation and generalization performance of the previously overlooked filterbank embedding is consistent across our study, which calls for future research on better upstream features.
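A minimal version of the log-mel filterbank enrollment embedding discussed above: pool a log-mel spectrogram of the enrollment utterance over time and condition the separator on the result. The pooling strategy and parameter values below are assumptions for illustration; the paper studies the features themselves, not this exact recipe.

```python
import numpy as np
import librosa

def logmel_enrollment_embedding(waveform: np.ndarray, sr: int = 16000,
                                n_mels: int = 80) -> np.ndarray:
    """Time-pooled log-mel features of an enrollment utterance."""
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)   # (n_mels, n_frames)
    return logmel.mean(axis=1)          # (n_mels,)

enrollment = np.random.randn(5 * 16000).astype(np.float32)  # 5 s of dummy audio
embedding = logmel_enrollment_embedding(enrollment)
# The downstream separation model would then be conditioned on `embedding`.
```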

* Submitted version to ICASSP 2023 

Universal Speech Enhancement with Score-based Diffusion

Jun 07, 2022
Joan Serrà, Santiago Pascual, Jordi Pons, R. Oguz Araz, Davide Scaini

Removing background noise from speech audio has been the subject of considerable research and effort, especially in recent years due to the rise of virtual communication and amateur sound recording. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverb, clipping, codec artifacts, problematic equalization, limited bandwidth, or inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement with mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that it achieves competitive objective scores with just 4-8 diffusion steps, despite not considering any particular strategy for fast sampling. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.
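For the mixture-density-network part of the conditioning network mentioned above, the usual training objective is the negative log-likelihood of a Gaussian mixture; the sketch below shows that objective with arbitrary dimensions and component counts, and is not the paper's actual conditioner.

```python
import torch

def mdn_nll(weights_logits, means, log_stds, target):
    """Gaussian mixture density network negative log-likelihood.
    weights_logits, means, log_stds: (batch, components); target: (batch, 1)."""
    log_weights = torch.log_softmax(weights_logits, dim=-1)
    dist = torch.distributions.Normal(means, log_stds.exp())
    log_probs = dist.log_prob(target)   # broadcasts over components
    return -torch.logsumexp(log_weights + log_probs, dim=-1).mean()

batch, components = 8, 5
loss = mdn_nll(torch.randn(batch, components),
               torch.randn(batch, components),
               torch.randn(batch, components),
               torch.randn(batch, 1))
```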

* 23 pages, 6 figures; includes appendix; examples in https://serrjoa.github.io/projects/universe/ 

On loss functions and evaluation metrics for music source separation

Feb 16, 2022
Enric Gusó, Jordi Pons, Santiago Pascual, Joan Serrà

We investigate which loss functions provide better separations by benchmarking an extensive set of them for music source separation. To that end, we first survey the most representative audio source separation losses we identified, and then consistently benchmark them in a controlled experimental setup. We also explore using such losses as evaluation metrics, by correlating them with the results of a subjective test. Based on the observation that the standard signal-to-distortion ratio metric can be misleading in some scenarios, we study alternative evaluation metrics based on the considered losses.
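As an example of the dual role a separation loss can play, here is scale-invariant SDR in its standard textbook form, usable as an evaluation metric or, negated, as a training loss; it is one member of the family of losses benchmarked in the paper, not the full suite.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio, in dB (higher is better)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference          # projection of the estimate on the reference
    error = estimate - target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(error**2) + eps))

ref = np.random.randn(16000)
est = ref + 0.1 * np.random.randn(16000)
print(si_sdr(est, ref))  # negate this value to use it as a training loss
```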

* Accepted to ICASSP 2022 

Upsampling layers for music source separation

Nov 23, 2021
Jordi Pons, Joan Serrà, Santiago Pascual, Giulio Cengarle, Daniel Arteaga, Davide Scaini

Upsampling artifacts are caused by problematic upsampling layers and by spectral replicas that emerge while upsampling. Also, depending on the upsampling layer used, such artifacts can either be tonal artifacts (additive high-frequency noise) or filtering artifacts (subtractive, attenuating some bands). In this work we investigate the practical implications of having upsampling artifacts in the resulting audio, by studying how different artifacts interact and assessing their impact on the models' performance. To that end, we benchmark a large set of upsampling layers for music source separation: different transposed and subpixel convolution setups, different interpolation upsamplers (including two novel layers based on stretch and sinc interpolation), and different wavelet-based upsamplers (including a novel learnable wavelet layer). Our results show that filtering artifacts, associated with interpolation upsamplers, are perceptually preferable, even if they tend to achieve worse objective scores.
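A quick way to reproduce the flavor of this comparison: run white noise through a stride-2 transposed convolution (prone to tonal/checkerboard artifacts) and through linear interpolation followed by a convolution (prone to filtering artifacts), then inspect the spectra. The layer settings below are arbitrary and only mirror the idea of the benchmark, not its exact configurations.

```python
import torch

x = torch.randn(1, 1, 4096)  # white noise, (batch, channels, samples)

# Transposed convolution upsampler (randomly initialized).
transposed = torch.nn.ConvTranspose1d(1, 1, kernel_size=4, stride=2, padding=1)
y_transposed = transposed(x)

# Interpolation upsampler: linear upsampling followed by a convolution.
upsample = torch.nn.Upsample(scale_factor=2, mode="linear", align_corners=False)
conv = torch.nn.Conv1d(1, 1, kernel_size=4, padding=1)
y_interp = conv(upsample(x))

# Inspect the magnitude spectra: periodic peaks suggest tonal artifacts,
# attenuated bands suggest filtering artifacts.
spec_transposed = torch.fft.rfft(y_transposed.squeeze()).abs()
spec_interp = torch.fft.rfft(y_interp.squeeze()).abs()
```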

* Demo page: http://www.jordipons.me/apps/upsamplers/ 

Assessing Algorithmic Biases for Musical Version Identification

Sep 30, 2021
Furkan Yesiler, Marius Miron, Joan Serrà, Emilia Gómez

Version identification (VI) systems now offer accurate and scalable solutions for detecting different renditions of a musical composition, allowing the use of these systems in industrial applications and throughout the wider music ecosystem. Such use can have an important impact on various stakeholders regarding recognition and financial benefits, including how royalties are circulated for digital rights management. In this work, we take a step toward acknowledging this impact and consider VI systems as socio-technical systems rather than isolated technologies. We propose a framework for quantifying performance disparities across 5 systems and 6 relevant side attributes: gender, popularity, country, language, year, and prevalence. We also consider 3 main stakeholders for this particular information retrieval use case: the performing artists of query tracks, those of reference (original) tracks, and the composers. By categorizing the recordings in our dataset using such attributes and stakeholders, we analyze whether the considered VI systems show any implicit biases. We find signs of disparities in identification performance for most of the groups we include in our analyses. Moreover, we also find that learning- and rule-based systems behave differently for some attributes, which suggests an additional dimension to consider along with accuracy and scalability when evaluating VI systems. Lastly, we share our dataset with attribute annotations to encourage VI researchers to take these aspects into account while building new systems.
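The disparity measurements described above boil down to aggregating an accuracy-style metric per attribute group and comparing groups; the toy example below (made-up attribute values, scores, and metric name) only shows the shape of that computation, not the paper's framework.

```python
import pandas as pd

# Hypothetical per-query results annotated with one side attribute.
results = pd.DataFrame({
    "query_artist_gender": ["female", "male", "female", "male", "male", "female"],
    "average_precision":   [0.61,      0.78,   0.55,     0.81,   0.74,   0.58],
})

per_group = results.groupby("query_artist_gender")["average_precision"].mean()
disparity = per_group.max() - per_group.min()
print(per_group)
print(f"absolute disparity: {disparity:.3f}")
```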

Audio-based Musical Version Identification: Elements and Challenges

Sep 06, 2021
Furkan Yesiler, Guillaume Doras, Rachel M. Bittner, Christopher J. Tralie, Joan Serrà

In this article, we aim to provide a review of the key ideas and approaches proposed in 20 years of scientific literature around musical version identification (VI) research and connect them to current practice. For more than a decade, VI systems suffered from the accuracy-scalability trade-off, with attempts to increase accuracy that typically resulted in cumbersome, non-scalable systems. Recent years, however, have witnessed the rise of deep learning-based approaches that take a step toward bridging the accuracy-scalability gap, yielding systems that can realistically be deployed in industrial applications. Although this trend positively influences the number of researchers and institutions working on VI, it may also result in obscuring the literature before the deep learning era. To appreciate two decades of novel ideas in VI research and to facilitate building better systems, we now review some of the successful concepts and applications proposed in the literature and study their evolution throughout the years.

* Accepted to be published in IEEE Signal Processing Magazine 