Cem Subakan

CL-MASR: A Continual Learning Benchmark for Multilingual ASR

Oct 25, 2023
Luca Della Libera, Pooneh Mousavi, Salah Zaiem, Cem Subakan, Mirco Ravanelli

Modern multilingual automatic speech recognition (ASR) systems like Whisper have made it possible to transcribe audio in multiple languages with a single model. However, current state-of-the-art ASR models are typically evaluated on individual languages or in a multi-task setting, overlooking the challenge of continually learning new languages. There is insufficient research on how to add new languages without losing valuable information from previous data. Furthermore, existing continual learning benchmarks focus mostly on vision and language tasks, leaving continual learning for multilingual ASR largely unexplored. To bridge this gap, we propose CL-MASR, a benchmark designed for studying multilingual ASR in a continual learning setting. CL-MASR provides a diverse set of continual learning methods implemented on top of large-scale pretrained ASR models, along with common metrics to assess the effectiveness of learning new languages while addressing the issue of catastrophic forgetting. To the best of our knowledge, CL-MASR is the first continual learning benchmark for the multilingual ASR task. The code is available at https://github.com/speechbrain/benchmarks.
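For readers new to the setting, the sketch below illustrates in Python, with hypothetical variable names rather than the benchmark's actual API, how typical continual learning metrics such as average WER and backward transfer (forgetting) can be computed from a matrix of per-language error rates.

```python
# Illustrative sketch (not the benchmark's actual API): common continual learning
# metrics computed from a matrix of word error rates, where wer[i, j] is the WER
# on language j after training on the first i+1 languages.
import numpy as np

def average_wer(wer: np.ndarray) -> float:
    """Average WER over all languages seen so far, measured at the end of training."""
    T = wer.shape[0]
    return float(wer[T - 1, :T].mean())

def backward_transfer(wer: np.ndarray) -> float:
    """Average change in WER on earlier languages after learning later ones.
    Positive values indicate forgetting (WER got worse)."""
    T = wer.shape[0]
    if T < 2:
        return 0.0
    deltas = [wer[T - 1, j] - wer[j, j] for j in range(T - 1)]
    return float(np.mean(deltas))

# Example with a hypothetical 3-language sequence.
wer = np.array([[10.0,  0.0,  0.0],
                [12.0, 15.0,  0.0],
                [14.0, 18.0, 20.0]])
print(average_wer(wer), backward_transfer(wer))
```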

* 16 pages, 5 figures, 5 tables 

Audio Editing with Non-Rigid Text Prompts

Oct 19, 2023
Francesco Paissan, Zhepei Wang, Mirco Ravanelli, Paris Smaragdis, Cem Subakan

In this paper, we explore audio editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and inpainting. We show quantitatively and qualitatively that the resulting edits outperform Audio-LDM, a recently released text-prompted audio generation model. Qualitative inspection further indicates that the edits produced by our approach remain more faithful to the input audio, preserving the original onsets and offsets of the audio events.

CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice

May 29, 2023
Juan Zuluaga-Gomez, Sara Ahmed, Danielius Visockas, Cem Subakan

Despite recent advancements in Automatic Speech Recognition (ASR), the recognition of accented speech remains a dominant problem. Research has shown that integrating accent information into a larger ASR framework can help mitigate accented speech errors and lead to more inclusive ASR systems. We address multilingual accent classification with the ECAPA-TDNN and Wav2Vec 2.0/XLSR architectures, which have been shown to perform well on a variety of speech-related downstream tasks. We introduce a simple-to-follow recipe aligned with the SpeechBrain toolkit for accent classification based on Common Voice 7.0 (English) and Common Voice 11.0 (Italian, German, and Spanish). Furthermore, we establish a new state of the art for English accent classification, reaching up to 95% accuracy. We also study the internal categorization of the Wav2Vec 2.0 embeddings through t-SNE, noting that there is a level of clustering based on phonological similarity. (Our recipe is open-source in the SpeechBrain toolkit, see: https://github.com/speechbrain/speechbrain/tree/develop/recipes)
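As a rough illustration of how such a recipe can be used at inference time, the snippet below sketches accent classification with SpeechBrain's pretrained EncoderClassifier interface; the model identifier and audio file name are assumptions, not necessarily the exact artifacts released with this recipe.

```python
# Minimal inference sketch using SpeechBrain's pretrained EncoderClassifier
# interface; the model source and audio path are placeholders.
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="Jzuluaga/accent-id-commonaccent_ecapa",  # assumed Hugging Face model id
    savedir="pretrained_models/accent-id",
)

# classify_file returns posterior probabilities, the best score, its index,
# and the decoded label(s).
out_prob, score, index, text_lab = classifier.classify_file("example_english.wav")
print(f"Predicted accent: {text_lab[0]} (score: {score.item():.3f})")

# The same object exposes encode_batch; its embeddings can be projected with
# t-SNE to inspect clustering by phonological similarity.
```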

* To appear in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2023 

CryCeleb: A Speaker Verification Dataset Based on Infant Cry Sounds

May 15, 2023
David Budaghyan, Arsenii Gorin, Cem Subakan, Charles C. Onu

This paper describes the Ubenwa CryCeleb dataset, a labeled collection of infant cries, and the accompanying CryCeleb 2023 task, a public speaker verification challenge based on infant cry sounds. We release more than 6 hours of manually segmented cry sounds from 786 newborns for academic usage, to encourage research in infant cry analysis.
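A minimal verification baseline for this kind of task might look like the following sketch, which embeds two cry segments with a generic pretrained speaker model and compares them with cosine similarity; the model source and file paths are placeholders, and this is not the official challenge code.

```python
# Illustrative verification baseline for cry pairs (not the official challenge
# code): embed two segments with a pretrained speaker model and compare them
# with cosine similarity. Assumes 16 kHz mono audio.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

embedder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",  # generic speaker model, used as a stand-in
    savedir="pretrained_models/ecapa",
)

def embed(path: str) -> torch.Tensor:
    signal, _ = torchaudio.load(path)
    return embedder.encode_batch(signal).squeeze()

def verify(path_a: str, path_b: str, threshold: float = 0.5) -> bool:
    score = torch.nn.functional.cosine_similarity(embed(path_a), embed(path_b), dim=0)
    return bool(score > threshold)

# Placeholder paths for manually segmented cry sounds from the dataset.
print(verify("cry_baby_A_1.wav", "cry_baby_A_2.wav"))
```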

Unsupervised Improvement of Audio-Text Cross-Modal Representations

May 05, 2023
Zhepei Wang, Cem Subakan, Krishna Subramani, Junkai Wu, Tiago Tavares, Fabio Ayres, Paris Smaragdis

Recent advances in using language models to obtain cross-modal audio-text representations have overcome the limitations of conventional training approaches that use predefined labels. This has allowed the community to make progress in tasks like zero-shot classification, which would otherwise not be possible. However, learning such representations requires a large amount of human-annotated audio-text pairs. In this paper, we study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio. We explore domain-unspecific and domain-specific curation methods to create audio-text pairs that we use to further improve the model. We also show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance on downstream sound event classification or acoustic scene classification tasks.
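One possible reading of a soft-labeled contrastive objective is sketched below in PyTorch: the usual one-hot audio-text targets are replaced with soft targets, for instance derived from text-text similarity. This is an illustration of the idea, not necessarily the paper's exact formulation.

```python
# One possible reading of a soft-labeled contrastive loss (not necessarily the
# paper's formulation): cross-entropy against soft targets instead of the
# one-hot diagonal of the audio-text similarity matrix.
import torch
import torch.nn.functional as F

def soft_contrastive_loss(audio_emb, text_emb, soft_targets, temperature=0.07):
    """audio_emb, text_emb: [B, D]; soft_targets: [B, B] with rows summing to 1."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # [B, B] similarity matrix
    # In practice the loss is usually symmetrized over both retrieval directions.
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Example: soft targets derived from text-text similarity instead of one-hot labels.
B, D = 8, 512
audio, text = torch.randn(B, D), torch.randn(B, D)
text_n = F.normalize(text, dim=-1)
targets = F.softmax(text_n @ text_n.t() / 0.1, dim=-1)
print(soft_contrastive_loss(audio, text, targets).item())
```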

* Submitted to WASPAA 2023 

Self-supervised learning for infant cry analysis

May 02, 2023
Arsenii Gorin, Cem Subakan, Sajjad Abdoli, Junhao Wang, Samantha Latremouille, Charles Onu

In this paper, we explore self-supervised learning (SSL) for analyzing a first-of-its-kind database of cry recordings containing clinical indications for more than a thousand newborns. Specifically, we target cry-based detection of neurological injury as well as identification of cry triggers such as pain, hunger, and discomfort. Annotating a large database in the medical setting is expensive and time-consuming, typically requiring the collaboration of several experts over years. Leveraging large amounts of unlabeled audio data to learn useful representations can lower the cost of building robust models and, ultimately, clinical solutions. In this work, we experiment with self-supervised pre-training of a convolutional neural network on large audio datasets. We show that pre-training with an SSL contrastive loss (SimCLR) performs significantly better than supervised pre-training for both neurological injury detection and cry-trigger identification. In addition, we demonstrate further performance gains through SSL-based domain adaptation using unlabeled infant cries. We also show that such SSL-based pre-training adapted to cry sounds reduces the amount of labeled data needed by the overall system.
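For reference, the SimCLR objective mentioned above is the NT-Xent contrastive loss; a minimal PyTorch sketch is shown below, with illustrative batch and embedding sizes rather than the paper's actual configuration.

```python
# Minimal sketch of the SimCLR (NT-Xent) contrastive objective on two augmented
# views of the same batch; dimensions and inputs are illustrative only.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: [B, D] projections of two augmented views of the same examples."""
    B = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)  # [2B, D]
    sim = z @ z.t() / temperature                        # [2B, 2B] pairwise similarities
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # The positive for example i is its other view at index i + B (mod 2B).
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(B)])
    return F.cross_entropy(sim, targets)

# Example usage with random "projections" standing in for encoder outputs.
z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(nt_xent_loss(z1, z2).item())
```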

* Accepted to IEEE ICASSP 2023 workshop Self-supervision in Audio, Speech and Beyond 

Posthoc Interpretation via Quantization

Mar 22, 2023
Cem Subakan, Francesco Paissan, Mirco Ravanelli

In this paper, we introduce a new approach, called "Posthoc Interpretation via Quantization (PIQ)", for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. We evaluated our method through quantitative and qualitative studies and found that PIQ generates interpretations that participants in our user studies found easier to understand than several other interpretation methods in the literature.
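The core mechanism can be pictured as a vector-quantization bottleneck that selects the nearest code from a class-specific codebook and passes gradients with a straight-through estimator; the sketch below is a generic simplified illustration, not the authors' implementation.

```python
# Generic illustration of a class-specific vector-quantization bottleneck with a
# straight-through estimator; not the authors' implementation.
import torch
import torch.nn as nn

class ClassCodebookQuantizer(nn.Module):
    def __init__(self, num_classes: int, codebook_size: int, dim: int):
        super().__init__()
        # One codebook of learnable vectors per class.
        self.codebooks = nn.Parameter(torch.randn(num_classes, codebook_size, dim))

    def forward(self, h: torch.Tensor, class_idx: int) -> torch.Tensor:
        """h: [N, dim] classifier representations; returns quantized [N, dim]."""
        codebook = self.codebooks[class_idx]   # [K, dim]
        dists = torch.cdist(h, codebook)       # [N, K] distances to each code
        codes = dists.argmin(dim=-1)           # nearest code per vector
        q = codebook[codes]                    # [N, dim] quantized representations
        # Straight-through estimator: gradients flow to h as if quantization were identity.
        return h + (q - h).detach()

quantizer = ClassCodebookQuantizer(num_classes=10, codebook_size=64, dim=256)
h = torch.randn(32, 256)
print(quantizer(h, class_idx=3).shape)
```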

* Equal contribution 

Resource-Efficient Separation Transformer

Jun 19, 2022
Cem Subakan, Mirco Ravanelli, Samuele Cornell, Frédéric Lepoutre, François Grondin

Transformers have recently achieved state-of-the-art performance in speech separation. These models, however, are computationally demanding and require many learnable parameters. This paper explores Transformer-based speech separation with a reduced computational cost. Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture that reduces the computational burden in two ways. First, it uses non-overlapping blocks in the latent space. Second, it operates on compact latent summaries calculated from each chunk. The RE-SepFormer achieves competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings. Remarkably, it scales significantly better than previous Transformer- and RNN-based architectures in terms of memory and inference time, making it more suitable for processing long mixtures.
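The two ideas described above can be sketched as follows: the latent sequence is split into non-overlapping chunks, and inter-chunk modelling operates only on one compact summary vector per chunk. This is a simplified illustration under assumed shapes, not the released RE-SepFormer code.

```python
# Simplified illustration (not the released RE-SepFormer code): non-overlapping
# chunking of the latent sequence, followed by attention over per-chunk summaries.
import torch
import torch.nn as nn

def chunk_nonoverlapping(x: torch.Tensor, chunk_size: int) -> torch.Tensor:
    """x: [B, T, D] -> [B, num_chunks, chunk_size, D]; T is padded to a multiple of chunk_size."""
    B, T, D = x.shape
    pad = (-T) % chunk_size
    x = nn.functional.pad(x, (0, 0, 0, pad))
    return x.view(B, (T + pad) // chunk_size, chunk_size, D)

class SummaryAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, chunks: torch.Tensor) -> torch.Tensor:
        """chunks: [B, C, S, D]; inter-chunk modelling uses only one mean summary per chunk."""
        summaries = chunks.mean(dim=2)  # [B, C, D]
        out, _ = self.attn(summaries, summaries, summaries)
        return out

x = torch.randn(2, 1000, 64)                     # latent sequence from an encoder
chunks = chunk_nonoverlapping(x, chunk_size=50)  # [2, 20, 50, 64]
print(SummaryAttention(64)(chunks).shape)        # [2, 20, 64]
```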

* Submitted to IEEE Signal Processing Letters 

Learning Representations for New Sound Classes With Continual Self-Supervised Learning

May 15, 2022
Zhepei Wang, Cem Subakan, Xilin Jiang, Junkai Wu, Efthymios Tzinis, Mirco Ravanelli, Paris Smaragdis

In this paper, we present a self-supervised learning framework for continually learning representations for new sound classes. The proposed system relies on a continually trained neural encoder that is trained with similarity-based learning objectives without using labels. We show that representations learned with the proposed method generalize better and are less susceptible to catastrophic forgetting than fully-supervised approaches. Remarkably, our technique does not store past data or models and is more computationally efficient than distillation-based methods. To accurately assess the system performance, in addition to using existing protocols, we propose two realistic evaluation protocols that use only a small amount of labeled data to simulate practical use cases.
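The overall protocol implied by the abstract can be sketched as a loop that updates the encoder on each new unlabeled task with a similarity-based loss and no replay buffer, followed by a probe evaluation on a small labeled subset; all names below are placeholders rather than the paper's code.

```python
# Schematic sketch of the continual self-supervised protocol (placeholder names,
# not the paper's code): the encoder is updated on each new unlabeled task with a
# similarity-based loss and no stored past data, then evaluated with a probe
# trained on a small labeled subset.
import torch

def continual_ssl(encoder, similarity_loss, unlabeled_tasks, evaluate_probe, lr=1e-4, epochs=1):
    """unlabeled_tasks: iterable of DataLoaders, one per new set of sound classes,
    each yielding two augmented views of the same clips.
    evaluate_probe: callable that trains/evaluates a probe on a small labeled subset."""
    optimizer = torch.optim.Adam(encoder.parameters(), lr=lr)
    for task_id, loader in enumerate(unlabeled_tasks):
        for _ in range(epochs):
            for view_a, view_b in loader:
                loss = similarity_loss(encoder(view_a), encoder(view_b))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        # No past data or models are stored; only the current task's unlabeled audio is used.
        print(f"after task {task_id}: probe accuracy = {evaluate_probe(encoder):.3f}")
```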

* Submitted to IEEE Signal Processing Letters 