Mirco Ravanelli

Audio Editing with Non-Rigid Text Prompts

Oct 19, 2023

Francesco Paissan, Zhepei Wang, Mirco Ravanelli, Paris Smaragdis, Cem Subakan

Abstract: In this paper, we explore audio editing with non-rigid text edits. We show that the proposed editing pipeline is able to create audio edits that remain faithful to the input audio. We explore text prompts that perform addition, style transfer, and in-painting. We show quantitatively and qualitatively that the edits outperform Audio-LDM, a recently released text-prompted audio generation model. Qualitative inspection of the results indicates that the edits given by our approach remain more faithful to the input audio in terms of keeping the original onsets and offsets of the audio events.

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Oct 18, 2023

Dominique Beaini, Shenyang Huang, Joao Alex Cunha, Zhiyi Li, Gabriela Moisescu-Pareja, Oleksandr Dymov, Samuel Maddrell-Mander, Callum McLean, Frederik Wenkel, Luis Müller(+25 more)

Abstract: Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library, which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point for multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets improves when also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.
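
The abstract mentions "sparsely defined tasks", meaning that most molecules carry labels for only some tasks. One common way to train on such data is to compute the loss over labeled entries only. The sketch below illustrates that idea; it is not taken from Graphium, and the function name `masked_mse` and the convention of marking missing labels with NaN are assumptions made for illustration.

```python
import numpy as np

def masked_mse(predictions, targets):
    """Mean squared error over labeled entries only.

    Assumption: a NaN in `targets` marks a (molecule, task) pair without a label.
    """
    mask = ~np.isnan(targets)          # True where a label exists
    if not mask.any():
        return 0.0                     # nothing to learn from in this batch
    diff = predictions[mask] - targets[mask]
    return float((diff ** 2).mean())

# Two molecules, two tasks; the second task is unlabeled for both molecules.
preds = np.array([[1.0, 2.0], [3.0, 4.0]])
labels = np.array([[1.0, np.nan], [5.0, np.nan]])
loss = masked_mse(preds, labels)
```

Only the two labeled entries contribute to `loss`, so unlabeled tasks neither penalize nor reward the model.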

Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads

Aug 28, 2023

Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli

Abstract: Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most proposals rely upon a single downstream architecture that maps the frozen SSL representations to the task labels. This study examines how benchmarking results are affected by changes in the probing head architecture. Interestingly, we found that altering the downstream architecture structure leads to significant fluctuations in the performance ranking of the evaluated models. Against common practices in speech SSL benchmarking, we evaluate larger-capacity probing heads, showing their impact on performance, inference costs, generalization and multi-level feature exploitation.

* 11 Pages

Generalization Limits of Graph Neural Networks in Identity Effects Learning

Jun 30, 2023

Giuseppe Alessio D'Inverno, Simone Brugiapaglia, Mirco Ravanelli

Abstract: Graph Neural Networks (GNNs) have emerged as a powerful tool for data-driven learning on various graph domains. They are usually based on a message-passing mechanism and have gained increasing popularity for their intuitive formulation, which is closely linked to the Weisfeiler-Lehman (WL) test for graph isomorphism, to which they have been proven equivalent in terms of expressive power. In this work, we establish new generalization properties and fundamental limits of GNNs in the context of learning so-called identity effects, i.e., the task of determining whether an object is composed of two identical components or not. Our study is motivated by the need to understand the capabilities of GNNs when performing simple cognitive tasks, with potential applications in computational linguistics and chemistry. We analyze two case studies: (i) two-letter words, for which we show that GNNs trained via stochastic gradient descent are unable to generalize to unseen letters when utilizing orthogonal encodings like one-hot representations; (ii) dicyclic graphs, i.e., graphs composed of two cycles, for which we present positive existence results leveraging the connection between GNNs and the WL test. Our theoretical analysis is supported by an extensive numerical study.

* 13 pages, 10 figures
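
As a rough illustration of the two-letter-word case study in the abstract above, the identity-effect task can be set up with one-hot letter encodings. This is a minimal sketch, not the authors' code: the names `one_hot` and `make_pairs`, the choice to hold out the last two letters, and the encoding of a word as one concatenated vector (rather than a two-node graph) are all assumptions made for illustration.

```python
import string
import numpy as np

ALPHABET = string.ascii_uppercase  # 26 letters

def one_hot(letter):
    """Return the orthogonal (one-hot) encoding of a single letter."""
    vec = np.zeros(len(ALPHABET))
    vec[ALPHABET.index(letter)] = 1.0
    return vec

def make_pairs(letters):
    """Build all two-letter words over `letters` with identity labels."""
    words, labels = [], []
    for a in letters:
        for b in letters:
            words.append(np.concatenate([one_hot(a), one_hot(b)]))
            labels.append(1 if a == b else 0)  # 1 = two identical components
    return np.stack(words), np.array(labels)

# Hypothetical split: train on A..X, test on words built from the unseen Y and Z.
train_x, train_y = make_pairs(ALPHABET[:-2])
test_x, test_y = make_pairs(ALPHABET[-2:])
```

With one-hot encodings, every test input is orthogonal to every training input, which is the setting in which the abstract reports that generalization to unseen letters fails.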

Speech Emotion Diarization: Which Emotion Appears When?

Jun 22, 2023

Yingzhi Wang, Mirco Ravanelli, Alaa Nfissi, Alya Yacoubi

Abstract: Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be considered as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question of "Who speaks when?", Speech Emotion Diarization answers the question of "Which emotion appears when?". To facilitate the evaluation of the performance and establish a common benchmark for researchers, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually-annotated boundaries of emotion segments within the utterance. We provide competitive baselines and open-source the code and the pre-trained models.
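
To make the "which emotion appears when" output concrete, the sketch below turns a sequence of per-frame emotion labels into segments with start and end times. It assumes a frame-level classifier already exists; the function name `frames_to_segments` and the 20 ms default frame duration are hypothetical choices, not details from the paper.

```python
from itertools import groupby

def frames_to_segments(frame_labels, frame_duration=0.02):
    """Merge runs of identical frame labels into (label, start, end) segments.

    `frame_duration` is in seconds; the default value is an assumption.
    """
    segments = []
    index = 0  # index of the first frame of the current run
    for label, run in groupby(frame_labels):
        length = len(list(run))
        start = index * frame_duration
        end = (index + length) * frame_duration
        segments.append((label, round(start, 3), round(end, 3)))
        index += length
    return segments

# Three neutral frames followed by two angry frames, 0.5 s per frame.
segments = frames_to_segments(["neutral"] * 3 + ["angry"] * 2, frame_duration=0.5)
```

Each returned tuple carries the temporal boundaries of one emotion event, which is the kind of annotation the abstract describes for ZED.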

RescueSpeech: A German Corpus for Speech Recognition in Search and Rescue Domain

Jun 06, 2023

Sangeet Sagar, Mirco Ravanelli, Bernd Kiefer, Ivana Kruijff Korbayova, Josef van Genabith

Abstract: Despite recent advancements in speech recognition, there are still difficulties in accurately transcribing conversational and emotional speech in noisy and reverberant acoustic environments. This poses a particular challenge in the search and rescue (SAR) domain, where transcribing conversations among rescue team members is crucial to support real-time decision-making. The scarcity of speech data and associated background noise in SAR scenarios make it difficult to deploy robust speech recognition systems. To address this issue, we have created and made publicly available a German speech dataset called RescueSpeech. This dataset includes real speech recordings from simulated rescue exercises. Additionally, we have released competitive training recipes and pre-trained models. Our study indicates that the current level of performance achieved by state-of-the-art methods is still far from being acceptable.

Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

Jun 01, 2023

Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli

Abstract: Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need for, and the rise of, extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has been growing, most rely upon a single decoding architecture that maps the frozen SSL representations to the downstream labels. This work investigates the robustness of such benchmarking results to changes in the decoder architecture. Interestingly, it appears that varying the architecture of the downstream decoder leads to significant variations in the leaderboards of most tasks. Concerningly, our study reveals that benchmarking using limited decoders may cause a counterproductive increase in the sizes of the developed SSL models.

* INTERSPEECH 2023
* 6 pages

Simulated Annealing in Early Layers Leads to Better Generalization

Apr 10, 2023

Amirmohammad Sarfi, Zahra Karimpour, Muawiz Chaudhary, Nasir M. Khalid, Mirco Ravanelli, Sudhir Mudur, Eugene Belilovsky

Abstract: Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in early layers by periodically re-initializing the last few layers of the network. Our principal innovation in this work is to use Simulated annealing in EArly Layers (SEAL) of the network in place of re-initialization of later layers. Essentially, later layers go through the normal gradient descent process, while the early layers go through short stints of gradient ascent followed by gradient descent. Extensive experiments on the popular Tiny-ImageNet dataset benchmark and a series of transfer learning and few-shot learning tasks show that we outperform LLF by a significant margin. We further show that, compared to normal training, LLF features, although improving on the target task, degrade the transfer learning performance across all datasets we explored. In comparison, our method outperforms LLF across the same target datasets by a large margin. We also show that the prediction depth of our method is significantly lower than that of LLF and normal training, indicating on average better prediction performance.
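
The alternation described in the abstract (later layers always descend, early layers periodically ascend for a short stint) can be sketched as a per-layer sign on the update. This is an illustrative sketch only, not the SEAL implementation: the names `update_sign` and `sgd_step`, the values of `period` and `ascent_steps`, and the choice to place the ascent stint at the start of each period are all assumptions, since the abstract does not give the schedule.

```python
def update_sign(step, layer_is_early, period=100, ascent_steps=10):
    """Return -1.0 for a gradient-ascent step and +1.0 for a descent step.

    Assumed schedule: the first `ascent_steps` steps of every `period`
    are ascent steps, and only for early layers.
    """
    in_ascent_phase = (step % period) < ascent_steps
    return -1.0 if (layer_is_early and in_ascent_phase) else 1.0

def sgd_step(param, grad, lr, sign):
    """Plain SGD update; sign=+1.0 descends the loss, sign=-1.0 ascends it."""
    return param - sign * lr * grad

# At step 5 an early layer ascends while a later layer keeps descending.
early = sgd_step(1.0, 2.0, lr=0.1, sign=update_sign(5, layer_is_early=True))
late = sgd_step(1.0, 2.0, lr=0.1, sign=update_sign(5, layer_is_early=False))
```

In a real training loop the same sign would be applied to every parameter tensor belonging to the early layers.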

Posthoc Interpretation via Quantization

Mar 22, 2023

Cem Subakan, Francesco Paissan, Mirco Ravanelli

Abstract: In this paper, we introduce a new approach, called "Posthoc Interpretation via Quantization (PIQ)", for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. We evaluated our method through quantitative and qualitative studies and found that PIQ generates interpretations that are more easily understood by participants in our user studies than those of several other interpretation methods in the literature.

* * Equal contribution
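
The core operation behind a class-specific codebook can be illustrated with standard nearest-neighbour vector quantization: each latent vector is replaced by the closest entry of the codebook that belongs to the predicted class. The sketch below shows only that lookup; it does not reproduce the PIQ architecture or its training, and the function name `quantize` and the array layout `(num_classes, K, D)` are assumptions made for illustration.

```python
import numpy as np

def quantize(latents, codebooks, class_id):
    """Replace each latent vector by its nearest entry in the codebook of `class_id`.

    latents:   (N, D) array of classifier representations
    codebooks: (num_classes, K, D) array, one codebook per class (assumed layout)
    """
    codebook = codebooks[class_id]                                   # (K, D)
    # Squared Euclidean distance between every latent and every code vector.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    indices = dists.argmin(axis=1)                                   # (N,)
    return codebook[indices], indices

codebooks = np.array([[[0.0, 0.0], [1.0, 1.0]],    # codebook for class 0
                      [[5.0, 5.0], [9.0, 9.0]]])   # codebook for class 1
latents = np.array([[0.9, 1.2], [0.1, -0.2]])
quantized, indices = quantize(latents, codebooks, class_id=0)
```

Because the lookup is restricted to one class's codebook, the quantized representation can only be built from codes associated with that class, which is the bottleneck effect the abstract describes.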

Fine-tuning Strategies for Faster Inference using Speech Self-Supervised Models: A Comparative Study

Mar 12, 2023

Salah Zaiem, Robin Algayres, Titouan Parcollet, Slim Essid, Mirco Ravanelli

Abstract: Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance may come at the cost of longer inference times. This article explores different approaches that may be deployed during fine-tuning to reduce the computations needed in the SSL encoder, leading to faster inference. We adapt a number of existing techniques to common ASR settings and benchmark them, reporting performance drops and gains in inference time. Interestingly, we found that, given enough downstream data, a simple downsampling of the input sequences outperforms the other methods with both low performance drops and high computational savings, reducing computations by 61.3% with a WER increase of only 0.81. Finally, we analyze the robustness of the comparison to changes in dataset conditions, revealing sensitivity to dataset size.

* Submitted to ICASSP "Self-supervision in Audio, Speech and Beyond" workshop
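
The "simple downsampling of the input sequences" mentioned in the abstract shortens the sequence the SSL encoder has to process. The sketch below shows one possible form, averaging non-overlapping windows; the abstract does not say which downsampling method the authors use, so the averaging scheme and the function name `downsample` are assumptions.

```python
import numpy as np

def downsample(signal, factor):
    """Average non-overlapping windows of `factor` samples.

    Trailing samples that do not fill a complete window are dropped.
    """
    usable = len(signal) - len(signal) % factor
    return signal[:usable].reshape(-1, factor).mean(axis=1)

# A 10-sample signal becomes 5 samples with factor 2.
shorter = downsample(np.arange(10.0), factor=2)
```

A sequence shortened by a factor of `factor` yields proportionally fewer encoder frames, which is where the computational savings come from.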
