Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Geoffroy Peeters

IP Paris

S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

Apr 27, 2026

Mohammed Ali El Adlouni, Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid

Abstract:General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling. Code is available here: https://github.com/MedAliAdlouni/ssondo.

* Accepted at IEEE ICASSP 2026. 5 pages, 2 figures, 3 tables. Equal contribution by first two authors. Code: https://github.com/MedAliAdlouni/ssondo | Models: https://huggingface.co/mohammedali2501/ssondo | Package: https://pypi.org/project/ssondo/

Via

Access Paper or Ask Questions

S-PRESSO: Ultra Low Bitrate Sound Effect Compression With Diffusion Autoencoders And Offline Quantization

Feb 16, 2026

Zineb Lahrichi, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters

Abstract:Neural audio compression models have recently achieved extreme compression rates, enabling efficient latent generative modeling. Conversely, latent generative models have been applied to compression, pushing the limits of continuous and discrete approaches. However, existing methods remain constrained to low-resolution audio and degrade substantially at very low bitrates, where audible artifacts are prominent. In this paper, we present S-PRESSO, a 48kHz sound effect compression model that produces both continuous and discrete embeddings at ultra-low bitrates, down to 0.096 kbps, via offline quantization. Our model relies on a pretrained latent diffusion model to decode compressed audio embeddings learned by a latent encoder. Leveraging the generative priors of the diffusion decoder, we achieve extremely low frame rates, down to 1Hz (750x compression rate), producing convincing and realistic reconstructions at the cost of exact fidelity. Despite operating at high compression rates, we demonstrate that S-PRESSO outperforms both continuous and discrete baselines in audio quality, acoustic similarity and reconstruction metrics.

* International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2026, Barcelone, Spain

Via

Access Paper or Ask Questions

Twenty-Five Years of MIR Research: Achievements, Practices, Evaluations, and Future Challenges

Nov 10, 2025

Geoffroy Peeters, Zafar Rafii, Magdalena Fuentes, Zhiyao Duan, Emmanouil Benetos, Juhan Nam, Yuki Mitsufuji

Abstract:In this paper, we trace the evolution of Music Information Retrieval (MIR) over the past 25 years. While MIR gathers all kinds of research related to music informatics, a large part of it focuses on signal processing techniques for music data, fostering a close relationship with the IEEE Audio and Acoustic Signal Processing Technical Commitee. In this paper, we reflect the main research achievements of MIR along the three EDICS related to music analysis, processing and generation. We then review a set of successful practices that fuel the rapid development of MIR research. One practice is the annual research benchmark, the Music Information Retrieval Evaluation eXchange, where participants compete on a set of research tasks. Another practice is the pursuit of reproducible and open research. The active engagement with industry research and products is another key factor for achieving large societal impacts and motivating younger generations of students to join the field. Last but not the least, the commitment to diversity, equity and inclusion ensures MIR to be a vibrant and open community where various ideas, methodologies, and career pathways collide. We finish by providing future challenges MIR will have to face.

Via

Access Paper or Ask Questions

The Inverse Drum Machine: Source Separation Through Joint Transcription and Analysis-by-Synthesis

May 06, 2025

Bernardo Torres, Geoffroy Peeters, Gael Richard

Figure 1 for The Inverse Drum Machine: Source Separation Through Joint Transcription and Analysis-by-Synthesis

Figure 2 for The Inverse Drum Machine: Source Separation Through Joint Transcription and Analysis-by-Synthesis

Figure 3 for The Inverse Drum Machine: Source Separation Through Joint Transcription and Analysis-by-Synthesis

Figure 4 for The Inverse Drum Machine: Source Separation Through Joint Transcription and Analysis-by-Synthesis

Abstract:We introduce the Inverse Drum Machine (IDM), a novel approach to drum source separation that combines analysis-by-synthesis with deep learning. Unlike recent supervised methods that rely on isolated stems, IDM requires only transcription annotations. It jointly optimizes automatic drum transcription and one-shot drum sample synthesis in an end-to-end framework. By convolving synthesized one-shot samples with estimated onsets-mimicking a drum machine-IDM reconstructs individual drum stems and trains a neural network to match the original mixture. Evaluations on the StemGMD dataset show that IDM achieves separation performance on par with state-of-the-art supervised methods, while substantially outperforming matrix decomposition baselines.

Via

Access Paper or Ask Questions

Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning

Feb 17, 2025

Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid

Figure 1 for Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning

Figure 2 for Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning

Figure 3 for Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning

Figure 4 for Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning

Abstract:Recently, self-supervised learning methods based on masked latent prediction have proven to encode input data into powerful representations. However, during training, the learned latent space can be further transformed to extract higher-level information that could be more suited for downstream classification tasks. Therefore, we propose a new method: MAsked latenT Prediction And Classification (MATPAC), which is trained with two pretext tasks solved jointly. As in previous work, the first pretext task is a masked latent prediction task, ensuring a robust input representation in the latent space. The second one is unsupervised classification, which utilises the latent representations of the first pretext task to match probability distributions between a teacher and a student. We validate the MATPAC method by comparing it to other state-of-the-art proposals and conducting ablations studies. MATPAC reaches state-of-the-art self-supervised learning results on reference audio classification datasets such as OpenMIC, GTZAN, ESC-50 and US8K and outperforms comparable supervised methods results for musical auto-tagging on Magna-tag-a-tune.

* Copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Via

Access Paper or Ask Questions

Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures

Nov 29, 2024

Alain Riou, Antonin Gagneré, Gaëtan Hadjeres, Stefan Lattner, Geoffroy Peeters

Abstract:In this paper, we tackle the task of musical stem retrieval. Given a musical mix, it consists in retrieving a stem that would fit with it, i.e., that would sound pleasant if played together. To do so, we introduce a new method based on Joint-Embedding Predictive Architectures, where an encoder and a predictor are jointly trained to produce latent representations of a context and predict latent representations of a target. In particular, we design our predictor to be conditioned on arbitrary instruments, enabling our model to perform zero-shot stem retrieval. In addition, we discover that pretraining the encoder using contrastive learning drastically improves the model's performance. We validate the retrieval performances of our model using the MUSDB18 and MoisesDB datasets. We show that it significantly outperforms previous baselines on both datasets, showcasing its ability to support more or less precise (and possibly unseen) conditioning. We also evaluate the learned embeddings on a beat tracking task, demonstrating that they retain temporal structure and local information.

* Submitted to ICASSP 2025

Via

Access Paper or Ask Questions

A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning

Nov 06, 2024

Antonin Gagnere, Geoffroy Peeters, Slim Essid

Figure 1 for A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning

Figure 2 for A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning

Figure 3 for A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning

Figure 4 for A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning

Abstract:In this paper, we propose a novel Self-Supervised-Learning scheme to train rhythm analysis systems and instantiate it for few-shot beat tracking. Taking inspiration from the Contrastive Predictive Coding paradigm, we propose to train a Log-Mel-Spectrogram Transformer encoder to contrast observations at times separated by hypothesized beat intervals from those that are not. We do this without the knowledge of ground-truth tempo or beat positions, as we rely on the local maxima of a Predominant Local Pulse function, considered as a proxy for Tatum positions, to define candidate anchors, candidate positives (located at a distance of a power of two from the anchor) and negatives (remaining time positions). We show that a model pre-trained using this approach on the unlabeled FMA, MTT and MTG-Jamendo datasets can successfully be fine-tuned in the few-shot regime, i.e. with just a few annotated examples to get a competitive beat-tracking performance.

* ISMIR 2024, Nov 2024, San Francisco, Californ, United States

Via

Access Paper or Ask Questions

Episodic fine-tuning prototypical networks for optimization-based few-shot learning: Application to audio classification

Oct 04, 2024

Xuanyu Zhuang, Geoffroy Peeters, Gaël Richard

Abstract:The Prototypical Network (ProtoNet) has emerged as a popular choice in Few-shot Learning (FSL) scenarios due to its remarkable performance and straightforward implementation. Building upon such success, we first propose a simple (yet novel) method to fine-tune a ProtoNet on the (labeled) support set of the test episode of a C-way-K-shot test episode (without using the query set which is only used for evaluation). We then propose an algorithmic framework that combines ProtoNet with optimization-based FSL algorithms (MAML and Meta-Curvature) to work with such a fine-tuning method. Since optimization-based algorithms endow the target learner model with the ability to fast adaption to only a few samples, we utilize ProtoNet as the target model to enhance its fine-tuning performance with the help of a specifically designed episodic fine-tuning strategy. The experimental results confirm that our proposed models, MAML-Proto and MC-Proto, combined with our unique fine-tuning method, outperform regular ProtoNet by a large margin in few-shot audio classification tasks on the ESC-50 and Speech Commands v2 datasets. We note that although we have only applied our model to the audio domain, it is a general method and can be easily extended to other domains.

* 2024 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2024), Sep 2024, London (UK), United Kingdom
* Accepted at MLSP 2024

Via

Access Paper or Ask Questions

Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation

Aug 05, 2024

Alain Riou, Stefan Lattner, Gaëtan Hadjeres, Michael Anslow, Geoffroy Peeters

Figure 1 for Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation

Figure 2 for Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation

Figure 3 for Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation

Figure 4 for Stem-JEPA: A Joint-Embedding Predictive Architecture for Musical Stem Compatibility Estimation

Abstract:This paper explores the automated process of determining stem compatibility by identifying audio recordings of single instruments that blend well with a given musical context. To tackle this challenge, we present Stem-JEPA, a novel Joint-Embedding Predictive Architecture (JEPA) trained on a multi-track dataset using a self-supervised learning approach. Our model comprises two networks: an encoder and a predictor, which are jointly trained to predict the embeddings of compatible stems from the embeddings of a given context, typically a mix of several instruments. Training a model in this manner allows its use in estimating stem compatibility - retrieving, aligning, or generating a stem to match a given mix - or for downstream tasks such as genre or key estimation, as the training paradigm requires the model to learn information related to timbre, harmony, and rhythm. We evaluate our model's performance on a retrieval task on the MUSDB18 dataset, testing its ability to find the missing stem from a mix and through a subjective user study. We also show that the learned embeddings capture temporal alignment information and, finally, evaluate the representations learned by our model on several downstream tasks, highlighting that they effectively capture meaningful musical features.

* Proceedings of the 25th International Society for Music Information Retrieval Conference, ISMIR 2024

Via

Access Paper or Ask Questions

Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning

May 14, 2024

Alain Riou, Stefan Lattner, Gaëtan Hadjeres, Geoffroy Peeters

Figure 1 for Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning

Figure 2 for Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning

Figure 3 for Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning

Figure 4 for Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning

Abstract:This paper addresses the problem of self-supervised general-purpose audio representation learning. We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task, which consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments by evaluating our models on various audio classification benchmarks, including environmental sounds, speech and music downstream tasks. We focus notably on which part of the input data is used as context or target and show experimentally that it significantly impacts the model's quality. In particular, we notice that some effective design choices in the image domain lead to poor performance on audio, thus highlighting major differences between these two modalities.

* Self-supervision in Audio, Speech and Beyond workshop, IEEE International Conference on Acoustics, Speech, and Signal Processing, 2024

Via

Access Paper or Ask Questions