LS2N - SIMS team, Nantes Univ - ECN
Abstract:Using a stride in a convolutional layer inherently introduces aliasing, which has implications for numerical stability and statistical generalization. While techniques such as parametrizations via paraunitary systems have been used to promote orthogonal convolution and thus ensure Parseval stability, a general analysis of aliasing and its effects on stability has not been carried out in this context. In this article, we adapt a frame-theoretic approach to describe aliasing in convolutional layers with 1D kernels, leading to practical estimates for stability bounds and characterizations of Parseval stability that are tailored to short kernel sizes. From this, we derive two computationally very efficient optimization objectives that promote Parseval stability by systematically suppressing aliasing. Finally, for layers with random kernels, we derive closed-form expressions for the expected value and variance of the terms that describe the aliasing effects, revealing fundamental insights into the aliasing behavior at initialization.
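A minimal numerical sketch of this frame-theoretic view, assuming circular convolution on a finite grid, a stride that divides the FFT length, and illustrative kernel shapes; the alias-matrix construction follows standard polyphase analysis and is not necessarily the article's exact estimator.

```python
# Sketch: frame bounds of a stride-S 1-D convolutional layer via S x S alias matrices.
# Assumptions: circular convolution, n_fft divisible by the stride, random example kernels.
import numpy as np

def frame_bounds_strided(filters, stride, n_fft=1024):
    """Estimate frame bounds A, B from eigenvalues of the alias matrix on an FFT grid."""
    K, L = filters.shape
    S = stride
    W = np.fft.fft(filters, n=n_fft, axis=1)        # transfer functions, shape (K, n_fft)
    M = n_fft // S                                   # frequency bins per alias branch
    A, B = np.inf, 0.0
    for m in range(M):
        idx = (m + M * np.arange(S)) % n_fft         # omega, omega + 2*pi/S, ...
        Wm = W[:, idx]                               # shape (K, S)
        alias = (Wm.conj().T @ Wm) / S               # S x S alias (folding) matrix
        eig = np.linalg.eigvalsh(alias)              # ascending eigenvalues
        A, B = min(A, eig[0]), max(B, eig[-1])
    return A, B

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 9)) / np.sqrt(16 * 9)   # 16 random kernels of length 9
print(frame_bounds_strided(w, stride=4))
```

Under this normalization, the layer is Parseval-stable exactly when A = B = 1, and the off-diagonal entries of the alias matrix are the kind of aliasing terms that the proposed objectives aim to suppress.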
Abstract:Oxygenators, alarm devices, and footsteps are some of the most common sound sources in a hospital. Detecting them has scientific value for environmental psychology but comes with challenges of its own: namely, privacy preservation and limited labeled data. In this paper, we address these two challenges via a combination of edge computing and cloud computing. For privacy preservation, we have designed an acoustic sensor which computes third-octave spectrograms on the fly instead of recording audio waveforms. For sample-efficient machine learning, we have repurposed a pretrained audio neural network (PANN) via spectral transcoding and label space adaptation. A small-scale study in a neonatological intensive care unit (NICU) confirms that the time series of detected events align with another modality of measurement: i.e., electronic badges for parents and healthcare professionals. Hence, this paper demonstrates the feasibility of polyphonic machine listening in a hospital ward while guaranteeing privacy by design.
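As an illustration of the privacy-by-design principle, here is a hedged sketch of on-the-fly third-octave analysis, assuming 32 kHz audio, 125 ms frames, and nominal IEC-style band centers; the actual sensor firmware may differ.

```python
# Sketch: keep only third-octave band energies per frame, never the waveform itself.
import numpy as np

def third_octave_frames(x, sr=32000, frame_len=4000):
    """Return per-frame energies (dB) in third-octave bands from 25 Hz to 12.5 kHz (nominal)."""
    centers = 1000.0 * 2.0 ** (np.arange(-16, 12) / 3.0)
    lo, hi = centers * 2 ** (-1 / 6), centers * 2 ** (1 / 6)
    n_frames = len(x) // frame_len
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    bands = np.zeros((n_frames, len(centers)))
    for t in range(n_frames):
        spec = np.abs(np.fft.rfft(x[t * frame_len:(t + 1) * frame_len])) ** 2
        for b, (f1, f2) in enumerate(zip(lo, hi)):
            bands[t, b] = spec[(freqs >= f1) & (freqs < f2)].sum()
    return 10 * np.log10(bands + 1e-12)   # the audio frame is discarded after this step
```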
Abstract:Convolutional layers with 1-D filters are often used as a frontend to encode audio signals. Unlike fixed time-frequency representations, they can adapt to the local characteristics of input data. However, 1-D filters on raw audio are hard to train and often suffer from instabilities. In this paper, we address these problems with hybrid solutions, i.e., combining theory-driven and data-driven approaches. First, we preprocess the audio signals via an auditory filterbank, guaranteeing good frequency localization for the learned encoder. Second, we use results from frame theory to define an unsupervised learning objective that encourages energy conservation and perfect reconstruction. Third, we adapt mixed compressed spectral norms as learning objectives on the encoder coefficients. Using these solutions in a low-complexity encoder-mask-decoder model significantly improves the perceptual evaluation of speech quality (PESQ) in speech enhancement.
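A hedged sketch of what a frame-theoretic, unsupervised energy-conservation objective could look like, assuming a stride-1 encoder so that Parseval stability reduces to a flat Littlewood-Paley sum; the function and variable names are illustrative, not the paper's exact objective.

```python
# Sketch: penalize deviation of the encoder's Littlewood-Paley sum from 1.
import torch

def parseval_penalty(kernels, n_fft=2048):
    """Encourage sum_k |w_k_hat(omega)|^2 to stay close to 1 for all omega."""
    response = torch.fft.rfft(kernels, n=n_fft).abs() ** 2   # (n_filters, n_fft//2 + 1)
    littlewood_paley = response.sum(dim=0)
    return ((littlewood_paley - 1.0) ** 2).mean()

# Illustrative usage, added to the task loss:
# total = task_loss + lambda_ * parseval_penalty(encoder.conv.weight.squeeze(1))
```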
Abstract:Although deep neural networks can estimate the key of a musical piece, their supervision incurs a massive annotation effort. Against this shortcoming, we present STONE, the first self-supervised tonality estimator. The architecture behind STONE, named ChromaNet, is a convnet with octave equivalence which outputs a key signature profile (KSP) of 12 structured logits. First, we train ChromaNet to regress artificial pitch transpositions between any two unlabeled musical excerpts from the same audio track, as measured by cross-power spectral density (CPSD) within the circle of fifths (CoF). We observe that this self-supervised pretext task leads the KSP to correlate with the tonal key signature. Based on this observation, we extend STONE to output a structured KSP of 24 logits, and introduce supervision so as to disambiguate major versus minor keys sharing the same key signature. Applying different amounts of supervision yields semi-supervised and fully supervised tonality estimators: i.e., Semi-TONEs and Sup-TONEs. We evaluate these estimators on FMAK, a new dataset of 5489 real-world musical recordings with expert annotation of 24 major and minor keys. We find that Semi-TONE matches the classification accuracy of Sup-TONE with reduced supervision and outperforms it with equal supervision.
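To make the CPSD pretext concrete, here is an illustrative sketch of how a circular shift between two 12-bin profiles on the circle of fifths can be read off the phase of their cross-power spectral density; the differentiable formulation used to train ChromaNet is assumed to be analogous but is not reproduced here.

```python
# Sketch: recover a transposition (in fifths) from the phase of the CPSD of two profiles.
import numpy as np

def cof_cpsd_shift(ksp_a, ksp_b):
    """Estimate the circular shift between two 12-bin profiles ordered in fifths."""
    cpsd = np.fft.fft(ksp_a) * np.conj(np.fft.fft(ksp_b))   # cross-power spectral density
    angle = np.angle(cpsd[1])                                # phase of the first CoF harmonic
    return int(round(-12 * angle / (2 * np.pi))) % 12

profile = np.eye(12)[0]                 # a profile peaked at one key signature
shifted = np.roll(profile, 7)           # transpose by seven fifths
print(cof_cpsd_shift(shifted, profile)) # -> 7
```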
Abstract:In this article, we investigate the notion of model-based deep learning in the realm of music information research (MIR). Loosely speaking, we use the term model-based deep learning for approaches that combine traditional knowledge-based methods with data-driven techniques, especially those based on deep learning, within a differentiable computing framework. In music, prior knowledge, for instance related to sound production, music perception, or music composition theory, can be incorporated into the design of neural networks and associated loss functions. We outline three specific scenarios to illustrate the application of model-based deep learning in MIR, demonstrating the implementation of such concepts and their potential.
Abstract:Multi-label imbalanced classification poses a significant challenge in machine learning, particularly evident in bioacoustics where animal sounds often co-occur, and certain sounds are much less frequent than others. This paper focuses on the specific case of classifying anuran species sounds using the AnuraSet dataset, which contains both class imbalance and multi-label examples. To address these challenges, we introduce Mixture of Mixups (Mix2), a framework that leverages the mixing regularization methods Mixup, Manifold Mixup, and MultiMix. Experimental results show that these methods, applied individually, may lead to suboptimal results; however, when one of them is selected at random at each training iteration, they prove effective in addressing the mentioned challenges, particularly for rare classes with few occurrences. Further analysis reveals that Mix2 is also proficient in classifying sounds across various levels of class co-occurrence.
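A minimal sketch of the per-iteration random selection that Mix2 relies on, assuming multi-hot targets; only input-space Mixup is spelled out here, and the helper names (manifold_mixup, multimix) are placeholders for the analogous hidden-space and multi-sample variants.

```python
# Sketch: draw one mixing regularizer per training iteration.
import random
import torch

def mixup(x, y, alpha=0.2):
    """Convex combination of a batch with a shuffled copy of itself (multi-label safe)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

def training_step(model, criterion, x, y, methods):
    mix = random.choice(methods)          # one regularizer drawn per iteration
    x, y = mix(x, y)
    return criterion(model(x), y)

# methods = [mixup, manifold_mixup, multimix]   # the latter two defined analogously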
Abstract:Perceptual sound matching (PSM) aims to find the input parameters to a synthesizer so as to best imitate an audio target. Deep learning for PSM optimizes a neural network to analyze and reconstruct prerecorded samples. In this context, our article addresses the problem of designing a suitable loss function when the training set is generated by a differentiable synthesizer. Our main contribution is the perceptual-neural-physical loss (PNP), which addresses the tradeoff between perceptual relevance and computational efficiency. The key idea behind PNP is to linearize the effect of synthesis parameters upon auditory features in the vicinity of each training sample. The linearization procedure is massively parallelizable, can be precomputed, and offers a 100-fold speedup during gradient descent compared to differentiable digital signal processing (DDSP). We demonstrate PNP on two datasets of nonstationary sounds: an AM/FM arpeggiator and a physical model of rectangular membranes. We show that PNP is able to accelerate DDSP with the joint time-frequency scattering transform (JTFS) as auditory feature, while preserving its perceptual fidelity. Additionally, we evaluate the impact of other design choices in PSM: parameter rescaling, pretraining, auditory representation, and gradient clipping. We report state-of-the-art results on both datasets and find that PNP-accelerated JTFS has a greater influence on PSM performance than any other design choice.
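A hedged sketch of the linearization idea, assuming a differentiable synth(theta) and feature map features(audio), with torch.func.jacrev (PyTorch >= 2.0) used to precompute a per-sample metric M = J^T J; the function names are illustrative, not the article's implementation.

```python
# Sketch: PNP-style quadratic loss from a precomputed Jacobian of features(synth(.)).
import torch
from torch.func import jacrev

def pnp_metric(synth, features, theta):
    """Precompute M(theta) = J^T J, where J is the Jacobian of features(synth(.)) at theta."""
    J = jacrev(lambda t: features(synth(t)))(theta)   # shape (n_features, n_params)
    return J.T @ J                                     # shape (n_params, n_params)

def pnp_loss(theta_hat, theta, M):
    """Quadratic form approximating the perceptual distance near theta."""
    diff = theta_hat - theta
    return diff @ M @ diff
```

Because M depends only on the ground-truth theta, it can be computed once before training, which is where the reported speedup over backpropagating through the synthesizer and JTFS at every step comes from.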
Abstract:What makes waveform-based deep learning so hard? Despite numerous attempts at training convolutional neural networks (convnets) for filterbank design, they often fail to outperform hand-crafted baselines. This is all the more surprising because these baselines are linear time-invariant systems: as such, their transfer functions could be accurately represented by a convnet with a large receptive field. In this article, we elaborate on the statistical properties of simple convnets from the mathematical perspective of random convolutional operators. We find that FIR filterbanks with random Gaussian weights are ill-conditioned for large filters and locally periodic input signals, both of which are typical in audio signal processing applications. Furthermore, we observe that the expected energy preservation of a random filterbank is not sufficient for numerical stability and derive theoretical bounds for its expected frame bounds.
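The ill-conditioning can be checked empirically in a few lines, assuming stride-1 circular convolution and i.i.d. Gaussian taps normalized to unit expected energy; the numbers it prints are illustrative, not the article's theoretical bounds.

```python
# Sketch: empirical frame bounds of a random Gaussian FIR filterbank.
import numpy as np

def empirical_frame_bounds(n_filters, filter_len, n_fft=4096, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_filters, filter_len)) / np.sqrt(n_filters * filter_len)
    lp = (np.abs(np.fft.fft(w, n=n_fft, axis=1)) ** 2).sum(axis=0)   # Littlewood-Paley sum
    return lp.min(), lp.max()

for L in (16, 256, 4096):
    A, B = empirical_frame_bounds(n_filters=64, filter_len=L)
    print(f"L={L:5d}  A={A:.3f}  B={B:.3f}  B/A={B/A:.1f}")   # conditioning degrades as L grows
```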
Abstract:Waveform-based deep learning faces a dilemma between nonparametric and parametric approaches. On one hand, convolutional neural networks (convnets) may approximate any linear time-invariant system; yet, in practice, their frequency responses become more irregular as their receptive fields grow. On the other hand, a parametric model such as LEAF is guaranteed to yield Gabor filters, hence an optimal time-frequency localization; yet, this strong inductive bias comes at the detriment of representational capacity. In this paper, we aim to overcome this dilemma by introducing a neural audio model, named multiresolution neural network (MuReNN). The key idea behind MuReNN is to train separate convolutional operators over the octave subbands of a discrete wavelet transform (DWT). Since the scale of DWT atoms grows exponentially between octaves, the receptive fields of the subsequent learnable convolutions in MuReNN are dilated accordingly. For a given real-world dataset, we fit the magnitude response of MuReNN to that of a well-established auditory filterbank: Gammatone for speech, CQT for music, and third-octave for urban sounds. This is a form of knowledge distillation (KD), in which the filterbank "teacher" is engineered from domain knowledge while the neural network "student" is optimized from data. We compare MuReNN to the state of the art in terms of goodness of fit after KD on a hold-out set and in terms of Heisenberg time-frequency localization. Compared to convnets and Gabor convolutions, we find that MuReNN reaches state-of-the-art performance on all three optimization problems.
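A schematic MuReNN-like encoder, assuming the DWT is computed with PyWavelets outside the autograd graph and one learnable Conv1d is attached per octave subband; the wavelet, kernel size, and channel counts are illustrative, not the published configuration.

```python
# Sketch: learnable convolutions applied per octave subband of a DWT.
import pywt
import torch

class OctaveConvs(torch.nn.Module):
    def __init__(self, n_octaves=8, filters_per_octave=16, kernel_size=9):
        super().__init__()
        self.n_octaves = n_octaves
        self.convs = torch.nn.ModuleList(
            torch.nn.Conv1d(1, filters_per_octave, kernel_size, padding="same")
            for _ in range(n_octaves)
        )

    def forward(self, x):                      # x: 1-D waveform tensor on CPU
        coeffs = pywt.wavedec(x.numpy(), "db8", level=self.n_octaves)
        detail = coeffs[1:]                    # [cD_J, ..., cD_1], coarsest octave first
        outputs = []
        for conv, band in zip(self.convs, detail):
            band = torch.as_tensor(band, dtype=torch.float32).view(1, 1, -1)
            outputs.append(conv(band))         # learnable filters within this octave
        return outputs                         # one multi-channel subband per octave
```

Because each DWT level halves the sampling rate, a kernel of fixed length at a coarser octave spans proportionally more signal duration, which is how the receptive fields dilate across octaves without explicit dilation factors.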
Abstract:Few-shot bioacoustic event detection consists in detecting sound events of specified types, in varying soundscapes, while having access to only a few examples of the class of interest. This task ran as part of the DCASE challenge for the third time this year, with an evaluation set expanded to include new animal species and a new rule: ensemble models were no longer allowed. The 2023 few-shot task received submissions from 6 different teams, with F-scores reaching as high as 63% on the evaluation set. Here we describe the task, focusing on the elements that differed from previous years. We also look back at past editions to describe how the task has evolved. Not only have the F-score results steadily improved (40% to 60% to 63%), but the types of systems proposed have also become more complex. Sound event detection systems are no longer simple variations of the provided baselines: multiple few-shot learning methodologies remain strong contenders for the task.