Jordie Shier

Differentiable Modelling of Percussive Audio with Transient and Spectral Synthesis

Sep 13, 2023
Jordie Shier, Franco Caspe, Andrew Robertson, Mark Sandler, Charalampos Saitis, Andrew McPherson

Differentiable digital signal processing (DDSP) techniques, including methods for audio synthesis, have gained attention in recent years and lend themselves to interpretability in the parameter space. However, current differentiable synthesis methods have not explicitly sought to model the transient portion of signals, which is important for percussive sounds. In this work, we present a unified synthesis framework aiming to address transient generation and percussive synthesis within a DDSP paradigm. To this end, we propose a model for percussive synthesis that builds on sinusoidal modelling synthesis and incorporates a modulated temporal convolutional network for transient generation. We use a modified sinusoidal peak-picking algorithm to generate time-varying non-harmonic sinusoids and pair it with differentiable noise and transient encoders that are jointly trained to reconstruct drumset sounds. Using a set of reconstruction metrics computed over a large dataset of acoustic and electronic percussion samples, we show that our method improves onset signal reconstruction for membranophone percussion instruments.

* To be published in The Proceedings of Forum Acusticum, Sep 2023, Turin, Italy 
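
To make the three-branch design described above concrete, here is a minimal PyTorch sketch of a decoder that sums sinusoidal, noise, and transient branches. The module names, layer sizes, frequency mapping, and impulse-shaping trick are illustrative assumptions, not the authors' implementation; the actual model, peak-picking algorithm, and modulated TCN conditioning are specified in the paper.

```python
import torch
import torch.nn as nn

class PercussiveDecoderSketch(nn.Module):
    """Hypothetical three-branch decoder: sinusoids + noise + transient TCN.
    Illustrative only; the paper specifies the actual architecture."""

    def __init__(self, n_sinusoids=64, hidden=128):
        super().__init__()
        self.sine_head = nn.Linear(hidden, n_sinusoids * 2)  # per-partial amp, freq
        self.noise_head = nn.Linear(hidden, 1)               # global noise gain
        self.transient_tcn = nn.Sequential(                  # stand-in for a modulated TCN
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=9, padding=4),
        )

    def forward(self, z, n_samples, sample_rate=44100):
        t = torch.arange(n_samples) / sample_rate
        amps, freqs = self.sine_head(z).chunk(2, dim=-1)
        freqs = 20.0 + 2000.0 * freqs.sigmoid()              # crude map to 20-2020 Hz
        # Non-harmonic sinusoidal bank (time-invariant here for brevity;
        # the paper derives time-varying sinusoids via peak picking).
        sines = (amps.sigmoid().unsqueeze(-1)
                 * torch.sin(2 * torch.pi * freqs.unsqueeze(-1) * t)).sum(dim=1)
        noise = torch.randn(z.shape[0], n_samples) * self.noise_head(z).sigmoid()
        # Transient branch: shape a unit impulse into a short onset waveform.
        impulse = torch.zeros(z.shape[0], 1, n_samples)
        impulse[..., 0] = 1.0
        transient = self.transient_tcn(impulse).squeeze(1)
        return sines + noise + transient  # every branch is differentiable
```

Because each branch is built from ordinary tensor operations, a reconstruction loss on the summed output trains all three encoders jointly, which is the property the abstract relies on.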

A Review of Differentiable Digital Signal Processing for Music & Speech Synthesis

Aug 29, 2023
Ben Hayes, Jordie Shier, György Fazekas, Andrew McPherson, Charalampos Saitis

The term "differentiable digital signal processing" describes a family of techniques in which loss function gradients are backpropagated through digital signal processors, facilitating their integration into neural networks. This article surveys the literature on differentiable audio signal processing, focusing on its use in music & speech synthesis. We catalogue applications to tasks including music performance rendering, sound matching, and voice transformation, discussing the motivations for and implications of the use of this methodology. This is accompanied by an overview of digital signal processing operations that have been implemented differentiably. Finally, we highlight open challenges, including optimisation pathologies, robustness to real-world conditions, and design trade-offs, and discuss directions for future research.

* Under review for Frontiers in Signal Processing 
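
To illustrate the core mechanism the survey covers (a toy example, not any specific system from the literature): because a synthesizer written in ordinary tensor operations is differentiable, an audio-domain loss can be backpropagated to its parameters. Below, the amplitude and decay rate of a decaying sinusoid are fitted to a target by gradient descent; frequency is deliberately held fixed, since optimizing oscillator frequency this way is a known difficulty in this literature.

```python
import torch

sample_rate = 16000
t = torch.arange(8000) / sample_rate

# Target: a 440 Hz tone with amplitude 0.5 and decay rate 3.0.
target = 0.5 * torch.exp(-3.0 * t) * torch.sin(2 * torch.pi * 440.0 * t)

# Learnable parameters of the differentiable "processor".
amp = torch.tensor(0.1, requires_grad=True)
decay = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([amp, decay], lr=0.05)

for step in range(200):
    # The synthesizer is plain differentiable tensor math, so gradients
    # of the audio-domain loss flow back to its parameters.
    audio = amp * torch.exp(-decay * t) * torch.sin(2 * torch.pi * 440.0 * t)
    loss = torch.mean((audio - target) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(amp.item(), decay.item())  # approaches 0.5 and 3.0
```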

HEAR 2021: Holistic Evaluation of Audio Representations

Mar 26, 2022
Joseph Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn W. Schuller, Christian J. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk

What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR 2021 NeurIPS challenge is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It remains an open question whether one single general-purpose audio representation can perform as holistically as the human ear.

* To appear in Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track 
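
The common API mentioned above is documented at https://hearbenchmark.com; in outline, each submission is a Python module exposing load_model, get_timestamp_embeddings, and get_scene_embeddings. The skeleton below sketches that shape with a trivial spectrogram-slice embedding; the dummy model, frame sizes, and embedding dimensions are placeholders, not any submitted system.

```python
import torch

class DummyModel(torch.nn.Module):
    # Attributes the HEAR tooling expects on a loaded model.
    sample_rate = 16000
    scene_embedding_size = 128
    timestamp_embedding_size = 128

def load_model(model_file_path: str = "") -> DummyModel:
    """Return a model object; real entries load weights from model_file_path."""
    return DummyModel()

def get_timestamp_embeddings(audio: torch.Tensor, model: DummyModel):
    """audio: (n_sounds, n_samples) at model.sample_rate.
    Returns (embeddings, timestamps): embeddings of shape
    (n_sounds, n_frames, size) and per-frame centre times in milliseconds."""
    hop, win = 800, 1600                       # 50 ms hop, 100 ms window (illustrative)
    frames = audio.unfold(1, win, hop)         # (n_sounds, n_frames, win)
    emb = torch.fft.rfft(frames).abs()[..., :model.timestamp_embedding_size]
    centres = (torch.arange(frames.shape[1]) * hop + win / 2) / model.sample_rate
    timestamps = (centres * 1000.0).expand(audio.shape[0], -1)
    return emb, timestamps

def get_scene_embeddings(audio: torch.Tensor, model: DummyModel) -> torch.Tensor:
    """One embedding per clip: here, mean-pooled timestamp embeddings."""
    emb, _ = get_timestamp_embeddings(audio, model)
    return emb.mean(dim=1)
```

Keeping the API this small is what let nineteen downstream tasks evaluate twenty-nine very different models without per-model glue code.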

One Billion Audio Sounds from GPU-enabled Modular Synthesis

Apr 27, 2021
Joseph Turian, Jordie Shier, George Tzanetakis, Kirk McNally, Max Henry

We release synth1B1, a multi-modal audio corpus consisting of 1 billion 4-second synthesized sounds, which is 100x larger than any audio dataset in the literature. Each sound is paired with the corresponding latent parameters used to generate it. synth1B1 samples are deterministically generated on-the-fly 16200x faster than real-time (714MHz) on a single GPU using torchsynth (https://github.com/torchsynth/torchsynth), an open-source modular synthesizer we release. Additionally, we release two new audio datasets: FM synth timbre (https://zenodo.org/record/4677102) and subtractive synth pitch (https://zenodo.org/record/4677097). Using these datasets, we demonstrate new rank-based synthesizer-motivated evaluation criteria for existing audio representations. Finally, we propose novel approaches to synthesizer hyperparameter optimization, and demonstrate how perceptually-correlated auditory distances could enable new applications in synthesizer design.
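
Since torchsynth is the released open-source package, a usage sketch can follow its documentation; the key point is the deterministic batch indexing that makes synth1B1 reproducible on any machine. The tuple return shown matches torchsynth 1.x as we understand it and may differ in other versions.

```python
import torch
from torchsynth.synth import Voice

# Default configuration: batches of 128 four-second patches at 44.1 kHz,
# i.e. audio of shape (128, 176400) per batch.
voice = Voice()
if torch.cuda.is_available():
    voice = voice.to("cuda")

# Batch n of synth1B1 is generated deterministically from the index n,
# so voice(0) yields the same 128 sounds everywhere.
# (Tuple of audio, latent parameters, and train/test flags per torchsynth 1.x;
# check your installed version's docs if this differs.)
audio, parameters, is_train = voice(0)
```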
