Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Michael Ungersböck

SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

Oct 26, 2025

Michael Ungersböck, Florian Grötschla, Luca A. Lanzendörfer, June Young Yi, Changho Choi, Roger Wattenhofer

Abstract:Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.

* Accepted at NeurIPS 2025

Via

Access Paper or Ask Questions

High-Fidelity Music Vocoder using Neural Audio Codecs

Feb 18, 2025

Luca A. Lanzendörfer, Florian Grötschla, Michael Ungersböck, Roger Wattenhofer

Figure 1 for High-Fidelity Music Vocoder using Neural Audio Codecs

Figure 2 for High-Fidelity Music Vocoder using Neural Audio Codecs

Figure 3 for High-Fidelity Music Vocoder using Neural Audio Codecs

Figure 4 for High-Fidelity Music Vocoder using Neural Audio Codecs

Abstract:While neural vocoders have made significant progress in high-fidelity speech synthesis, their application on polyphonic music has remained underexplored. In this work, we propose DisCoder, a neural vocoder that leverages a generative adversarial encoder-decoder architecture informed by a neural audio codec to reconstruct high-fidelity 44.1 kHz audio from mel spectrograms. Our approach first transforms the mel spectrogram into a lower-dimensional representation aligned with the Descript Audio Codec (DAC) latent space before reconstructing it to an audio signal using a fine-tuned DAC decoder. DisCoder achieves state-of-the-art performance in music synthesis on several objective metrics and in a MUSHRA listening study. Our approach also shows competitive performance in speech synthesis, highlighting its potential as a universal vocoder.

* Accepted at ICASSP 2025

Via

Access Paper or Ask Questions