Florian Grötschla

VCNAC: A Variable-Channel Neural Audio Codec for Mono, Stereo, and Surround Sound

Jan 21, 2026

Florian Grötschla, Arunasish Sen, Alessandro Lombardi, Guillermo Cámbara, Andreas Schwarz

Abstract: We present VCNAC, a variable-channel neural audio codec. Our approach features a single encoder and decoder parametrization that enables native inference for different channel setups, from mono speech to cinematic 5.1-channel surround audio. Channel compatibility objectives ensure that multi-channel content maintains perceptual quality when decoded to fewer channels. The shared representation enables training of generative language models on a single set of codebooks while supporting inference-time scalability across modalities and channel configurations. Evaluation using objective spatial audio metrics and subjective listening tests demonstrates that our unified approach maintains high reconstruction quality across mono, stereo, and surround audio configurations.

* Submitted to EUSIPCO 2026

Inductive Transfer Learning for Graph-Based Recommenders

Oct 26, 2025

Florian Grötschla, Elia Trachsel, Luca A. Lanzendörfer, Roger Wattenhofer

Abstract: Graph-based recommender systems are commonly trained in transductive settings, which limits their applicability to new users, items, or datasets. We propose NBF-Rec, a graph-based recommendation model that supports inductive transfer learning across datasets with disjoint user and item sets. Unlike conventional embedding-based methods that require retraining for each domain, NBF-Rec computes node embeddings dynamically at inference time. We evaluate the method on seven real-world datasets spanning movies, music, e-commerce, and location check-ins. NBF-Rec achieves competitive performance in zero-shot settings, where no target domain data is used for training, and demonstrates further improvements through lightweight fine-tuning. These results show that inductive transfer is feasible in graph-based recommendation and that interaction-level message passing supports generalization across datasets without requiring aligned users or items.

* Accepted at the New Perspectives in Graph Machine Learning Workshop at NeurIPS 2025

SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

Oct 26, 2025

Michael Ungersböck, Florian Grötschla, Luca A. Lanzendörfer, June Young Yi, Changho Choi, Roger Wattenhofer

Abstract: Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.

* Accepted at NeurIPS 2025

Multi-bit Audio Watermarking

Oct 02, 2025

Luca A. Lanzendörfer, Kyle Fearne, Florian Grötschla, Roger Wattenhofer

Abstract: We present Timbru, a post-hoc audio watermarking model that achieves state-of-the-art robustness and imperceptibility trade-offs without training an embedder-detector model. Given any 44.1 kHz stereo music snippet, our method performs per-audio gradient optimization to add imperceptible perturbations in the latent space of a pretrained audio VAE, guided by a combined message and perceptual loss. The watermark can then be extracted using a pretrained CLAP model. We evaluate 16-bit watermarking on MUSDB18-HQ against AudioSeal, WavMark, and SilentCipher across common filtering, noise, compression, resampling, cropping, and regeneration attacks. Our approach attains the best average bit error rates, while preserving perceptual quality, demonstrating an efficient, dataset-free path to imperceptible audio watermarking.
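
The bit error rate mentioned in the abstract above is the fraction of message bits that differ after extraction. The sketch below only illustrates that metric for a 16-bit message; the function name and the toy bit patterns are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def bit_error_rate(sent_bits, recovered_bits):
    """Return the fraction of positions where the two bit sequences differ."""
    sent = np.asarray(sent_bits, dtype=np.uint8)
    recovered = np.asarray(recovered_bits, dtype=np.uint8)
    if sent.shape != recovered.shape:
        raise ValueError("bit sequences must have the same length")
    return float(np.mean(sent != recovered))


# Hypothetical 16-bit message and an extraction with two flipped bits.
message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
extracted = list(message)
extracted[3] ^= 1   # flip bit 3
extracted[10] ^= 1  # flip bit 10

ber = bit_error_rate(message, extracted)
print(ber)  # 2 wrong bits out of 16
```

Averaging this value over many clips and attacks would give an average bit error rate of the kind the abstract reports.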

Bias beyond Borders: Global Inequalities in AI-Generated Music

Oct 02, 2025

Ahmet Solak, Florian Grötschla, Luca A. Lanzendörfer, Roger Wattenhofer

Abstract: While recent years have seen remarkable progress in music generation models, their biases across countries, languages, cultures, and musical genres remain underexplored. This gap is compounded by the lack of datasets and benchmarks that capture the global diversity of music. To address these challenges, we introduce GlobalDISCO, a large-scale dataset consisting of 73k music tracks generated by state-of-the-art commercial generative music models, along with paired links to 93k reference tracks in LAION-DISCO-12M. The dataset spans 147 languages and includes musical style prompts extracted from MusicBrainz and Wikipedia. The dataset is globally balanced, representing musical styles from artists across 79 countries and five continents. Our evaluation reveals large disparities in music quality and alignment with reference music between high-resource and low-resource regions. Furthermore, we find marked differences in model performance between mainstream and geographically niche genres, including cases where models generate music for regional genres that more closely align with the distribution of mainstream styles.

EuroSpeech: A Multilingual Speech Corpus

Oct 01, 2025

Samuel Pfisterer, Florian Grötschla, Luca A. Lanzendörfer, Florian Yan, Roger Wattenhofer

Abstract: Recent progress in speech processing has highlighted that high-quality performance across languages requires substantial training data for each individual language. While existing multilingual datasets cover many languages, they often contain insufficient data for most languages. Thus, trained models perform poorly on the majority of the supported languages. Our work addresses this challenge by introducing a scalable pipeline for constructing speech datasets from parliamentary recordings. The proposed pipeline includes robust components for media retrieval and a two-stage alignment algorithm designed to handle non-verbatim transcripts and long-form audio. Applying this pipeline to recordings from 22 European parliaments, we extract over 61k hours of aligned speech segments, achieving substantial per-language coverage with 19 languages exceeding 1k hours and 22 languages exceeding 500 hours of high-quality speech data. We obtain an average 41.8% reduction in word error rates over baselines when finetuning an existing ASR model on our dataset, demonstrating the usefulness of our approach.

* Published in the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), Track on Datasets and Benchmarks

Source Separation for A Cappella Music

Sep 30, 2025

Luca A. Lanzendörfer, Constantin Pinkl, Florian Grötschla

Abstract: In this work, we study the task of multi-singer separation in a cappella music, where the number of active singers varies across mixtures. To address this, we use a power set-based data augmentation strategy that expands limited multi-singer datasets into exponentially more training samples. To separate singers, we introduce SepACap, an adaptation of SepReformer, a state-of-the-art speaker separation model architecture. We adapt the model with periodic activations and a composite loss function that remains effective when stems are silent, enabling robust detection and separation. Experiments on the JaCappella dataset demonstrate that our approach achieves state-of-the-art performance in both full-ensemble and subset singer separation scenarios, outperforming spectrogram-based baselines while generalizing to realistic mixtures with varying numbers of singers.
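
One way to picture the power set-based augmentation described above is to build a training mixture from every non-empty subset of a song's singer stems. The sketch below assumes that mixtures are formed by summing the selected stems; the abstract does not confirm this detail, and the function name, singer names, and toy waveforms are all hypothetical.

```python
from itertools import combinations

import numpy as np


def power_set_mixtures(stems):
    """Yield (singer_names, mixture) for every non-empty subset of stems.

    `stems` maps a singer name to a 1-D waveform. All waveforms are assumed
    to share the same length and sample rate.
    """
    names = sorted(stems)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            # Assumption: a mixture is the plain sum of the selected stems.
            mixture = np.sum([stems[name] for name in subset], axis=0)
            yield subset, mixture


# Toy example: three one-second noise "stems" at 8 kHz.
rng = np.random.default_rng(0)
toy_stems = {name: rng.standard_normal(8000) for name in ("soprano", "alto", "bass")}
mixtures = list(power_set_mixtures(toy_stems))
print(len(mixtures))  # 2**3 - 1 non-empty subsets
```

With n stems per song this gives 2^n - 1 mixtures, which is the exponential growth in training samples the abstract refers to.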

Parametric Neural Amp Modeling with Active Learning

Jul 02, 2025

Florian Grötschla, Luca A. Lanzendörfer, Longxiang Jiao, Roger Wattenhofer

Abstract: We introduce PANAMA, an active learning framework for the training of end-to-end parametric guitar amp models using a WaveNet-like architecture. With PANAMA, one can create a virtual amp by recording samples that are determined by an active learning strategy to use a minimum number of data points (i.e., amp knob settings). We show that gradient-based optimization algorithms can be used to determine the optimal data points to sample, and that the approach helps under a constrained number of samples.

* Accepted at ISMIR 2025 as Late-Breaking Demo (LBD)

Benchmarking Music Generation Models and Metrics via Human Preference Studies

Jun 23, 2025

Florian Grötschla, Ahmet Solak, Luca A. Lanzendörfer, Roger Wattenhofer

Abstract: Recent advancements have brought generated music closer to human-created compositions, yet evaluating these models remains challenging. While human preference is the gold standard for assessing quality, translating these subjective judgments into objective metrics, particularly for text-audio alignment and music quality, has proven difficult. In this work, we generate 6k songs using 12 state-of-the-art models and conduct a survey of 15k pairwise audio comparisons with 2.5k human participants to evaluate the correlation between human preferences and widely used metrics. To the best of our knowledge, this work is the first to rank current state-of-the-art music generation models and metrics based on human preference. To further the field of subjective metric evaluation, we provide open access to our dataset of generated music and human evaluations.

* In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025

High-Fidelity Music Vocoder using Neural Audio Codecs

Feb 18, 2025

Luca A. Lanzendörfer, Florian Grötschla, Michael Ungersböck, Roger Wattenhofer

Abstract: While neural vocoders have made significant progress in high-fidelity speech synthesis, their application to polyphonic music has remained underexplored. In this work, we propose DisCoder, a neural vocoder that leverages a generative adversarial encoder-decoder architecture informed by a neural audio codec to reconstruct high-fidelity 44.1 kHz audio from mel spectrograms. Our approach first transforms the mel spectrogram into a lower-dimensional representation aligned with the Descript Audio Codec (DAC) latent space before reconstructing it to an audio signal using a fine-tuned DAC decoder. DisCoder achieves state-of-the-art performance in music synthesis on several objective metrics and in a MUSHRA listening study. Our approach also shows competitive performance in speech synthesis, highlighting its potential as a universal vocoder.

* Accepted at ICASSP 2025
