Title: Discrete Token Modeling for Multi-Stem Music Source Separation with Language Models

Apr 10, 2026

Pengbo Lyu, Xiangyu Zhao, Chengwei Liu, Haoyin Yan, Xiaotao Liang, Hongyu Wang, Shaofei Xue

Abstract: We propose a generative framework for multi-track music source separation (MSS) that reformulates the task as conditional discrete token generation. Unlike conventional approaches that directly estimate continuous signals in the time or frequency domain, our method combines a Conformer-based conditional encoder, a dual-path neural audio codec (HCodec), and a decoder-only language model to autoregressively generate audio tokens for four target tracks. The generated tokens are decoded back to waveforms through the codec decoder. Evaluation on the MUSDB18-HQ benchmark shows that our generative approach achieves perceptual quality approaching state-of-the-art discriminative methods, while attaining the highest NISQA score on the vocals track. Ablation studies confirm the effectiveness of the learnable Conformer encoder and the benefit of sequential cross-track generation.

* 5 pages, 2 figures, 3 tables. Submitted to INTERSPEECH 2026
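
The sequential cross-track generation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Conformer encoder, HCodec, and the language model are all replaced by toy stand-ins, and the names `toy_next_token`, `generate_stems`, `VOCAB_SIZE`, and the stem order are assumptions made only for illustration. The sketch shows one idea: tokens for each track are generated autoregressively, and each later track is conditioned on the mixture features plus all tokens generated for earlier tracks.

```python
import random
from typing import Callable, Dict, List, Sequence

# The four MUSDB18 stems; the generation order here is an assumption.
STEMS = ["vocals", "drums", "bass", "other"]
VOCAB_SIZE = 1024  # hypothetical codec codebook size

NextToken = Callable[[Sequence[float], Sequence[int]], int]


def toy_next_token(condition: Sequence[float], prefix: Sequence[int]) -> int:
    """Toy stand-in for one decoder-only LM step.

    A real model would predict a distribution over codec tokens; this
    placeholder derives a deterministic token from the conditioning
    features and the tokens generated so far.
    """
    seed = int(sum(condition) * 1000) + 31 * len(prefix) + sum(prefix)
    return random.Random(seed).randrange(VOCAB_SIZE)


def generate_stems(
    condition: Sequence[float],
    tokens_per_stem: int,
    next_token: NextToken = toy_next_token,
) -> Dict[str, List[int]]:
    """Generate token sequences for all stems, one stem after another.

    The shared prefix grows across stems, so each stem sees the mixture
    condition and every token already generated for earlier stems.
    """
    prefix: List[int] = []
    stems: Dict[str, List[int]] = {}
    for stem in STEMS:
        start = len(prefix)
        for _ in range(tokens_per_stem):
            prefix.append(next_token(condition, prefix))
        stems[stem] = prefix[start:]
    return stems


if __name__ == "__main__":
    # Hypothetical mixture features standing in for encoder output.
    mixture_features = [0.1, 0.2, 0.3]
    for name, tokens in generate_stems(mixture_features, 5).items():
        print(name, tokens)
```

In a full system, each per-stem token sequence would then be passed to the codec decoder to reconstruct a waveform; that step is omitted here.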
