Ningning Pan

DTT-BSR: GAN-based DTTNet with RoPE Transformer Enhancement for Music Source Restoration

Feb 23, 2026

Shihong Tan, Haoyu Wang, Youran Ni, Yingzhao Hou, Jiayue Luo, Zipei Hu, Han Dou, Zerui Han, Ningning Pan, Yuzhu Wang(+1 more)

Abstract: Music source restoration (MSR) aims to recover unprocessed stems from mixed and mastered recordings. The challenge lies in both separating overlapping sources and reconstructing signals degraded by production effects such as compression and reverberation. We therefore propose DTT-BSR, a hybrid generative adversarial network (GAN) that combines a rotary positional embedding (RoPE) transformer for long-term temporal modeling with a dual-path band-split recurrent neural network (RNN) for multi-resolution spectral processing. Our model achieved 3rd place on the objective leaderboard and 4th place on the subjective leaderboard of the ICASSP 2026 MSR Challenge, demonstrating exceptional generation fidelity and semantic alignment with a compact size of 7.1M parameters.
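As a rough illustration of the rotary positional embedding the abstract mentions, the sketch below applies the standard RoPE rotation to a single feature vector. It is not taken from the DTT-BSR code: the function names `rope_rotate` and `dot`, the base of 10000, and the pairing of adjacent features are assumptions chosen to match the common formulation.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles.

    Hypothetical helper: pair (vec[i], vec[i + 1]) is rotated by the angle
    position * base ** (-i / dim), the usual RoPE frequency schedule.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        # 2-D rotation of the pair (x, y) by theta.
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))
```

Because each pair is only rotated, the score `dot(rope_rotate(q, m), rope_rotate(k, n))` depends on the positions only through the offset `m - n`, which is the property that makes the embedding attractive for long sequences.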

TTMBA: Towards Text To Multiple Sources Binaural Audio Generation

Jul 22, 2025

Yuxuan He, Xiaoran Yang, Ningning Pan, Gongping Huang

Abstract: Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting the spatial information essential to immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the text into a structured format with time and spatial details for each sound event. Next, a pretrained mono audio generation network generates a mono audio clip for each event, with durations varying across events. These mono clips are then transformed into binaural audio by a binaural rendering neural network, using the spatial data from the LLM. Finally, the binaural clips are arranged by their start times, yielding multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.

* 5 pages, 3 figures, 2 tables
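The last stage of the TTMBA pipeline, arranging binaural clips by start time, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the event fields `start_s`, `left`, and `right` are hypothetical names, and summing overlapping clips sample by sample is an assumption about how simultaneous events are combined.

```python
def arrange_binaural_events(events, sample_rate):
    """Place stereo clips on one timeline at their start times and sum them.

    Each event is a dict with hypothetical keys:
      "start_s": start time in seconds
      "left", "right": lists of samples for the two ear channels
    """
    # Convert each start time to a sample offset.
    placed = [
        (round(e["start_s"] * sample_rate), e["left"], e["right"])
        for e in events
    ]
    # The timeline must be long enough for the latest-ending clip.
    total = max((offset + len(left) for offset, left, _ in placed), default=0)
    mix_left = [0.0] * total
    mix_right = [0.0] * total
    for offset, left, right in placed:
        for n, (l, r) in enumerate(zip(left, right)):
            # Overlapping events are added together (assumed behavior).
            mix_left[offset + n] += l
            mix_right[offset + n] += r
    return mix_left, mix_right
```

In this sketch the structured LLM output would supply `start_s`, while the binaural renderer would supply the `left` and `right` channels for each event.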
