Abstract:Music source restoration (MSR) aims to recover unprocessed stems from mixed and mastered recordings. The challenge lies in both separating overlapping sources and reconstructing signals degraded by production effects such as compression and reverberation. We therefore propose DTT-BSR, a hybrid generative adversarial network (GAN) combining rotary positional embeddings (RoPE) transformer for long-term temporal modeling with dual-path band-split recurrent neural network (RNN) for multi-resolution spectral processing. Our model achieved 3rd place on the objective leaderboard and 4th place on the subjective leaderboard on the ICASSP 2026 MSR Challenge, demonstrating exceptional generation fidelity and semantic alignment with a compact size of 7.1M parameters.
Abstract:Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework, and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module instead of the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.