Title: CAST-TTS: A Simple Cross-Attention Framework for Unified Timbre Control in TTS

Mar 17, 2026

Zihao Zheng, Wen Wu, Chao Zhang, Mengyue Wu, Xuenan Xu

Abstract: Current Text-to-Speech (TTS) systems typically use separate models for speech-prompted and text-prompted timbre control. While unifying both control signals in a single model is desirable, the challenge of cross-modal alignment often results in overly complex architectures and training objectives. To address this challenge, we propose CAST-TTS, a simple yet effective framework for unified timbre control. Features are extracted from speech prompts and text prompts using pre-trained encoders. A multi-stage training strategy efficiently aligns the speech representations and the projected text representations within a shared embedding space. A single cross-attention mechanism then allows the model to use either of these representations to control the timbre. Extensive experiments validate that the unified cross-attention mechanism is critical for achieving high-quality synthesis. CAST-TTS achieves performance comparable to specialized single-input models while operating within a unified architecture. The demo page can be accessed at https://HiRookie9.github.io/CAST-TTS-Page.
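The idea of one cross-attention mechanism serving both prompt types can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no dimensions, layer counts, or encoder details, so the function `cross_attention`, the projection `w_proj`, all sizes, and the random stand-ins for encoder outputs are assumptions chosen only for illustration. The sketch shows that once text-prompt features are projected into the same embedding width as speech-prompt features, the same attention weights can consume either one.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(hidden, prompt, w_q, w_k, w_v):
    """Single-head cross-attention: TTS hidden states query the timbre prompt."""
    q = hidden @ w_q                          # (T, d)
    k = prompt @ w_k                          # (P, d)
    v = prompt @ w_v                          # (P, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, P)
    return softmax(scores, axis=-1) @ v       # (T, d)

rng = np.random.default_rng(0)
d_model, d_text = 16, 24                      # hypothetical sizes

# One set of attention weights shared by both prompt types (assumption).
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
# Hypothetical linear projection mapping text features into the shared space.
w_proj = rng.normal(size=(d_text, d_model)) * 0.1

hidden = rng.normal(size=(50, d_model))             # stand-in for TTS hidden states
speech_prompt = rng.normal(size=(30, d_model))      # stand-in for speech-encoder output
text_prompt = rng.normal(size=(12, d_text)) @ w_proj  # projected text-encoder output

# The same layer handles either prompt; only the prompt tensor changes.
out_speech = cross_attention(hidden, speech_prompt, w_q, w_k, w_v)
out_text = cross_attention(hidden, text_prompt, w_q, w_k, w_v)
```

In this sketch the prompt length may differ between the two modalities (30 versus 12 frames here) because attention only requires the feature widths to match; how the paper's training stages actually align the two representations is not specified in the abstract and is not modeled here.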

* Submitted to Interspeech 2026
