Jiawei Jin

VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis

Feb 01, 2026

Chengyuan Ma, Jiawei Jin, Ruijie Xiong, Chunxiang Jin, Canxiang Yan, Wenming Yang

Abstract: We introduce and define a novel task, Scene-Aware Visually-Driven Speech Synthesis, which addresses the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we construct a large-scale, high-quality hybrid multimodal dataset, Vivid-210K, which uses a programmatic pipeline to establish, for the first time, a strong correlation between visual scenes, speaker identity, and audio. Second, we design a core alignment module, D-MSVA, which leverages a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective experimental results provide strong evidence that VividVoice significantly outperforms existing baseline models in audio fidelity, content clarity, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.

* Accepted by ICASSP 2026


"In This Environment, As That Speaker": A Text-Driven Framework for Multi-Attribute Speech Conversion

Jun 08, 2025

Jiawei Jin, Zhuhan Yang, Yixuan Zhou, Zhiyong Wu

Abstract: We propose TES-VC (Text-driven Environment and Speaker controllable Voice Conversion), a text-driven voice conversion framework with independent control of speaker timbre and environmental acoustics. TES-VC takes simultaneous text descriptions of the target voice and the target environment and generates speech that matches the described timbre and environment while preserving the source content. Trained on synthetic data with decoupled vocal and environment features via latent diffusion modeling, our method eliminates interference between the two attributes. The Retrieval-Based Timbre Control (RBTC) module enables precise manipulation using abstract descriptions without paired data. Experiments confirm that TES-VC generates contextually appropriate speech in both timbre and environment, with high content retention and superior controllability, demonstrating its potential for widespread applications.

* Accepted by Interspeech 2025
