Title: Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis

May 25, 2025

Minsu Kim, Pingchuan Ma, Honglie Chen, Stavros Petridis, Maja Pantic

Abstract: This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS), where the voice is generated from a face image and the characteristics of the output speech (e.g., pace, noise level, distance, tone, place) can be controlled with a natural-language text description. Specifically, we aim to mitigate the following three challenges in face-driven TTS systems. 1) To overcome the limited audio quality of audio-visual speech corpora, we propose a training method that additionally utilizes high-quality audio-only speech corpora. 2) To generate voices not only from real human faces but also from artistic portraits, we propose augmenting the input face image with stylization. 3) To account for the one-to-many nature of face-to-voice mapping while still ensuring consistent voice generation, we propose to first employ sampling-based decoding and then use prompting with the generated speech samples. Experimental results validate the proposed model's effectiveness in face-driven voice synthesis.

* Interspeech 2025
