Long-Khanh Pham

RESOUND: Speech Reconstruction from Silent Videos via Acoustic-Semantic Decomposed Modeling

May 28, 2025

Long-Khanh Pham, Thanh V. T. Tran, Minh-Tan Pham, Van Nguyen

Abstract: Lip-to-speech (L2S) synthesis, which reconstructs speech from visual cues, faces challenges in accuracy and naturalness due to limited supervision in capturing linguistic content, accents, and prosody. In this paper, we propose RESOUND, a novel L2S system that generates intelligible and expressive speech from silent talking face videos. Leveraging source-filter theory, our method involves two components: an acoustic path to predict prosody and a semantic path to extract linguistic features. This separation simplifies learning, allowing independent optimization of each representation. Additionally, we enhance performance by integrating speech units, a proven unsupervised speech representation technique, into waveform generation alongside mel-spectrograms. This allows RESOUND to synthesize prosodic speech while preserving content and speaker identity. Experiments conducted on two standard L2S benchmarks confirm the effectiveness of the proposed method across various metrics.

* Accepted at Interspeech 2025
