Title: EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

Aug 21, 2024

Yihong Lin, Liang Peng, Jianqiao Hu, Xiandong Li, Wenxiong Kang, Songju Lei, Xianjia Wu, Huang Xu

Abstract: The creation of increasingly vivid 3D virtual digital humans has become a hot topic in recent years. Currently, most speech-driven work focuses on training models to learn the relationship between phonemes and visemes to achieve more realistic lip movements. However, these methods fail to capture the correlations between emotions and facial expressions effectively. To solve this problem, we propose a new model, termed EmoFace. EmoFace employs a novel Mesh Attention mechanism, which helps to learn potential feature dependencies between mesh vertices in time and space. We also adopt, for the first time to our knowledge, an effective self-growing training scheme that combines teacher-forcing and scheduled sampling in a 3D face animation task. Additionally, since EmoFace is an autoregressive model, the first frame of the training data is not required to be a silent frame, which greatly reduces the data limitations and helps to address the current shortage of datasets. Comprehensive quantitative and qualitative evaluations on our proposed high-quality reconstructed 3D emotional facial animation dataset, 3D-RAVDESS ($5.0343\times 10^{-5}$ mm for LVE and $1.0196\times 10^{-5}$ mm for EVE), and on the publicly available VOCASET dataset ($2.8669\times 10^{-5}$ mm for LVE and $0.4664\times 10^{-5}$ mm for EVE), demonstrate that our algorithm achieves state-of-the-art performance.
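
The abstract mentions combining teacher-forcing with scheduled sampling for an autoregressive model. As a rough illustration of that general idea only, the sketch below shows generic scheduled sampling in plain Python. It is not the authors' self-growing scheme: the function names (`teacher_forcing_prob`, `rollout`), the linear decay schedule, and the scalar "frames" are all assumptions made for the example.

```python
import random


def teacher_forcing_prob(epoch, total_epochs):
    """Assumed schedule: linearly decay the chance of feeding ground truth from 1.0 to 0.0."""
    if total_epochs <= 1:
        return 0.0
    return max(0.0, 1.0 - epoch / (total_epochs - 1))


def rollout(step_fn, ground_truth, first_frame, tf_prob, rng):
    """Predict one frame per ground-truth frame, autoregressively.

    step_fn(prev_frame) is a hypothetical stand-in for the model.
    After each step, the next input is the ground-truth frame with
    probability tf_prob (teacher forcing) and the model's own
    prediction otherwise (scheduled sampling).
    """
    prev = first_frame  # any starting frame; it need not be a silent one
    predictions = []
    for target in ground_truth:
        pred = step_fn(prev)
        predictions.append(pred)
        prev = target if rng.random() < tf_prob else pred
    return predictions


# Toy usage: a "model" that adds 1.0 to the previous scalar frame.
rng = random.Random(0)
toy_model = lambda prev: prev + 1.0
truth = [10.0, 20.0, 30.0]
forced = rollout(toy_model, truth, 0.0, 1.0, rng)  # always fed ground truth
free = rollout(toy_model, truth, 0.0, 0.0, rng)    # always fed its own output
```

With `tf_prob` at 1.0 every prediction is conditioned on the true previous frame, while at 0.0 the model consumes only its own outputs, which matches inference-time behaviour. Intermediate values mix the two, and a schedule such as the assumed linear one moves training gradually from the first regime to the second.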
