Minsik Park

JOLT3D: Joint Learning of Talking Heads and 3DMM Parameters with Application to Lip-Sync

Jul 28, 2025

Sungjoon Park, Minsik Park, Haneol Lee, Jaesub Yun, Donggeon Lee

Abstract: In this work, we revisit the effectiveness of 3DMM for talking head synthesis by jointly learning a 3D face reconstruction model and a talking head synthesis model. This enables us to obtain a FACS-based blendshape representation of facial expressions that is optimized for talking head synthesis. This contrasts with previous methods that either fit 3DMM parameters to 2D landmarks or rely on pretrained face reconstruction models. Not only does our approach increase the quality of the generated face, but it also allows us to take advantage of the blendshape representation to modify just the mouth region for the purpose of audio-based lip-sync. To this end, we propose a novel lip-sync pipeline that, unlike previous methods, decouples the original chin contour from the lip-synced chin contour, and reduces flickering near the mouth.

* 10 + 8 pages, 11 figures
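
The idea of editing only the mouth region through a blendshape representation can be illustrated with a minimal sketch. This is not the paper's pipeline: it assumes a generic linear blendshape model (neutral geometry plus weighted offsets), and the coefficient names in `MOUTH_KEYS` as well as both helper functions are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical mouth-related coefficient names; the abstract does not list
# the actual blendshapes used by the authors.
MOUTH_KEYS = {"jawOpen", "mouthSmile", "mouthPucker"}


def swap_mouth_coefficients(
    original: Dict[str, float], lip_synced: Dict[str, float]
) -> Dict[str, float]:
    """Keep non-mouth coefficients from `original`; take mouth ones from `lip_synced`."""
    return {
        name: (lip_synced.get(name, value) if name in MOUTH_KEYS else value)
        for name, value in original.items()
    }


def apply_blendshapes(
    neutral: List[float],
    deltas: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Generic linear blendshape model: neutral + sum_k weight_k * delta_k."""
    out = list(neutral)
    for name, weight in weights.items():
        for i, delta in enumerate(deltas[name]):
            out[i] += weight * delta
    return out


# Toy usage: two scalar "vertices", one mouth shape and one brow shape.
original = {"jawOpen": 0.1, "browUp": 0.7}
lip_synced = {"jawOpen": 0.6, "browUp": 0.0}
merged = swap_mouth_coefficients(original, lip_synced)
toy_deltas = {"jawOpen": [0.0, -1.0], "browUp": [1.0, 0.0]}
vertices = apply_blendshapes([0.0, 0.0], toy_deltas, merged)
```

In this toy setup the brow coefficient of the original frame is preserved while the jaw coefficient comes from the lip-synced prediction, which is the kind of region-specific edit a blendshape basis makes straightforward.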

Interpretable Convolutional SyncNet

Sep 02, 2024

Sungjoon Park, Jaesub Yun, Donggeon Lee, Minsik Park

Abstract: Because videos in the wild can be out of sync for various reasons, a sync-net is used to bring the video back into sync for tasks that require synchronized videos. Previous state-of-the-art (SOTA) sync-nets use InfoNCE loss, rely on the transformer architecture, or both. Unfortunately, the former makes the model's output difficult to interpret, and the latter is unfriendly to large images, thus limiting the usefulness of sync-nets. In this work, we train a convolutional sync-net using the balanced BCE loss (BBCE), a loss inspired by the binary cross entropy (BCE) and the InfoNCE losses. In contrast to the InfoNCE loss, the BBCE loss does not require complicated sampling schemes. Our model can better handle larger images, and its output can be given a probabilistic interpretation. The probabilistic interpretation allows us to define metrics such as probability at offset and offscreen ratio to evaluate the sync quality of audio-visual (AV) speech datasets. Furthermore, our model achieves SOTA accuracy of $96.5\%$ on the LRS2 dataset and $93.8\%$ on the LRS3 dataset.

* 8+5 pages
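
For readers unfamiliar with the two losses the abstract names as inspiration, the following is a minimal sketch of the standard BCE and InfoNCE losses applied to sync scores. It does not implement BBCE, whose definition is not given in the abstract; the function names and the use of raw logits as sync scores are assumptions made for illustration.

```python
import math
from typing import List


def sigmoid(logit: float) -> float:
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def bce_loss(logit: float, label: int) -> float:
    """Standard binary cross entropy for one (audio, video) pair.

    label = 1 means the pair is in sync, 0 means it is not.
    sigmoid(logit) can be read directly as a probability of being in sync.
    """
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def info_nce_loss(pos_logit: float, neg_logits: List[float]) -> float:
    """Standard InfoNCE: softmax cross entropy of the positive against negatives.

    The scores are only meaningful relative to each other, which is one reason
    an individual score is harder to interpret as a probability.
    """
    scores = [pos_logit] + list(neg_logits)
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - pos_logit


# Toy usage with made-up scores.
in_sync_probability = sigmoid(2.0)
loss_bce = bce_loss(2.0, 1)
loss_nce = info_nce_loss(2.0, [0.0, -1.0])
```

The contrast worth noting is that BCE scores each pair on its own, while InfoNCE needs a set of negatives per positive, which is where sampling schemes enter.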
