
Pablo Garrido


Implicit Neural Head Synthesis via Controllable Local Deformation Fields

Apr 21, 2023
Chuhan Chen, Matthew O'Toole, Gaurav Bharaj, Pablo Garrido


High-quality reconstruction of controllable 3D head avatars from 2D videos is highly desirable for virtual human applications in movies, games, and telepresence. Neural implicit fields provide a powerful representation for modeling 3D head avatars with personalized shape, expressions, and facial parts, e.g., hair and the mouth interior, that go beyond the linear 3D morphable model (3DMM). However, existing methods either do not model fine-scale facial features or lack the local control over facial parts needed to extrapolate asymmetric expressions from monocular videos. Further, most condition only on 3DMM parameters with poor locality and resolve local features with a single global neural field. We build on part-based implicit shape models that decompose a global deformation field into local ones. Our novel formulation models multiple implicit deformation fields with local semantic rig-like control via 3DMM-based parameters and representative facial landmarks. Further, we propose a local control loss and an attention mask mechanism that promote sparsity in each learned deformation field. Our formulation renders sharper, locally controllable nonlinear deformations than previous implicit monocular approaches, especially for the mouth interior, asymmetric expressions, and facial details.

* Accepted at CVPR 2023 
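
For intuition, here is a minimal, hypothetical PyTorch sketch of the core idea of summing several locally masked deformation fields, each driven by its own control code; all module names, sizes, and the sigmoid mask are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of summing locally masked deformation fields (not the authors' code).
import torch
import torch.nn as nn

class LocalDeformationField(nn.Module):
    """One small MLP that maps (3D point, local control code) to a 3D offset."""
    def __init__(self, code_dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )
        # Per-field attention mask over space, predicted from the query point alone.
        self.mask = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x, code):
        h = torch.cat([x, code.expand(x.shape[0], -1)], dim=-1)
        offset = self.mlp(h)
        weight = torch.sigmoid(self.mask(x))          # soft spatial support of this part
        return weight * offset, weight

class PartBasedDeformation(nn.Module):
    """Global deformation as the sum of K locally supported fields."""
    def __init__(self, num_parts, code_dim):
        super().__init__()
        self.fields = nn.ModuleList(LocalDeformationField(code_dim) for _ in range(num_parts))

    def forward(self, x, codes):                      # codes: list of per-part control codes
        offsets, weights = zip(*(f(x, c) for f, c in zip(self.fields, codes)))
        total = torch.stack(offsets).sum(dim=0)
        # A sparsity penalty on the masks (in the spirit of the paper's local control loss)
        # could be: weights_l1 = torch.stack(weights).abs().mean()
        return x + total

points = torch.rand(1024, 3)
codes = [torch.rand(8) for _ in range(4)]
deformed = PartBasedDeformation(num_parts=4, code_dim=8)(points, codes)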

Few-shot Geometry-Aware Keypoint Localization

Mar 30, 2023
Xingzhe He, Gaurav Bharaj, David Ferman, Helge Rhodin, Pablo Garrido


Supervised keypoint localization methods rely on large, manually labeled image datasets, where objects may deform, articulate, or be occluded. However, creating such large keypoint labels is time-consuming and costly, and often error-prone due to inconsistent labeling. Thus, we desire an approach that can learn keypoint localization from fewer yet consistently annotated images. To this end, we present a novel formulation that learns to localize semantically consistent keypoint definitions, even for occluded regions, across varying object categories. We use a few user-labeled 2D images as input examples, which are extended via self-supervision using a larger unlabeled dataset. Unlike unsupervised methods, the few-shot images act as semantic shape constraints for object localization. Furthermore, we introduce 3D geometry-aware constraints to uplift keypoints, achieving more accurate 2D localization. Our general-purpose formulation paves the way for semantically conditioned generative modeling and attains competitive or state-of-the-art accuracy on several datasets, including human faces, eyes, animals, cars, and never-before-seen mouth interior (teeth) localization tasks not attempted by previous few-shot methods. Project page: https://xingzhehe.github.io/FewShot3DKP/

* CVPR 2023 
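
As a rough illustration of mixing a few-shot supervised term with self-supervision on unlabeled images, here is a hypothetical PyTorch sketch that uses a generic rotation-equivariance loss as a stand-in for the paper's self-supervised and 3D geometry-aware terms; the detector interface, weights, and loss forms are assumptions.

# Generic few-shot + equivariance objective (illustrative stand-in, not the paper's losses).
import math
import torch
import torch.nn.functional as F

def fewshot_loss(detector, labeled_imgs, labeled_kp, unlabeled_imgs, angle=0.1, w=0.5):
    # Supervised term on the handful of user-labeled images.
    sup = F.l1_loss(detector(labeled_imgs), labeled_kp)

    # Self-supervised term: keypoints must move consistently with an in-plane rotation.
    c, s = math.cos(angle), math.sin(angle)
    rot = torch.tensor([[c, -s], [s, c]])
    theta = torch.cat([rot, torch.zeros(2, 1)], dim=1).expand(unlabeled_imgs.size(0), 2, 3)
    grid = F.affine_grid(theta, list(unlabeled_imgs.shape), align_corners=False)
    warped = F.grid_sample(unlabeled_imgs, grid, align_corners=False)
    kp = detector(unlabeled_imgs)            # (B, K, 2), (x, y) coords normalized to [-1, 1]
    self_sup = F.l1_loss(detector(warped), kp @ rot)
    return sup + w * self_sup

# Usage (hypothetical): loss = fewshot_loss(my_detector, imgs_l, kps_l, imgs_u)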

HQ3DAvatar: High Quality Controllable 3D Head Avatar

Mar 25, 2023
Kartik Teotia, Mallikarjun B R, Xingang Pan, Hyeongwoo Kim, Pablo Garrido, Mohamed Elgharib, Christian Theobalt


Multi-view volumetric rendering techniques have recently shown great potential for modeling and synthesizing high-quality head avatars. A common approach to capturing full-head dynamic performances is to track the underlying geometry using a mesh-based template or 3D cube-based graphics primitives. While these model-based approaches achieve promising results, they often fail to learn complex geometric details such as the mouth interior, hair, and topological changes over time. This paper presents a novel approach to building highly photorealistic digital head avatars. Our method learns a canonical space via an implicit function parameterized by a neural network. It leverages multiresolution hash encoding in the learned feature space, allowing for high-quality, fast training and high-resolution rendering. At test time, our method is driven by a monocular RGB video. Here, an image encoder extracts face-specific features that also condition the learnable canonical space, encouraging deformation-dependent texture variations during training. We also propose a novel optical-flow-based loss that ensures correspondences in the learned canonical space, thus encouraging artifact-free and temporally consistent renderings. We show results on challenging facial expressions and demonstrate free-viewpoint renderings at interactive real-time rates for medium image resolutions. Our method outperforms all existing approaches, both visually and numerically. We will release our multiple-identity dataset to encourage further research. Our project page is available at: https://vcai.mpi-inf.mpg.de/projects/HQ3DAvatar/

* 16 Pages, 15 Figures. Project page: https://vcai.mpi-inf.mpg.de/projects/HQ3DAvatar/ 
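
The conditioning structure can be sketched roughly as follows in PyTorch; this is a hypothetical toy version in which a frequency encoding stands in for the multiresolution hash encoding and a tiny CNN stands in for the image encoder, so none of it reflects the authors' actual architecture.

# Toy canonical field conditioned on per-frame image features (illustrative only).
import torch
import torch.nn as nn

def freq_encode(x, n_freq=6):
    # x: (N, 3) -> (N, 3 * 2 * n_freq); stand-in for a multiresolution hash encoding.
    bands = 2.0 ** torch.arange(n_freq) * torch.pi
    xb = x[..., None] * bands                        # (N, 3, n_freq)
    return torch.cat([xb.sin(), xb.cos()], dim=-1).flatten(1)

class CanonicalField(nn.Module):
    def __init__(self, code_dim=64, n_freq=6, hidden=128):
        super().__init__()
        in_dim = 3 * 2 * n_freq + code_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                    # RGB + density
        )

    def forward(self, pts, frame_code):
        h = torch.cat([freq_encode(pts), frame_code.expand(pts.size(0), -1)], dim=-1)
        out = self.mlp(h)
        return out[:, :3].sigmoid(), out[:, 3:].relu()   # rgb, sigma

# A tiny image encoder extracts face-specific features that condition the canonical space.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64))
frame_code = encoder(torch.rand(1, 3, 128, 128))[0]
rgb, sigma = CanonicalField()(torch.rand(4096, 3), frame_code)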

FML: Face Model Learning from Videos

Dec 18, 2018
Ayush Tewari, Florian Bernard, Pablo Garrido, Gaurav Bharaj, Mohamed Elgharib, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, Christian Theobalt


Monocular image-based 3D reconstruction of faces is a long-standing problem in computer vision. Since image data is a 2D projection of a 3D face, the resulting depth ambiguity makes the problem ill-posed. Most existing methods rely on data-driven priors built from limited 3D face scans. In contrast, we propose multi-frame, video-based self-supervised training of a deep network that (i) learns a face identity model in both shape and appearance while (ii) jointly learning to reconstruct 3D faces. Our face model is learned using only corpora of in-the-wild video clips collected from the Internet. This virtually endless source of training data enables learning of a highly general 3D face model. To achieve this, we propose a novel multi-frame consistency loss that ensures consistent shape and appearance across multiple frames of a subject's face, thus minimizing depth ambiguity. At test time we can use an arbitrary number of frames, so we can perform both monocular and multi-frame reconstruction.

* Video: https://www.youtube.com/watch?v=SG2BwxCw0lQ, Project Page: https://gvv.mpi-inf.mpg.de/projects/FML19/ 
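
A minimal sketch of what a multi-frame consistency term could look like, assuming per-frame identity codes are regressed from a clip of one subject (the tensor shapes and the squared-error form are illustrative assumptions, not the paper's loss):

# Hypothetical multi-frame consistency term: identity codes predicted from different
# frames of the same clip should agree, while per-frame expression and pose may vary.
import torch

def multi_frame_consistency(id_codes):
    # id_codes: (F, D) identity codes regressed from F frames of one subject's clip
    mean_code = id_codes.mean(dim=0, keepdim=True)
    return (id_codes - mean_code).pow(2).mean()

codes = torch.randn(8, 80)                  # e.g. 8 frames, 80-dim identity code
loss = multi_frame_consistency(codes)       # added to the per-frame reconstruction losses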

Deep Video Portraits

May 29, 2018
Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Nießner, Patrick Pérez, Christian Richardt, Michael Zollhöfer, Christian Theobalt


We present a novel approach that enables photo-realistic re-animation of portrait videos using only an input video. In contrast to existing approaches that are restricted to manipulating facial expressions only, we are the first to transfer the full 3D head position, head rotation, facial expression, eye gaze, and eye blinking from a source actor to a portrait video of a target actor. The core of our approach is a generative neural network with a novel space-time architecture. The network takes as input synthetic renderings of a parametric face model, based on which it predicts photo-realistic video frames for a given target actor. The realism of this rendering-to-video transfer is achieved by careful adversarial training, and as a result, we can create modified target videos that mimic the behavior of the synthetically created input. To enable source-to-target video re-animation, we render a synthetic target video with the reconstructed head animation parameters from a source video and feed it into the trained network, thus taking full control of the target. With the ability to freely recombine source and target parameters, we demonstrate a large variety of video rewrite applications without explicitly modeling hair, body, or background. For instance, we can reenact the full head using interactive user-controlled editing and realize high-fidelity visual dubbing. To demonstrate the high quality of our output, we conduct an extensive series of experiments and evaluations; for instance, a user study shows that our video edits are hard to detect.

* SIGGRAPH 2018, Video: https://www.youtube.com/watch?v=qc5P2bvfl44 
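
For intuition, a minimal, hypothetical sketch of the rendering-to-video translation setup with adversarial training, using toy networks and random tensors in place of the paper's architecture and data:

# Conditional adversarial training step for rendering-to-video translation (toy stand-ins).
import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(nn.Conv2d(3, 32, 3, 1, 1), nn.ReLU(), nn.Conv2d(32, 3, 3, 1, 1))
D = nn.Sequential(nn.Conv2d(6, 32, 4, 2, 1), nn.LeakyReLU(0.2), nn.Conv2d(32, 1, 4, 2, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

render = torch.rand(2, 3, 64, 64)            # synthetic rendering of the parametric face model
real = torch.rand(2, 3, 64, 64)              # corresponding real frame of the target actor

fake = G(render)                             # predicted photo-realistic frame
d_real = D(torch.cat([render, real], dim=1))
d_fake = D(torch.cat([render, fake.detach()], dim=1))
loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
         F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

d_fake2 = D(torch.cat([render, fake], dim=1))
loss_g = F.binary_cross_entropy_with_logits(d_fake2, torch.ones_like(d_fake2)) + \
         F.l1_loss(fake, real)               # reconstruction term keeps output near the target
opt_g.zero_grad(); loss_g.backward(); opt_g.step()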

Self-supervised Multi-level Face Model Learning for Monocular Reconstruction at over 250 Hz

Mar 29, 2018
Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick Pérez, Christian Theobalt


The reconstruction of dense 3D models of face geometry and appearance from a single image is highly challenging and ill-posed. To constrain the problem, many approaches rely on strong priors, such as parametric face models learned from limited 3D scan data. However, such prior models restrict generalization to the true diversity of facial geometry, skin reflectance, and illumination. To alleviate this problem, we present the first approach that jointly learns 1) a regressor for face shape, expression, reflectance, and illumination on the basis of 2) a concurrently learned parametric face model. Our multi-level face model combines the advantage of 3D Morphable Models for regularization with the out-of-space generalization of a learned corrective space. We train end-to-end on in-the-wild images without dense annotations by fusing a convolutional encoder with a differentiable, expert-designed renderer and a self-supervised training loss, both defined at multiple detail levels. Our approach compares favorably to the state of the art in terms of reconstruction quality, generalizes better to real-world faces, and runs at over 250 Hz.

* CVPR 2018 (Oral). Project webpage: https://gvv.mpi-inf.mpg.de/projects/FML/ 
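
A minimal sketch of the multi-level idea, assuming a frozen linear 3DMM base plus a learned corrective basis on top (all tensors here are random stand-ins, not real model bases):

# Coarse parametric model plus learned corrective space (hypothetical sizes and bases).
import torch
import torch.nn as nn

N_VERTS, K_BASE, K_CORR = 5000, 80, 40

mean_shape = torch.randn(N_VERTS * 3)
base_basis = torch.randn(N_VERTS * 3, K_BASE)               # frozen, from limited 3D scans
corrective = nn.Linear(K_CORR, N_VERTS * 3, bias=False)     # learned end-to-end on videos

def reconstruct(alpha, delta):
    # alpha: (K_BASE,) 3DMM coefficients, delta: (K_CORR,) corrective coefficients
    coarse = mean_shape + base_basis @ alpha                 # level 1: parametric model
    return coarse + corrective(delta)                        # level 2: learned correctives

verts = reconstruct(torch.randn(K_BASE), torch.randn(K_CORR)).view(N_VERTS, 3)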

MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction

Dec 07, 2017
Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Pérez, Christian Theobalt


In this work, we propose a novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image. To this end, we combine a convolutional encoder network with an expert-designed generative model that serves as the decoder. The core innovation is our new differentiable parametric decoder, which encapsulates image formation analytically based on a generative model. Our decoder takes as input a code vector with exactly defined semantic meaning that encodes detailed face pose, shape, expression, skin reflectance, and scene illumination. Due to this new way of combining CNN-based and model-based face reconstruction, the CNN-based encoder learns to extract semantically meaningful parameters from a single monocular input image. For the first time, a CNN encoder and an expert-designed generative model can be trained end-to-end in an unsupervised manner, which makes training on very large (unlabeled) real-world data feasible. The obtained reconstructions compare favorably to current state-of-the-art approaches in terms of quality and richness of representation.

* International Conference on Computer Vision (ICCV) 2017 (Oral), 13 pages 
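
A hypothetical sketch of the autoencoder layout: a CNN encoder regresses a semantic code that a decoder turns back into an image, trained with a self-supervised photometric loss; the toy linear decoder below merely stands in for the paper's analytic, expert-designed image-formation model, and all dimensions are assumptions.

# Model-based autoencoder layout with a toy decoder (illustrative only).
import torch
import torch.nn as nn

CODE = dict(pose=6, shape=80, expression=64, reflectance=80, illumination=27)
code_dim = sum(CODE.values())

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, code_dim))

class ToyDecoder(nn.Module):
    """Stand-in for the analytic renderer: maps the semantic code back to an image."""
    def __init__(self, res=64):
        super().__init__()
        self.res = res
        self.project = nn.Linear(code_dim, 3 * res * res)

    def forward(self, code):
        return self.project(code).view(-1, 3, self.res, self.res).sigmoid()

decoder = ToyDecoder()
img = torch.rand(4, 3, 64, 64)
code = encoder(img)
# Split the code into its semantically meaningful parts (pose, shape, expression, ...).
parts = dict(zip(CODE, code.split(list(CODE.values()), dim=1)))
recon = decoder(code)
photometric_loss = (recon - img).abs().mean()     # self-supervised: no labels needed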

Automatic Face Reenactment

Feb 08, 2016
Pablo Garrido, Levi Valgaerts, Ole Rehmsen, Thorsten Thormaehlen, Patrick Perez, Christian Theobalt


We propose an image-based facial reenactment system that replaces the face of an actor in an existing target video with the face of a user from a source video, while preserving the original target performance. Our system is fully automatic and does not require a database of source expressions. Instead, it produces convincing reenactment results from a short source video captured with an off-the-shelf camera, such as a webcam, in which the user performs arbitrary facial gestures. Our reenactment pipeline is conceived as part image retrieval and part face transfer: the image retrieval is based on temporal clustering of target frames and a novel image matching metric that combines appearance and motion to select candidate frames from the source video, while the face transfer uses a 2D warping strategy that preserves the user's identity. Our system excels in simplicity: it does not rely on a 3D face model, is robust under head motion, and does not require the source and target performances to be similar. We show convincing reenactment results for videos that we recorded ourselves and for low-quality footage taken from the Internet.

* Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (8 pages) 
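
A minimal sketch of the retrieval step, assuming generic appearance and motion descriptors and a simple weighted distance (the descriptors and weights are illustrative assumptions, not the paper's matching metric):

# Score source frames against a target frame by mixing appearance and motion distances.
import torch

def match_scores(target_app, target_mot, src_app, src_mot, w_app=1.0, w_mot=0.5):
    # *_app: appearance descriptors, *_mot: motion descriptors; src_* are (N, D)
    d_app = (src_app - target_app).norm(dim=1)
    d_mot = (src_mot - target_mot).norm(dim=1)
    return w_app * d_app + w_mot * d_mot

src_app, src_mot = torch.rand(500, 128), torch.rand(500, 16)
scores = match_scores(torch.rand(128), torch.rand(16), src_app, src_mot)
best = scores.topk(5, largest=False).indices      # 5 best candidate frames for face transfer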