Gaurav Bharaj

Implicit Neural Head Synthesis via Controllable Local Deformation Fields

Apr 21, 2023
Chuhan Chen, Matthew O'Toole, Gaurav Bharaj, Pablo Garrido

High-quality reconstruction of controllable 3D head avatars from 2D videos is highly desirable for virtual human applications in movies, games, and telepresence. Neural implicit fields provide a powerful representation to model 3D head avatars with personalized shape, expressions, and facial parts, e.g., hair and mouth interior, that go beyond the linear 3D morphable model (3DMM). However, existing methods do not model fine-scale facial features or offer local control of facial parts that extrapolates asymmetric expressions from monocular videos. Further, most condition only on 3DMM parameters with poor(er) locality, and resolve local features with a global neural field. We build on part-based implicit shape models that decompose a global deformation field into local ones. Our novel formulation models multiple implicit deformation fields with local semantic rig-like control via 3DMM-based parameters and representative facial landmarks. Further, we propose a local control loss and an attention mask mechanism that promote sparsity in each learned deformation field. Our formulation renders sharper, locally controllable nonlinear deformations than previous implicit monocular approaches, especially for the mouth interior, asymmetric expressions, and facial details.
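As a minimal sketch of the part-based idea described above -- several local deformation fields blended with soft attention masks into one global deformation -- the following PyTorch snippet uses hypothetical names (LocalDeformationField, blend_deformations) and assumed shapes; it is not the paper's released code:

```python
import torch
import torch.nn as nn

class LocalDeformationField(nn.Module):
    """Small MLP mapping a 3D point plus a per-part code (e.g. 3DMM-derived) to an offset."""
    def __init__(self, code_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, points: torch.Tensor, code: torch.Tensor) -> torch.Tensor:
        # points: (N, 3); code: (code_dim,) broadcast to every query point
        code = code.expand(points.shape[0], -1)
        return self.mlp(torch.cat([points, code], dim=-1))

def blend_deformations(points, fields, codes, masks):
    """Blend per-part offsets with soft attention masks so each field stays spatially local."""
    offsets = torch.stack([f(points, c) for f, c in zip(fields, codes)])  # (P, N, 3)
    weights = torch.softmax(masks, dim=0).unsqueeze(-1)                   # (P, N, 1)
    return points + (weights * offsets).sum(dim=0)
```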

* Accepted at CVPR 2023 

Few-shot Geometry-Aware Keypoint Localization

Mar 30, 2023
Xingzhe He, Gaurav Bharaj, David Ferman, Helge Rhodin, Pablo Garrido

Supervised keypoint localization methods rely on large manually labeled image datasets, where objects can deform, articulate, or occlude. However, creating such large keypoint labels is time-consuming and costly, and is often error-prone due to inconsistent labeling. Thus, we desire an approach that can learn keypoint localization with fewer yet consistently annotated images. To this end, we present a novel formulation that learns to localize semantically consistent keypoint definitions, even for occluded regions, for varying object categories. We use a few user-labeled 2D images as input examples, which are extended via self-supervision using a larger unlabeled dataset. Unlike unsupervised methods, the few-shot images act as semantic shape constraints for object localization. Furthermore, we introduce 3D geometry-aware constraints to uplift keypoints, achieving more accurate 2D localization. Our general-purpose formulation paves the way for semantically conditioned generative modeling and attains competitive or state-of-the-art accuracy on several datasets, including human faces, eyes, animals, cars, and a never-before-seen mouth interior (teeth) localization task not attempted by previous few-shot methods. Project page: https://xingzhehe.github.io/FewShot3DKP/
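One way to read the 3D geometry-aware constraint is as a reprojection-consistency term between predicted 2D keypoints and their uplifted 3D counterparts; the sketch below assumes a weak-perspective camera and hypothetical tensor shapes, and is only illustrative:

```python
import torch

def reprojection_consistency(kp2d, kp3d, camera):
    """Penalize disagreement between predicted 2D keypoints (B, K, 2) and the
    projection of uplifted 3D keypoints (B, K, 3) under a weak-perspective
    camera parameterized as (scale, tx, ty) per image."""
    scale, t = camera[..., :1], camera[..., 1:3]
    projected = scale.unsqueeze(-1) * kp3d[..., :2] + t.unsqueeze(-2)
    return torch.mean(torch.sum((projected - kp2d) ** 2, dim=-1))
```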

* CVPR 2023 (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition) 

ALAP-AE: As-Lite-as-Possible Auto-Encoder

Mar 19, 2022
Nisarg A. Shah, Gaurav Bharaj

We present a novel algorithm to reduce the tensor compute required by a conditional image generation autoencoder and make it as-lite-as-possible, without sacrificing the quality of photo-realistic image generation. Our method is device agnostic, and can optimize an autoencoder for a given CPU-only or GPU compute device in about the time it normally takes to train an autoencoder on a generic workstation. We achieve this via a novel two-stage strategy: first, we condense the channel weights so that as few channels as possible are used; then, we prune the nearly zeroed-out weight activations and fine-tune this lite autoencoder. To maintain image quality, fine-tuning is done via student-teacher training, where we reuse the condensed autoencoder as the teacher. We show performance gains for various conditional image generation tasks -- segmentation masks to face images, face image cartoonization, and a CycleGAN-based model on the horse-to-zebra dataset -- over multiple compute devices. We perform various ablation studies to justify the claims and design choices, and achieve real-time versions of various autoencoders on CPU-only devices while maintaining image quality, thus enabling at-scale deployment of such autoencoders.
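The pruning half of the two-stage strategy can be sketched as dropping output channels whose condensed weights are near zero; the function name and threshold below are illustrative assumptions, and a downstream layer's input channels would need matching pruning:

```python
import torch
import torch.nn as nn

def prune_near_zero_channels(conv: nn.Conv2d, threshold: float = 1e-3) -> nn.Conv2d:
    """Return a slimmer convolution keeping only output channels whose weight
    norm exceeds the threshold; surviving weights are copied over."""
    norms = conv.weight.detach().flatten(1).norm(dim=1)
    keep = torch.nonzero(norms > threshold).squeeze(1)
    slim = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                     conv.stride, conv.padding, bias=conv.bias is not None)
    slim.weight.data.copy_(conv.weight.data[keep])
    if conv.bias is not None:
        slim.bias.data.copy_(conv.bias.data[keep])
    return slim
```

Fine-tuning would then distill the condensed (teacher) autoencoder's outputs into the pruned (student) one, e.g. with a simple reconstruction term between their generated images.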


Multi-Domain Multi-Definition Landmark Localization for Small Datasets

Mar 19, 2022
David Ferman, Gaurav Bharaj

We present a novel method for multi-image-domain and multi-landmark-definition learning for facial localization on small datasets. Training on a small dataset alongside a larger one helps with robust learning for the former, and provides a universal mechanism for facial landmark localization for new and/or smaller standard datasets. To this end, we propose a Vision Transformer encoder with a novel decoder that carries a definition-agnostic, shared landmark semantic group structured prior, learnt as we train on more than one dataset concurrently. Due to this definition-agnostic group prior, the datasets may vary in landmark definitions and domains. During the decoder stage we use cross- and self-attention, whose output is later fed into domain/definition-specific heads that minimize a Laplacian log-likelihood loss. We achieve state-of-the-art performance on standard landmark localization datasets such as COFW and WFLW when trained alongside a bigger dataset. We also show state-of-the-art performance on several varied-image-domain small datasets for animals, caricatures, and facial portrait paintings. Further, we contribute a small dataset (150 images) of pareidolias to show the efficacy of our method. Finally, we provide several analyses and ablation studies to justify our claims.
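The Laplacian log-likelihood objective mentioned above corresponds, up to constants, to the negative log-likelihood of a Laplace distribution with a predicted per-landmark scale; a minimal sketch with assumed names, not the paper's code:

```python
import torch

def laplacian_nll(pred, target, log_b):
    """Negative log-likelihood of landmark coordinates under a Laplace distribution
    whose per-landmark scale b is predicted as log_b by the network head."""
    b = torch.exp(log_b)
    return torch.mean(torch.abs(pred - target) / b + log_b)
```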

* 16 

Generalized Spoofing Detection Inspired from Audio Generation Artifacts

Apr 08, 2021
Yang Gao, Tyler Vuong, Mahsa Elyasi, Gaurav Bharaj, Rita Singh

State-of-the-art methods for audio generation suffer from fingerprint artifacts and repeated inconsistencies across the temporal and spectral domains. Such artifacts can be well captured by frequency-domain analysis over the spectrogram. Thus, we propose a novel use of a long-range spectro-temporal modulation feature -- the 2D DCT over the log-Mel spectrogram -- for audio deepfake detection. We show that this feature works better than the log-Mel spectrogram, CQCC, MFCC, etc., as a candidate for capturing such artifacts. Along with this novel feature, we employ spectrum augmentation and feature normalization to decrease overfitting and bridge the gap between the training and test datasets. We developed a CNN-based baseline that achieved a 0.0849 t-DCF and outperformed the best single system reported in the ASVspoof 2019 challenge. Finally, by combining our baseline with our proposed 2D DCT spectro-temporal feature, we decrease the t-DCF score by 14% to 0.0737, making it one of the best systems for spoofing detection. Furthermore, we evaluate our model using two external datasets, showing the proposed feature's generalization ability. We also provide analysis and ablation studies for our proposed feature and results.
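The proposed feature is concrete enough to sketch: a 2D DCT applied over the log-Mel spectrogram. The snippet below uses librosa and SciPy with assumed settings (16 kHz audio, 80 Mel bands) that may differ from the paper's:

```python
import numpy as np
import librosa
from scipy.fft import dctn

def spectro_temporal_modulation(wav_path: str, n_mels: int = 80) -> np.ndarray:
    """Long-range spectro-temporal modulation feature: 2D DCT over the log-Mel spectrogram."""
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)
    return dctn(log_mel, norm="ortho")  # DCT along both the frequency and time axes
```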

* V0, Submitted to INTERSPEECH 2021 

Grapheme-to-Phoneme Transformer Model for Transfer Learning Dialects

Apr 08, 2021
Eric Engelhart, Mahsa Elyasi, Gaurav Bharaj

Grapheme-to-Phoneme (G2P) models convert words to their phonetic pronunciations. Classic G2P methods include rule-based systems and pronunciation dictionaries, while modern G2P systems incorporate learning, such as LSTM and Transformer-based attention models. Dictionary-based methods usually require significant manual effort to build and have limited adaptivity to unseen words, while Transformer-based models require significant training data and do not generalize well, especially for dialects with limited data. We propose a novel use of a Transformer-based attention model that can adapt to unseen dialects of English while using a small dictionary. We show that our method has potential applications for accent transfer in text-to-speech and for building robust G2P models for dialects with limited pronunciation dictionary size. We experiment with two English dialects: Indian and British. A model trained from scratch using 1000 words from the British English dictionary, with 14211 words held out, leads to a phoneme error rate (PER) of 26.877% on a test set generated using the full dictionary. The same model, pretrained on the CMUDict American English dictionary and fine-tuned on the same dataset, leads to a PER of 2.469% on the test set.
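For reference, the reported phoneme error rate (PER) is the usual edit distance between predicted and reference phoneme sequences, normalized by the reference length; a self-contained sketch:

```python
def phoneme_error_rate(predicted, reference):
    """Levenshtein distance between phoneme sequences, as a percentage of the reference length."""
    m, n = len(predicted), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == reference[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return 100.0 * dp[m][n] / max(n, 1)

# Example: one substitution out of three reference phonemes -> 33.3% PER
# phoneme_error_rate(["K", "AE", "T"], ["K", "AA", "T"])
```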

* 5 

Generative Landmarks

Apr 08, 2021
David Ferman, Gaurav Bharaj

We propose a general-purpose approach to detect landmarks with improved temporal consistency and personalization. Most sparse landmark detection methods rely on laborious, manually labelled landmarks, where inconsistency in annotations over a temporal volume leads to sub-optimal landmark learning. Further, high-quality landmarks with personalization are often hard to achieve. We pose landmark detection as an image translation problem. We capture two sets of unpaired videos: marked (with paint) and unmarked. We then use a generative adversarial network and cyclic consistency to predict deformations of landmark templates that simulate markers on unmarked images, until these images are indistinguishable from ground-truth marked images. Our novel method does not rely on manually labelled priors, is temporally consistent, and is image-class agnostic -- face and hand landmark detection examples are shown.
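A hedged sketch of the adversarial plus cycle-consistency objective implied above, with hypothetical generator/discriminator handles (G_mark, G_unmark, D_marked) and an assumed cycle weight:

```python
import torch
import torch.nn.functional as F

def landmark_translation_losses(unmarked, G_mark, G_unmark, D_marked, cycle_weight=10.0):
    """G_mark simulates painted markers on an unmarked frame; D_marked judges realism
    against real marked frames; the cycle term requires that removing the markers
    recovers the original frame."""
    fake_marked = G_mark(unmarked)
    logits = D_marked(fake_marked)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    cycle = F.l1_loss(G_unmark(fake_marked), unmarked)
    return adv + cycle_weight * cycle
```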

* 2 

Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features

Apr 08, 2021
Mahsa Elyasi, Gaurav Bharaj

Neural sequence-to-sequence text-to-speech synthesis (TTS), such as Tacotron-2, transforms text into high-quality speech. However, generating speech with natural prosody still remains a challenge. Yasuda et al. show that, unlike natural speech, Tacotron-2's encoder does not fully represent prosodic features (e.g. syllable stress in English) from characters, resulting in flat fundamental frequency variations. In this work, we propose a novel, carefully designed strategy for conditioning Tacotron-2 on two fundamental prosodic features in English -- syllable stress and pitch accent -- that help achieve more natural prosody. To this end, we use a classifier to learn these features in an end-to-end fashion, and apply feature conditioning at three parts of Tacotron-2's text-to-Mel-spectrogram pipeline: pre-encoder, post-encoder, and intra-decoder. Further, we show that jointly conditioning features at the pre-encoder and intra-decoder stages results in prosodically natural synthesized speech (vs. Tacotron-2), and allows the model to produce speech with more accurate pitch accent and stress patterns. Quantitative evaluations show that our formulation achieves higher fundamental frequency contour correlation and lower Mel cepstral distortion between synthesized and natural speech. And subjective evaluation shows that the proposed method's Mean Opinion Score of 4.14 fares higher than baseline Tacotron-2's 3.91, when compared against natural speech (LJSpeech corpus) at 4.28.
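As an illustration of one injection point (post-encoder), the sketch below concatenates embeddings of the two prosodic features onto encoder outputs; module and dimension names are assumptions, not Tacotron-2's actual interfaces:

```python
import torch
import torch.nn as nn

class ProsodyConditioning(nn.Module):
    """Concatenate syllable-stress and pitch-accent embeddings onto encoder outputs
    before they reach the attention/decoder stage."""
    def __init__(self, enc_dim: int, n_stress: int = 3, n_accent: int = 3, emb_dim: int = 16):
        super().__init__()
        self.stress_emb = nn.Embedding(n_stress, emb_dim)
        self.accent_emb = nn.Embedding(n_accent, emb_dim)
        self.proj = nn.Linear(enc_dim + 2 * emb_dim, enc_dim)

    def forward(self, encoder_out, stress_ids, accent_ids):
        # encoder_out: (B, T, enc_dim); stress_ids / accent_ids: (B, T) integer labels
        feats = torch.cat([encoder_out,
                           self.stress_emb(stress_ids),
                           self.accent_emb(accent_ids)], dim=-1)
        return self.proj(feats)
```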

* 5 

Practical Face Reconstruction via Differentiable Ray Tracing

Jan 13, 2021
Abdallah Dib, Gaurav Bharaj, Junghyun Ahn, Cédric Thébault, Philippe-Henri Gosselin, Marco Romeo, Louis Chevallier

We present a differentiable ray-tracing based novel face reconstruction approach where scene attributes -- 3D geometry, reflectance (diffuse, specular, and roughness), pose, camera parameters, and scene illumination -- are estimated from unconstrained monocular images. The proposed method models scene illumination via a novel, parameterized virtual light stage, which, in conjunction with differentiable ray-tracing, introduces a coarse-to-fine optimization formulation for face reconstruction. Our method not only handles unconstrained illumination and self-shadow conditions, but also estimates diffuse and specular albedos. To estimate the face attributes consistently and with practical semantics, a two-stage optimization strategy systematically uses a subset of parametric attributes, where subsequent attribute estimations factor in those previously estimated. For example, self-shadows estimated during the first stage later prevent their baking into the personalized diffuse and specular albedos in the second stage. We show the efficacy of our approach in several real-world scenarios, where face attributes can be estimated even under extreme illumination conditions. Ablation studies, analyses, and comparisons against several recent state-of-the-art methods show the improved accuracy and versatility of our approach. With consistent face attribute reconstruction, our method leads to several style -- illumination, albedo, self-shadow -- edit and transfer applications, as discussed in the paper.
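The two-stage, coarse-to-fine fitting can be sketched as optimizing disjoint subsets of scene attributes in sequence; the loss callable, parameter grouping, and iteration counts below are illustrative assumptions, not the paper's schedule:

```python
import torch

def two_stage_fit(params, render_loss, stage1_keys, stage2_keys, iters=(200, 400)):
    """Stage one fits a first subset of attributes (e.g. pose, illumination);
    stage two refines the remainder (e.g. albedos) while stage-one estimates,
    excluded from the optimizer, stay fixed."""
    for keys, n_steps in zip((stage1_keys, stage2_keys), iters):
        opt = torch.optim.Adam([params[k] for k in keys], lr=1e-2)
        for _ in range(n_steps):
            opt.zero_grad()
            loss = render_loss(params)  # photometric loss from the differentiable ray tracer
            loss.backward()
            opt.step()
    return params
```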

* 16 pages, 14 figures 

StyleRig: Rigging StyleGAN for 3D Control over Portrait Images

Mar 31, 2020
Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel, Patrick Pérez, Michael Zollhöfer, Christian Theobalt

StyleGAN generates photorealistic portrait images of faces with eyes, teeth, hair, and context (neck, shoulders, background), but lacks a rig-like control over semantic face parameters that are interpretable in 3D, such as face pose, expressions, and scene illumination. Three-dimensional morphable face models (3DMMs), on the other hand, offer control over the semantic parameters, but lack photorealism when rendered and only model the face interior, not other parts of a portrait image (hair, mouth interior, background). We present the first method to provide a face rig-like control over a pretrained and fixed StyleGAN via a 3DMM. A new rigging network, RigNet, is trained between the 3DMM's semantic parameters and StyleGAN's input. The network is trained in a self-supervised manner, without the need for manual annotations. At test time, our method generates portrait images with the photorealism of StyleGAN and provides explicit control over the 3D semantic parameters of the face.
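A hedged, self-contained stand-in for the rigging idea -- a small network that edits a StyleGAN latent code as a function of 3DMM parameters while the generator stays frozen; class and dimension names are illustrative, not the paper's RigNet implementation:

```python
import torch
import torch.nn as nn

class RigMapper(nn.Module):
    """Map a StyleGAN latent code plus 3DMM semantic parameters (pose, expression,
    illumination) to an edited latent code, predicting an offset so the identity
    encoded in w is preserved by default."""
    def __init__(self, param_dim: int, latent_dim: int = 512, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(param_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, w: torch.Tensor, params: torch.Tensor) -> torch.Tensor:
        return w + self.net(torch.cat([w, params], dim=-1))
```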

* CVPR 2020 (Oral). Project page: https://gvv.mpi-inf.mpg.de/projects/StyleRig/ 