Stefanos Zafeiriou

Face Video Generation from a Single Image and Landmarks

Apr 25, 2019

Kritaphat Songsri-in, Stefanos Zafeiriou

Abstract: In this paper we are concerned with the challenging problem of producing a full image sequence of a deformable face given only an image and generic facial motions encoded by a set of sparse landmarks. To this end we build upon recent breakthroughs in image-to-image translation such as pix2pix, CycleGAN and StarGAN, which train Deep Convolutional Neural Networks (DCNNs) to map aligned pairs, or images between different domains (i.e., having different labels), and propose a new architecture which is no longer driven by labels but by spatial maps, namely facial landmarks. In particular, we propose MotionGAN, which transforms an input face image into a new one according to a heatmap of target landmarks. We show that it is possible to create very realistic face videos using a single image and a set of target landmarks. Furthermore, our method can be used to edit a facial image with arbitrary motions according to landmarks (e.g., expression, speech, etc.). This provides much more flexibility for face editing, expression transfer, facial video creation, etc. than models based on discrete expressions, audio or action units.
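To make "a heatmap of target landmarks" concrete, the following is a minimal sketch of one common way to turn sparse landmarks into a spatial map: place a Gaussian bump at each point. The function name `landmarks_to_heatmap`, the `sigma` value and the single-channel max-blending are illustrative assumptions, not details taken from the paper.

```python
import math


def landmarks_to_heatmap(landmarks, height, width, sigma=2.0):
    """Render (x, y) landmarks as one 2D map with a Gaussian bump per point.

    Hypothetical helper for illustration; the paper's exact encoding may differ.
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for lx, ly in landmarks:
        for y in range(height):
            for x in range(width):
                d2 = (x - lx) ** 2 + (y - ly) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                # Keep the strongest response where bumps overlap.
                if value > heatmap[y][x]:
                    heatmap[y][x] = value
    return heatmap


# Two toy landmarks on an 8 x 10 grid.
heatmap = landmarks_to_heatmap([(3, 2), (6, 5)], height=8, width=10)
```

A map like this has the same spatial layout as the input image, which is what allows it to be fed to a convolutional generator alongside the image rather than as a discrete label.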

Synthesising 3D Facial Motion from "In-the-Wild" Speech

Apr 15, 2019

Panagiotis Tzirakis, Athanasios Papaioannou, Alexander Lattas, Michail Tarasiou, Björn Schuller, Stefanos Zafeiriou

Abstract: Synthesising 3D facial motion from speech is a crucial problem manifesting in a multitude of applications such as computer games and movies. Recently proposed methods tackle this problem in controlled conditions of speech. In this paper, we introduce the first methodology for 3D facial motion synthesis from speech captured in arbitrary recording conditions ("in-the-wild") and independent of the speaker. For our purposes, we captured 4D sequences of people uttering 500 words contained in Lip Reading Words (LRW), a publicly available large-scale in-the-wild dataset, and built a set of 3D blendshapes appropriate for speech. We correlate the 3D shape parameters of the speech blendshapes to the LRW audio samples by means of a novel time-warping technique, named Deep Canonical Attentional Warping (DCAW), that can simultaneously learn hierarchical non-linear representations and a warping path in an end-to-end manner. We thoroughly evaluate our proposed methods and show that a deep learning model can synthesise 3D facial motion while handling different speakers and continuous speech signals in uncontrolled conditions.
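For readers unfamiliar with the notion of a "warping path", the sketch below shows classic dynamic time warping (DTW) on two scalar sequences. This is a swapped-in textbook baseline, not DCAW: DCAW, as described above, learns representations and the warping jointly, whereas this example uses a fixed absolute-difference cost. The function name `dtw` and the toy inputs are assumptions made for illustration.

```python
def dtw(a, b):
    """Return (total cost, warping path) aligning scalar sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover which samples were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# The second sequence repeats the middle sample; DTW absorbs the repeat.
total, path = dtw([0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0])
print(total, path)  # 0.0 [(0, 0), (1, 1), (1, 2), (2, 3)]
```

The path pairs each index of the first sequence with one or more indices of the second, which is the kind of alignment needed when audio frames and 3D shape parameters are sampled at different rates.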

Dense 3D Face Decoding over 2500FPS: Joint Texture & Shape Convolutional Mesh Decoders

Apr 06, 2019

Yuxiang Zhou, Jiankang Deng, Irene Kotsia, Stefanos Zafeiriou

Abstract: 3D Morphable Models (3DMMs) are statistical models that represent facial texture and shape variations using a set of linear bases and, more particularly, Principal Component Analysis (PCA). 3DMMs were used as statistical priors for reconstructing 3D faces from images by solving non-linear least-squares optimization problems. Recently, 3DMMs were used as generative models for training non-linear mappings (i.e., regressors) from images to the parameters of the models via Deep Convolutional Neural Networks (DCNNs). Nevertheless, all of the above methods use either fully connected layers or 2D convolutions on parametric unwrapped UV spaces, leading to large networks with many parameters. In this paper, we present the first, to the best of our knowledge, non-linear 3DMMs by learning joint texture and shape auto-encoders using direct mesh convolutions. We demonstrate how these auto-encoders can be used to train very light-weight models that perform Coloured Mesh Decoding (CMD) in-the-wild at a speed of over 2500 FPS.
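As background for the linear models the abstract contrasts itself with, here is a minimal sketch of how a linear (PCA-style) 3DMM produces a shape: the mean shape plus a weighted sum of basis vectors. The function name `synthesize_shape` and the toy two-vertex mesh are assumptions for illustration; real models have thousands of vertices and learned bases.

```python
def synthesize_shape(mean, basis, params):
    """Linear 3DMM: shape = mean + sum_k params[k] * basis[k].

    Vertices are flattened as [x0, y0, z0, x1, y1, z1, ...].
    """
    shape = list(mean)
    for weight, component in zip(params, basis):
        for idx, value in enumerate(component):
            shape[idx] += weight * value
    return shape


# Toy model: 2 vertices, 2 basis components (hypothetical values).
mean = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
basis = [
    [1.0, 0.0, 0.0, -1.0, 0.0, 0.0],  # moves the two vertices toward each other in x
    [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],   # lifts both vertices in y
]
shape = synthesize_shape(mean, basis, [0.5, 2.0])
print(shape)  # [0.5, 2.0, 0.0, 0.5, 2.0, 0.0]
```

Every shape this model can produce is a linear combination of the basis, which is the restriction that non-linear decoders such as the mesh-convolutional auto-encoders described above aim to lift.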

GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction

Apr 06, 2019

Baris Gecer, Stylianos Ploumpis, Irene Kotsia, Stefanos Zafeiriou

Abstract: In the past few years, a lot of work has been done towards reconstructing the 3D facial structure from single images by capitalizing on the power of Deep Convolutional Neural Networks (DCNNs). In the most recent works, differentiable renderers were employed in order to learn the relationship between the facial identity features and the parameters of a 3D morphable model for shape and texture. The texture features either correspond to components of a linear texture space or are learned by auto-encoders directly from in-the-wild images. In all cases, state-of-the-art methods are still not capable of reconstructing facial textures in high fidelity. In this paper, we take a radically different approach and harness the power of Generative Adversarial Networks (GANs) and DCNNs in order to reconstruct the facial texture and shape from single images. That is, we utilize GANs to train a very powerful generator of facial texture in UV space. Then, we revisit the original 3D Morphable Models (3DMMs) fitting approaches, making use of non-linear optimization to find the optimal latent parameters that best reconstruct the test image, but under a new perspective. We optimize the parameters with the supervision of pretrained deep identity features through our end-to-end differentiable framework. We demonstrate excellent results in photorealistic and identity-preserving 3D face reconstructions and achieve for the first time, to the best of our knowledge, facial texture reconstruction with high-frequency details.

* CVPR 2019 camera ready; Check project page: https://github.com/barisgecer/ganfit for full resolution results and more

MeshGAN: Non-linear 3D Morphable Models of Faces

Mar 25, 2019

Shiyang Cheng, Michael Bronstein, Yuxiang Zhou, Irene Kotsia, Maja Pantic, Stefanos Zafeiriou

Abstract: Generative Adversarial Networks (GANs) are currently the method of choice for generating visual data. Certain GAN architectures and training methods have demonstrated exceptional performance in generating realistic synthetic images (in particular, of human faces). However, for 3D objects, GANs still fall short of the success they have had with images. One reason is that so far GANs have been applied as 3D convolutional architectures to discrete volumetric representations of 3D objects. In this paper, we propose the first intrinsic GAN architecture operating directly on 3D meshes (named MeshGAN). Both quantitative and qualitative results are provided to show that MeshGAN can be used to generate high-fidelity 3D faces with rich identities and expressions.

Combining 3D Morphable Models: A Large scale Face-and-Head Model

Mar 09, 2019

Stylianos Ploumpis, Haoyang Wang, Nick Pears, William A. P. Smith, Stefanos Zafeiriou

Abstract: Three-dimensional Morphable Models (3DMMs) are powerful statistical tools for representing the 3D surfaces of an object class. In this context, we identify an interesting question that has previously not received research attention: is it possible to combine two or more 3DMMs that (a) are built using different templates that perhaps only partly overlap, (b) have different representation capabilities and (c) are built from different datasets that may not be publicly available? In answering this question, we make two contributions. First, we propose two methods for solving this problem: (i) use a regressor to complete missing parts of one model using the other; (ii) use the Gaussian Process framework to blend covariance matrices from multiple models. Second, as an example application of our approach, we build a new face-and-head shape model that combines the variability and facial detail of the LSFM with the full head modelling of the LYHM. The resulting combined shape model achieves state-of-the-art performance and outperforms existing head models by a large margin. Finally, as an application experiment, we reconstruct full head representations from single, unconstrained images by utilizing our proposed large-scale model in conjunction with the FaceWarehouse blendshapes for handling expressions.

* 9 pages, 8 figures. To appear in the Proceedings of Computer Vision and Pattern Recognition (CVPR), June 2019, Los Angeles, USA

Stacked Dense U-Nets with Dual Transformers for Robust Face Alignment

Dec 05, 2018

Jia Guo, Jiankang Deng, Niannan Xue, Stefanos Zafeiriou

Abstract: Facial landmark localisation in images captured in-the-wild is an important and challenging problem. The current state-of-the-art revolves around certain kinds of Deep Convolutional Neural Networks (DCNNs) such as stacked U-Nets and Hourglass networks. In this work, we propose stacked dense U-Nets for this task. We design a novel scale aggregation network topology and a channel aggregation building block to improve the model's capacity without increasing computational complexity or model size. With the assistance of deformable convolutions inside the stacked dense U-Nets and a coherent loss for outside data transformation, our model obtains the ability to be spatially invariant to arbitrary input face images. Extensive experiments on many in-the-wild datasets validate the robustness of the proposed method under extreme poses, exaggerated expressions and heavy occlusions. Finally, we show that accurate 3D face alignment can assist pose-invariant face recognition, where we achieve a new state-of-the-art accuracy on CFP-FP.

Generating faces for affect analysis

Nov 12, 2018

Dimitrios Kollias, Shiyang Cheng, Evangelos Ververas, Irene Kotsia, Stefanos Zafeiriou

Abstract: This paper presents a novel approach for synthesizing facial affect, either categorical, in terms of the six basic expressions (i.e., anger, disgust, fear, happiness, sadness and surprise), or dimensional, in terms of valence (i.e., how positive or negative an emotion is) and arousal (i.e., the power of the emotion's activation). In the Valence-Arousal case, a system is created based on VA annotation of 600,000 frames from the 4DFAB database; in the categorical case, the system is based on the selection of apex frames of posed expression sequences from the 4DFAB. The proposed system accepts as input: i) either the basic facial expression or the pair of valence-arousal emotional state descriptors to be synthesized, and ii) a neutral 2D image of a person on which the corresponding affect will be synthesized. The proposed approach consists of the following steps. First, based on the desired emotional state, a set of 3D facial meshes is selected from the 4DFAB database and is used to build a blendshape model that generates the new facial affect. To synthesize this affect on the 2D neutral image, 3D Morphable Model fitting is performed and the reconstructed face is then deformed to generate the target facial expressions. Finally, the new face is rendered into the original image. Qualitative experimental studies illustrate the generation of realistic images when the neutral image is sampled from a variety of well-known lab-controlled or in-the-wild databases, including Aff-Wild, RECOLA, AffectNet, AFEW, Multi-PIE, AFEW-VA, BU-3DFE, Bosphorus and RAF-DB. Also, quantitative experiments are conducted in which deep neural networks, trained using the generated images from each of the above databases in a data-augmentation framework, perform affect recognition; better performance is achieved through the presented approach when compared with the current state-of-the-art.

A Multi-Task Learning & Generation Framework: Valence-Arousal, Action Units & Primary Expressions

Nov 11, 2018

Dimitrios Kollias, Stefanos Zafeiriou

Abstract: Over the past few years many research efforts have been devoted to the field of affect analysis. Various approaches have been proposed for: i) discrete emotion recognition in terms of the primary facial expressions; ii) emotion analysis in terms of facial Action Units (AUs), assuming a fixed expression intensity; iii) dimensional emotion analysis, in terms of valence and arousal (VA). These approaches can only be effective if they are developed using large, appropriately annotated databases showing behaviors of people in-the-wild, i.e., in uncontrolled environments. Aff-Wild has been the first large-scale, in-the-wild database (including around 1,200,000 frames of 300 videos) annotated in terms of VA. In the vast majority of existing emotion databases, annotation is limited to either primary expressions, or valence-arousal, or action units. In this paper, we first annotate a part (around 234,000 frames) of the Aff-Wild database in terms of 8 AUs and another part (around 288,000 frames) in terms of the 7 basic emotion categories, so that parts of this database are annotated in terms of VA as well as AUs or primary expressions. Then, we set up and tackle multi-task learning for emotion recognition, as well as for facial image generation. Multi-task learning is performed using: i) a deep neural network with shared hidden layers, which learns emotional attributes by exploiting their inter-dependencies; ii) a discriminator of a generative adversarial network (GAN). Image generation, on the other hand, is implemented through the generator of the GAN. For these two tasks, we carefully design loss functions that fit the examined set-up. Experiments are presented which illustrate the good performance of the proposed approach when applied to the newly annotated parts of the Aff-Wild database.

Aff-Wild2: Extending the Aff-Wild Database for Affect Recognition

Nov 11, 2018

Dimitrios Kollias, Stefanos Zafeiriou

Abstract: Automatic understanding of human affect using visual signals is a problem that has attracted significant interest over the past 20 years. However, human emotional states are quite complex. To appraise such states displayed in real-world settings, we need expressive emotional descriptors that are capable of capturing and describing this complexity. The circumplex model of affect, which is described in terms of valence (i.e., how positive or negative an emotion is) and arousal (i.e., the power of the emotion's activation), can be used for this purpose. Recent progress in the emotion recognition domain has been achieved through the development of deep neural architectures and the availability of very large training databases. To this end, Aff-Wild has been the first large-scale "in-the-wild" database, containing around 1,200,000 frames. In this paper, we build upon this database, extending it with 260 more subjects and 1,413,000 new video frames. We call the union of Aff-Wild with the additional data Aff-Wild2. The videos are downloaded from YouTube and have large variations in pose, age, illumination conditions, ethnicity and profession. Both database-specific and cross-database experiments are performed in this paper, utilizing Aff-Wild2 along with the RECOLA database. The developed deep neural architectures are based on the joint training of state-of-the-art convolutional and recurrent neural networks with an attention mechanism, thus exploiting the invariant properties of convolutional features while modeling the temporal dynamics that arise in human behaviour via the recurrent layers. The obtained results show promise for the use of the extended Aff-Wild, as well as of the developed deep neural architectures, for visual analysis of human behaviour in terms of continuous emotion dimensions.
