Reliable facial expression recognition plays a critical role in human-machine interactions. However, most of the facial expression analysis methodologies proposed to date pay little or no attention to the protection of a user's privacy. In this paper, we propose a Privacy-Preserving Representation-Learning Variational Generative Adversarial Network (PPRL-VGAN) to learn an image representation that is explicitly disentangled from the identity information. At the same time, this representation is discriminative from the standpoint of facial expression recognition and generative as it allows expression-equivalent face image synthesis. We evaluate the proposed model on two public datasets under various threat scenarios. Quantitative and qualitative results demonstrate that our approach strikes a balance between the preservation of privacy and data utility. We further demonstrate that our model can be effectively applied to other tasks such as expression morphing and image completion.
Image recognition is an important topic in computer vision and image processing, and has been mainly addressed by supervised deep learning methods, which need a large set of labeled images to achieve promising performance. However, in most cases, labeled data are expensive or even impossible to obtain, while unlabeled data are readily available from numerous free on-line resources and have been exploited to improve the performance of deep neural networks. To better exploit the power of unlabeled data for image recognition, in this paper, we propose a semi-supervised and generative approach, namely the semi-supervised self-growing generative adversarial network (SGGAN). Label inference is a key step for the success of semi-supervised learning approaches. There are two main problems in label inference: how to measure the confidence of the unlabeled data and how to generalize the classifier. We address these two problems via the generative framework and a novel convolution-block-transformation technique, respectively. To stabilize and speed up the training process of SGGAN, we employ the metric Maximum Mean Discrepancy as the feature matching objective function and achieve larger gain than the standard semi-supervised GANs (SSGANs), narrowing the gap to the supervised methods. Experiments on several benchmark datasets show the effectiveness of the proposed SGGAN on image recognition and facial attribute recognition tasks. By using the training data with only 4% labeled facial attributes, the SGGAN approach can achieve comparable accuracy with leading supervised deep learning methods with all labeled facial attributes.
Automatic affect recognition using visual cues is an important task towards a complete interaction between humans and machines. Applications can be found in tutoring systems and human computer interaction. A critical step towards that direction is facial feature extraction. In this paper, we propose a facial feature extractor model trained on an in-the-wild and massively collected video dataset provided by the RealEyes company. The dataset consists of a million labelled frames and 2,616 thousand subjects. As temporal information is important to the emotion recognition domain, we utilise LSTM cells to capture the temporal dynamics in the data. To show the favourable properties of our pre-trained model on modelling facial affect, we use the RECOLA database, and compare with the current state-of-the-art approach. Our model provides the best results in terms of concordance correlation coefficient.
Light field (LF) cameras provide rich spatio-angular visual representations by sensing the visual scene from multiple perspectives and have recently emerged as a promising technology to boost the performance of human-machine systems such as biometrics and affective computing. Despite the significant success of LF representation for constrained facial image analysis, this technology has never been used for face and expression recognition in the wild. In this context, this paper proposes a new deep face and expression recognition solution, called CapsField, based on a convolutional neural network and an additional capsule network that utilizes dynamic routing to learn hierarchical relations between capsules. CapsField extracts the spatial features from facial images and learns the angular part-whole relations for a selected set of 2D sub-aperture images rendered from each LF image. To analyze the performance of the proposed solution in the wild, the first in the wild LF face dataset, along with a new complementary constrained face dataset captured from the same subjects recorded earlier have been captured and are made available. A subset of the in the wild dataset contains facial images with different expressions, annotated for usage in the context of face expression recognition tests. An extensive performance assessment study using the new datasets has been conducted for the proposed and relevant prior solutions, showing that the CapsField proposed solution achieves superior performance for both face and expression recognition tasks when compared to the state-of-the-art.
Despite being the appearance-based classifier of choice in recent years, relatively few works have examined how much convolutional neural networks (CNNs) can improve performance on accepted expression recognition benchmarks and, more importantly, examine what it is they actually learn. In this work, not only do we show that CNNs can achieve strong performance, but we also introduce an approach to decipher which portions of the face influence the CNN's predictions. First, we train a zero-bias CNN on facial expression data and achieve, to our knowledge, state-of-the-art performance on two expression recognition benchmarks: the extended Cohn-Kanade (CK+) dataset and the Toronto Face Dataset (TFD). We then qualitatively analyze the network by visualizing the spatial patterns that maximally excite different neurons in the convolutional layers and show how they resemble Facial Action Units (FAUs). Finally, we use the FAU labels provided in the CK+ dataset to verify that the FAUs observed in our filter visualizations indeed align with the subject's facial movements.
In the recent past, different researchers have proposed novel privacy-enhancing face recognition systems designed to conceal soft-biometric information at feature level. These works have reported impressive results, but usually do not consider specific attacks in their analysis of privacy protection. In most cases, the privacy protection capabilities of these schemes are tested through simple machine learning-based classifiers and visualisations of dimensionality reduction tools. In this work, we introduce an attack on feature level-based facial soft-biometric privacy-enhancement techniques. The attack is based on two observations: (1) to achieve high recognition accuracy, certain similarities between facial representations have to be retained in their privacy-enhanced versions; (2) highly similar facial representations usually originate from face images with similar soft-biometric attributes. Based on these observations, the proposed attack compares a privacy-enhanced face representation against a set of privacy-enhanced face representations with known soft-biometric attributes. Subsequently, the best obtained similarity scores are analysed to infer the unknown soft-biometric attributes of the attacked privacy-enhanced face representation. That is, the attack only requires a relatively small database of arbitrary face images and the privacy-enhancing face recognition algorithm as a black-box. In the experiments, the attack is applied to two representative approaches which have previously been reported to reliably conceal the gender in privacy-enhanced face representations. It is shown that the presented attack is able to circumvent the privacy enhancement to a considerable degree and is able to correctly classify gender with an accuracy of up to approximately 90% for both of the analysed privacy-enhancing face recognition systems.
Recognizing facial action units (AUs) from spontaneous facial expressions is still a challenging problem. Most recently, CNNs have shown promise on facial AU recognition. However, the learned CNNs are often overfitted and do not generalize well to unseen subjects due to limited AU-coded training images. We proposed a novel Incremental Boosting CNN (IB-CNN) to integrate boosting into the CNN via an incremental boosting layer that selects discriminative neurons from the lower layer and is incrementally updated on successive mini-batches. In addition, a novel loss function that accounts for errors from both the incremental boosted classifier and individual weak classifiers was proposed to fine-tune the IB-CNN. Experimental results on four benchmark AU databases have demonstrated that the IB-CNN yields significant improvement over the traditional CNN and the boosting CNN without incremental learning, as well as outperforming the state-of-the-art CNN-based methods in AU recognition. The improvement is more impressive for the AUs that have the lowest frequencies in the databases.
Facial Expression Recognition is a commercially important application, but one common limitation is that applications often require making predictions on out-of-sample distributions, where target images may have very different properties from the images that the model was trained on. How well, or badly, do these models do on unseen target domains? In this paper, we provide a systematic evaluation of domain adaptation in facial expression recognition. Using state-of-the-art transfer learning techniques and six commonly-used facial expression datasets (three collected in the lab and three "in-the-wild"), we conduct extensive round-robin experiments to examine the classification accuracies for a state-of-the-art CNN model. We also perform multi-source experiments where we examine a model's ability to transfer from multiple source datasets, including (i) within-setting (e.g., lab to lab), (ii) cross-setting (e.g., in-the-wild to lab), (iii) mixed-setting (e.g., lab and wild to lab) transfer learning experiments. We find sobering results that the accuracy of transfer learning is not high, and varies idiosyncratically with the target dataset, and to a lesser extent the source dataset. Generally, the best settings for transfer include fine-tuning the weights of a pre-trained model, and we find that training with more datasets, regardless of setting, improves transfer performance. We end with a discussion of the need for more -- and regular -- systematic investigations into the generalizability of FER models, especially for deployed applications.
We investigate the problem of facial expression recognition using 3D data. Building from one of the most successful frameworks for facial analysis using exclusively 3D geometry, we extend the analysis from a curve-based representation into a spectral representation, which allows a complete description of the underlying surface that can be further tuned to the desired level of detail. Spectral representations are based on the decomposition of the geometry in its spatial frequency components, much like a Fourier transform, which are related to intrinsic characteristics of the surface. In this work, we propose the use of Graph Laplacian Features (GLF), which results from the projection of local surface patches into a common basis obtained from the Graph Laplacian eigenspace. We test the proposed approach in the BU-3DFE database in terms of expressions and Action Units recognition. Our results confirm that the proposed GLF produces consistently higher recognition rates than the curves-based approach, thanks to a more complete description of the surface, while requiring a lower computational complexity. We also show that the GLF outperform the most popular alternative approach for spectral representation, Shape- DNA, which is based on the Laplace Beltrami Operator and cannot provide a stable basis that guarantee that the extracted signatures for the different patches are directly comparable.
Over the past few years, deep learning methods have shown remarkable results in many face-related tasks including automatic facial expression recognition (FER) in-the-wild. Meanwhile, numerous models describing the human emotional states have been proposed by the psychology community. However, we have no clear evidence as to which representation is more appropriate and the majority of FER systems use either the categorical or the dimensional model of affect. Inspired by recent work in multi-label classification, this paper proposes a novel multi-task learning (MTL) framework that exploits the dependencies between these two models using a Graph Convolutional Network (GCN) to recognize facial expressions in-the-wild. Specifically, a shared feature representation is learned for both discrete and continuous recognition in a MTL setting. Moreover, the facial expression classifiers and the valence-arousal regressors are learned through a GCN that explicitly captures the dependencies between them. To evaluate the performance of our method under real-world conditions we train our models on AffectNet dataset. The results of our experiments show that our method outperforms the current state-of-the-art methods on discrete FER.