Over the last several years, research on facial recognition based on Deep Neural Network has evolved with approaches like task-specific loss functions, image normalization and augmentation, network architectures, etc. However, there have been few approaches with attention to how human faces differ from person to person. Premising that inter-personal differences are found both generally and locally on the human face, I propose FusiformNet, a novel framework for feature extraction that leverages the nature of discriminative facial features. Tested on Image-Unrestricted setting of Labeled Face in the Wild benchmark, this method achieved a state-of-the-art accuracy of 96.67% without labeled outside data, image augmentation, normalization, or special loss functions. Likewise, the method also performed on par with previous state-of-the-arts when pre-trained on CASIA-WebFace dataset. Considering its ability to extract both general and local facial features, the utility of FusiformNet may not be limited to facial recognition but also extend to other DNN-based tasks.
Representations used for Facial Expression Recognition (FER) usually contain expression information along with identity features. In this paper, we propose a novel Disentangled Expression learning-Generative Adversarial Network (DE-GAN) which combines the concept of disentangled representation learning with residue learning to explicitly disentangle facial expression representation from identity information. In this method the facial expression representation is learned by reconstructing an expression image employing an encoder-decoder based generator. Unlike previous works using only expression residual learning for facial expression recognition, our method learns the disentangled expression representation along with the expressive component recorded by the encoder of DE-GAN. In order to improve the quality of synthesized expression images and the effectiveness of the learned disentangled expression representation, expression and identity classification is performed by the discriminator of DE-GAN. Experiments performed on widely used datasets (CK+, MMI, Oulu-CASIA) show that the proposed technique produces comparable or better results than state-of-the-art methods.
Each human face is unique. It has its own shape, topology, and distinguishing features. As such, developing and testing facial tracking systems are challenging tasks. The existing face recognition and tracking algorithms in Computer Vision mainly specify concrete situations according to particular goals and applications, requiring validation methodologies with data that fits their purposes. However, a database that covers all possible variations of external and factors does not exist, increasing researchers' work in acquiring their own data or compiling groups of databases. To address this shortcoming, we propose a methodology for facial data acquisition through definition of fundamental variables, such as subject characteristics, acquisition hardware, and performance parameters. Following this methodology, we also propose two protocols that allow the capturing of facial behaviors under uncontrolled and real-life situations. As validation, we executed both protocols which lead to creation of two sample databases: FdMiee (Facial database with Multi input, expressions, and environments) and FACIA (Facial Multimodal database driven by emotional induced acting). Using different types of hardware, FdMiee captures facial information under environmental and facial behaviors variations. FACIA is an extension of FdMiee introducing a pipeline to acquire additional facial behaviors and speech using an emotion-acting method. Therefore, this work eases the creation of adaptable database according to algorithm's requirements and applications, leading to simplified validation and testing processes.
Obtaining large, human labelled speech datasets to train models for emotion recognition is a notoriously challenging task, hindered by annotation cost and label ambiguity. In this work, we consider the task of learning embeddings for speech classification without access to any form of labelled audio. We base our approach on a simple hypothesis: that the emotional content of speech correlates with the facial expression of the speaker. By exploiting this relationship, we show that annotations of expression can be transferred from the visual domain (faces) to the speech domain (voices) through cross-modal distillation. We make the following contributions: (i) we develop a strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark; (ii) we use the teacher to train a student, tabula rasa, to learn representations (embeddings) for speech emotion recognition without access to labelled audio data; and (iii) we show that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets. Code, models and data are available.
Affect recognition based on subjects' facial expressions has been a topic of major research in the attempt to generate machines that can understand the way subjects feel, act and react. In the past, due to the unavailability of large amounts of data captured in real-life situations, research has mainly focused on controlled environments. However, recently, social media and platforms have been widely used. Moreover, deep learning has emerged as a means to solve visual analysis and recognition problems. This paper exploits these advances and presents significant contributions for affect analysis and recognition in-the-wild. Affect analysis and recognition can be seen as a dual knowledge generation problem, involving: i) creation of new, large and rich in-the-wild databases and ii) design and training of novel deep neural architectures that are able to analyse affect over these databases and to successfully generalise their performance on other datasets. The paper focuses on large in-the-wild databases, i.e., Aff-Wild and Aff-Wild2 and presents the design of two classes of deep neural networks trained with these databases. The first class refers to uni-task affect recognition, focusing on prediction of the valence and arousal dimensional variables. The second class refers to estimation of all main behavior tasks, i.e. valence-arousal prediction; categorical emotion classification in seven basic facial expressions; facial Action Unit detection. A novel multi-task and holistic framework is presented which is able to jointly learn and effectively generalize and perform affect recognition over all existing in-the-wild databases. Large experimental studies illustrate the achieved performance improvement over the existing state-of-the-art in affect recognition.
There are many facts affecting human face recognition, such as pose, occlusion, illumination, age, etc. First and foremost are large pose and occlusion problems, which can even result in more than 10% performance degradation. Pose-invariant feature representation and face frontalization with generative adversarial networks (GAN) have been widely used to solve the pose problem. However, the synthesis and recognition of occlusive but profile faces is still an uninvestigated problem. To address this issue, in this paper, we aim to contribute an effective solution on how to recognize occlusive but profile faces, even with facial keypoint region (e.g. eyes, nose, etc.) corrupted. Specifically, we propose a boosting Generative Adversarial Network (BoostGAN) for de-occlusion, frontalization, and recognition of faces. Upon the assumption that facial occlusion is partial and incomplete, multiple patch occluded images are fed as inputs for knowledge boosting, such as identity and texture information. A new aggregation structure composed of a deep GAN for coarse face synthesis and a shallow boosting net for fine face generation is further designed. Exhaustive experiments demonstrate that the proposed approach not only presents clear perceptual photo-realistic results but also shows state-of-the-art recognition performance for occlusive but profile faces.
Emotion recognition can provide crucial information about the user in many applications when building human-computer interaction (HCI) systems. Most of current researches on visual emotion recognition are focusing on exploring facial features. However, context information including surrounding environment and human body can also provide extra clues to recognize emotion more accurately. Inspired by "sequence to sequence model" for neural machine translation, which models input and output sequences by an encoder and a decoder in recurrent neural network (RNN) architecture respectively, a novel architecture, "CACA-RNN", is proposed in this work. The proposed network consists of two RNNs in a cascaded architecture to process both context and facial information to perform video emotion classification. Results of the model were submitted to video emotion recognition sub-challenge in Multimodal Emotion Recognition Challenge (MEC2017). CACA-RNN outperforms the MEC2017 baseline (mAP of 21.7%): it achieved mAP of 45.51% on the testing set in the video only challenge.
In this paper, we propose a new approach for facial expression recognition. The solution is based on the idea of encoding local and global Deep Convolutional Neural Network (DCNN) features extracted from still images, in compact local and global covariance descriptors. The space geometry of the covariance matrices is that of Symmetric Positive Definite (SPD) matrices. By performing the classification of static facial expressions using a valid Gaussian kernel on the SPD manifold and Support Vector Machine (SVM), we show that the covariance descriptors computed on DCNN features are more effective than the standard classification with fully connected layers and softmax. Besides, we propose a completely new and original solution to model the temporal dynamic of facial expressions as deep trajectories on the SPD manifold. As an extension of the classification pipeline of covariance descriptors, we apply SVM with valid positive definite kernels derived from global alignment for deep covariance trajectories classification. By conducting extensive experiments on the Oulu-CASIA, CK+, and SFEW datasets, we show that both the proposed static and dynamic approaches achieve state-of-the-art performance for facial expression recognition outperforming most recent approaches.
Facial expression recognition has many potential applications which has attracted the attention of researchers in the last decade. Feature extraction is one important step in expression analysis which contributes toward fast and accurate expression recognition. This paper represents an approach of combining the shape and appearance features to form a hybrid feature vector. We have extracted Pyramid of Histogram of Gradients (PHOG) as shape descriptors and Local Binary Patterns (LBP) as appearance features. The proposed framework involves a novel approach of extracting hybrid features from active facial patches. The active facial patches are located on the face regions which undergo a major change during different expressions. After detection of facial landmarks, the active patches are localized and hybrid features are calculated from these patches. The use of small parts of face instead of the whole face for extracting features reduces the computational cost and prevents the over-fitting of the features for classification. By using linear discriminant analysis, the dimensionality of the feature is reduced which is further classified by using the support vector machine (SVM). The experimental results on two publicly available databases show promising accuracy in recognizing all expression classes.
Most of the existing deep neural nets on automatic facial expression recognition focus on a set of predefined emotion classes, where the amount of training data has the biggest impact on performance. However, over-parameterised neural networks are not amenable for learning from few samples as they can quickly over-fit. In addition, these approaches do not have such a strong generalisation ability to identify a new category, where the data of each category is too limited and significant variations exist in the expression within the same semantic category. We embrace these challenges and formulate the problem as a low-shot learning, where once the base classifier is deployed, it must rapidly adapt to recognise novel classes using a few samples. In this paper, we revisit and compare existing few-shot learning methods for the low-shot facial expression recognition in terms of their generalisation ability via episode-training. In particular, we extend our analysis on the cross-domain generalisation, where training and test tasks are not drawn from the same distribution. We demonstrate the efficacy of low-shot learning methods through extensive experiments.