Recent advances in domain adaptation, especially those applied to heterogeneous facial recognition, typically rely upon restrictive Euclidean loss functions (e.g., $L_2$ norm) which perform best when images from two different domains (e.g., visible and thermal) are co-registered and temporally synchronized. This paper proposes a novel domain adaptation framework that combines a new feature mapping sub-network with existing deep feature models, which are based on modified network architectures (e.g., VGG16 or ResNet50). This framework is optimized by introducing new cross-domain identity and domain invariance loss functions for thermal-to-visible face recognition, which alleviates the requirement for precisely co-registered and synchronized imagery. We provide extensive analysis of both features and loss functions used, and compare the proposed domain adaptation framework with state-of-the-art feature-based domain adaptation models on a difficult dataset containing facial imagery collected at varying ranges, poses, and expressions. Moreover, we analyze the viability of the proposed framework for more challenging tasks, such as non-frontal thermal-to-visible face recognition.
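To illustrate why Euclidean losses favor co-registered imagery, the following minimal sketch compares a squared $L_2$ loss on an aligned and a shifted pair of signals. The one-dimensional "images" and the function name `squared_l2_loss` are hypothetical and chosen only for illustration; this is not the loss proposed in the abstract above.

```python
from typing import Sequence


def squared_l2_loss(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared element-wise differences between two equal-length signals."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical 1-D "images": one bright pixel on a dark background.
visible = [0.0, 0.0, 1.0, 0.0, 0.0]
thermal_aligned = [0.0, 0.0, 1.0, 0.0, 0.0]  # co-registered with `visible`
thermal_shifted = [0.0, 0.0, 0.0, 1.0, 0.0]  # same content, shifted by one pixel

loss_aligned = squared_l2_loss(visible, thermal_aligned)
loss_shifted = squared_l2_loss(visible, thermal_shifted)
print(loss_aligned, loss_shifted)
```

In this toy case the shifted pair incurs a nonzero loss even though the content is identical, which is the kind of sensitivity to misregistration that element-wise Euclidean losses exhibit.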
Facial expressions are an important way through which humans interact socially. Building a system capable of automatically recognizing facial expressions from images and video has been an intense field of study in recent years. Interpreting such expressions remains challenging, and much research is still needed on how they relate to human affect. This paper presents a general overview of automatic RGB, 3D, thermal and multimodal facial expression analysis. We define a new taxonomy for the field, encompassing all steps from face detection to facial expression recognition, and describe and classify the state-of-the-art methods accordingly. We also present the important datasets and the benchmarking of the most influential methods. We conclude with a general discussion about trends, important questions and future lines of research.
Face morphing attacks have been shown to be a serious threat to existing face recognition systems. Although a few face morphing detection methods have been put forward, restoring the facial image of the morphing accomplice remains a challenging problem. In this paper, a face-demorphing generative adversarial network (FD-GAN) is proposed to restore the accomplice's facial image. It utilizes a symmetric dual network architecture and two levels of restoration losses to separate the identity feature of the morphing accomplice. By exploiting the captured face image (containing the criminal's identity) from the face recognition system and the morphed image stored in the e-passport system (containing both the criminal's and the accomplice's identities), the FD-GAN can effectively restore the accomplice's facial image. Experimental results and analysis demonstrate the effectiveness of the proposed scheme. It has great potential to be implemented for detecting the face morphing accomplice in a real identity verification scenario.
In this paper, we consider the problem of real-time video-based facial emotion analytics, namely, facial expression recognition, prediction of valence and arousal and detection of action unit points. We propose a novel frame-level emotion recognition algorithm that extracts facial features with a single EfficientNet model pre-trained on AffectNet. As a result, our approach may be implemented even for video analytics on mobile devices. Experimental results for the large-scale Aff-Wild2 database from the third Affective Behavior Analysis in-the-wild (ABAW) Competition demonstrate that our simple model is significantly better than the VggFace baseline. In particular, our method is characterized by 0.15-0.2 higher performance measures for validation sets in uni-task Expression Classification, Valence-Arousal Estimation and Expression Classification. Owing to its simplicity, our approach may be considered as a new baseline for all four sub-challenges.
Emotional expressions are the behaviors that communicate our emotional state or attitude to others. They are expressed through verbal and non-verbal communication. Complex human behavior can be understood by studying physical features from multiple modalities, mainly facial, vocal and physical gestures. Recently, spontaneous multi-modal emotion recognition has been extensively studied for human behavior analysis. In this paper, we propose a new deep learning-based approach for audio-visual emotion recognition. Our approach leverages recent advances in deep learning such as knowledge distillation and high-performing deep architectures. The deep feature representations of the audio and visual modalities are fused based on a model-level fusion strategy. A recurrent neural network is then used to capture the temporal dynamics. Our proposed approach substantially outperforms state-of-the-art approaches in predicting valence on the RECOLA dataset. Moreover, our proposed visual facial expression feature extraction network outperforms state-of-the-art results on the AffectNet and Google Facial Expression Comparison datasets.
This paper presents a multi-appearance fusion of Principal Component Analysis (PCA) and a generalization of Linear Discriminant Analysis (LDA) for a multi-camera-view offline face recognition (verification) system. The generalization of LDA has been extended to establish correlations between the face classes in the transformed representation, and this is called the canonical covariate. The proposed system uses Gabor filter banks to characterize facial features by spatial frequency, spatial locality and orientation, in order to compensate for variations in face instances caused by changes in illumination, pose and facial expression. Convolving the Gabor filter bank with face images produces Gabor face representations with high-dimensional feature vectors. PCA and the canonical covariate are then applied to the Gabor face representations to reduce the high-dimensional feature spaces to low-dimensional Gabor eigenfaces and Gabor canonical faces. The reduced eigenface vector and canonical face vector are fused using a weighted mean fusion rule. Finally, support vector machines (SVMs) are trained with the augmented fused set of features and perform the recognition task. The system has been evaluated on the UMIST face database, which consists of multi-view faces. The experimental results demonstrate the efficiency and robustness of the proposed system for multi-view face images, with high recognition rates. A complexity analysis of the proposed system is also presented at the end of the experimental results.
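The weighted mean fusion step mentioned in the abstract above might look roughly like the following sketch. It assumes that the reduced eigenface vector and the canonical face vector have the same length; the function name `weighted_mean_fusion` and the weight values are hypothetical, since the abstract does not state how the weights are chosen.

```python
from typing import List, Sequence


def weighted_mean_fusion(
    eigen_vec: Sequence[float],
    canon_vec: Sequence[float],
    w_eigen: float = 0.5,
    w_canon: float = 0.5,
) -> List[float]:
    """Element-wise weighted mean of two equal-length feature vectors."""
    if len(eigen_vec) != len(canon_vec):
        raise ValueError("feature vectors must have the same length")
    total = w_eigen + w_canon
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [(w_eigen * e + w_canon * c) / total for e, c in zip(eigen_vec, canon_vec)]


# Hypothetical 3-dimensional features; the eigenface vector gets three times the weight.
fused = weighted_mean_fusion([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], w_eigen=3.0, w_canon=1.0)
print(fused)
```

Normalizing by the sum of the weights keeps the fused vector on the same scale as its inputs, so a downstream classifier such as an SVM sees comparable feature magnitudes regardless of the weights chosen.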
The limited annotated data available for recognizing facial expressions and action units hampers the training of deep networks, which can learn disentangled invariant features. However, a linear model with just a few parameters is normally not demanding in terms of training data. In this paper, we propose an elegant linear model to untangle confounding factors in challenging realistic multichannel signals such as 2D face videos. The simple yet powerful model does not rely on huge training data and is natural for recognizing facial actions without explicitly disentangling the identity. Previous attempts based on well-understood, intuitive linear models such as Sparse Representation based Classification (SRC) require a preprocessing step of explicit decoupling, which is inexact in practice. Instead, we exploit the low-rank property across frames to subtract the underlying neutral faces, which are modeled jointly with a sparse representation of the action components with group sparsity enforced. On the extended Cohn-Kanade dataset (CK+), our one-shot automatic method on raw face videos performs competitively with SRC applied to manually prepared action components and even outperforms SRC in terms of true positive rate. We apply the model to the even more challenging task of facial action unit recognition, verified on the MPI Face Video Database (MPI-VDB), achieving decent performance. All the programs and data have been made publicly available.
Objective functions for training deep networks for face-related recognition tasks, such as facial expression recognition (FER), usually consider each sample independently. In this work, we present a novel peak-piloted deep network (PPDN) that uses a sample with peak expression (easy sample) to supervise the intermediate feature responses for a sample of non-peak expression (hard sample) of the same type and from the same subject. The expression evolving process from non-peak expression to peak expression can thus be implicitly embedded in the network to achieve invariance to expression intensities. A special-purpose back-propagation procedure, peak gradient suppression (PGS), is proposed for network training. It drives the intermediate-layer feature responses of non-peak expression samples towards those of the corresponding peak expression samples, while avoiding the inverse. This avoids degrading the recognition capability for samples of peak expression due to interference from their non-peak expression counterparts. Extensive comparisons on two popular FER datasets, Oulu-CASIA and CK+, demonstrate the superiority of the PPDN over state-of-the-art FER methods, as well as the advantages of both the network structure and the optimization strategy. Moreover, it is shown that PPDN is a general architecture, extensible to other tasks by proper definition of peak and non-peak samples. This is validated by experiments that show state-of-the-art performance on pose-invariant face recognition, using the Multi-PIE dataset.
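The one-directional behavior described for peak gradient suppression can be sketched as follows. This is only an illustration under stated assumptions: the squared $L_2$ distance between feature vectors is an assumed loss form (the abstract does not specify it), and the function name `peak_piloted_gradients`, the feature values and the step size are hypothetical.

```python
from typing import List, Sequence, Tuple


def peak_piloted_gradients(
    non_peak: Sequence[float], peak: Sequence[float]
) -> Tuple[float, List[float], List[float]]:
    """Return the loss and the gradients for the non-peak and peak features.

    Assumed loss: squared L2 distance between the two feature vectors.
    The gradient for the peak features is suppressed (set to zero), so only
    the non-peak features are pulled towards the peak features.
    """
    if len(non_peak) != len(peak):
        raise ValueError("feature vectors must have the same length")
    diff = [n - p for n, p in zip(non_peak, peak)]
    loss = sum(d * d for d in diff)
    grad_non_peak = [2.0 * d for d in diff]
    grad_peak = [0.0] * len(peak)  # suppressed: the peak sample is not moved
    return loss, grad_non_peak, grad_peak


# Hypothetical 2-dimensional intermediate features.
non_peak_feat = [1.0, 0.0]
peak_feat = [3.0, 1.0]
loss, g_non_peak, g_peak = peak_piloted_gradients(non_peak_feat, peak_feat)

# One gradient-descent step with a hypothetical step size of 0.25.
updated_non_peak = [f - 0.25 * g for f, g in zip(non_peak_feat, g_non_peak)]
print(loss, g_non_peak, g_peak, updated_non_peak)
```

After the step, the non-peak features have moved towards the peak features while the peak features receive no update from this term, which mirrors the "while avoiding the inverse" behavior described above.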
Currently, many critical care indices are repetitively assessed and recorded by overburdened nurses, e.g. physical function or facial pain expressions of nonverbal patients. In addition, much essential information on patients and their environment is not captured at all, or is captured in a non-granular manner, e.g. sleep disturbance factors such as bright light, loud background noise, or excessive visitations. In this pilot study, we examined the feasibility of using pervasive sensing technology and artificial intelligence for autonomous and granular monitoring of critically ill patients and their environment in the Intensive Care Unit (ICU). As an exemplar prevalent condition, we also characterized delirious and non-delirious patients and their environment. We used wearable sensors, light and sound sensors, and a high-resolution camera to collect data on patients and their environment. We analyzed the collected data using deep learning and statistical analysis. Our system performed face detection, face recognition, facial action unit detection, head pose detection, facial expression recognition, posture recognition, actigraphy analysis, sound pressure and light level detection, and visitation frequency detection. We were able to detect patients' faces (mean average precision (mAP)=0.94), recognize patients' faces (mAP=0.80), and recognize their postures (F1=0.94). We also found that all facial expressions, 11 activity features, visitation frequency during the day, visitation frequency during the night, light levels, and sound pressure levels during the night were significantly different between delirious and non-delirious patients (p-value<0.05). In summary, we showed that granular and autonomous monitoring of critically ill patients and their environment is feasible and can be used for characterizing critical care conditions and related environmental factors.
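For readers unfamiliar with the F1 measure reported for posture recognition above, the following minimal sketch shows how it is computed from true positives, false positives and false negatives. The counts used here are hypothetical and are not taken from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        return 0.0  # convention used here when there are no positives at all
    return 2 * tp / denom


# Hypothetical counts: 8 correct detections, 2 false alarms, 2 misses.
example_f1 = f1_score(tp=8, fp=2, fn=2)
print(example_f1)
```

With these counts both precision and recall are 0.8, so their harmonic mean is 0.8 as well.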
Face recognition (FR) systems have a growing effect on critical decision-making processes. Recent works have shown that FR solutions show strong performance differences based on the user's demographics. However, to enable trustworthy FR technology, it is essential to know the influence of an extended range of facial attributes on FR beyond demographics. Therefore, in this work, we analyse FR bias over a wide range of attributes. We investigate the influence of 47 attributes on the verification performance of two popular FR models. The experiments were performed on the publicly available MAADFace attribute database with over 120M high-quality attribute annotations. To prevent misleading statements about biased performances, we introduced control-group-based validity values to decide whether unbalanced test data causes the performance differences. The results demonstrate that many non-demographic attributes also strongly affect the recognition performance, such as accessories, hairstyles and colors, face shapes, or facial anomalies. The observations of this work show the strong need for further advances in making FR systems more robust, explainable, and fair. Moreover, our findings might contribute to a better understanding of how FR networks work, help enhance the robustness of these networks, and support the development of more generalized bias-mitigating face recognition solutions.