With the recent world-wide COVID-19 pandemic, using face masks have become an important part of our lives. People are encouraged to cover their faces when in public area to avoid the spread of infection. The use of these face masks has raised a serious question on the accuracy of the facial recognition system used for tracking school/office attendance and to unlock phones. Many organizations use facial recognition as a means of authentication and have already developed the necessary datasets in-house to be able to deploy such a system. Unfortunately, masked faces make it difficult to be detected and recognized, thereby threatening to make the in-house datasets invalid and making such facial recognition systems inoperable. This paper addresses a methodology to use the current facial datasets by augmenting it with tools that enable masked faces to be recognized with low false-positive rates and high overall accuracy, without requiring the user dataset to be recreated by taking new pictures for authentication. We present an open-source tool, MaskTheFace to mask faces effectively creating a large dataset of masked faces. The dataset generated with this tool is then used towards training an effective facial recognition system with target accuracy for masked faces. We report an increase of 38% in the true positive rate for the Facenet system. We also test the accuracy of re-trained system on a custom real-world dataset MFR2 and report similar accuracy.
Face recognition in real-time scenarios is mainly affected by illumination, expression and pose variations and also by occlusion. This paper presents the framework for pose adaptive component-based face recognition system. The framework proposed deals with all the above mentioned issues. The steps involved in the presented framework are (i) facial landmark localisation, (ii) facial component extraction, (iii) pre-processing of facial image (iv) facial pose estimation (v) feature extraction using Local Binary Pattern Histograms of each component followed by (vi) fusion of pose adaptive classification of components. By employing pose adaptive classification, the recognition process is carried out on some part of database, based on estimated pose, instead of applying the recognition process on the whole database. Pre-processing techniques employed to overcome the problems due to illumination variation are also discussed in this paper. Component-based techniques provide better recognition rates when face images are occluded compared to the holistic methods. Our method is simple, feasible and provides better results when compared to other holistic methods.
Automatic facial action unit (AU) recognition has attracted great attention but still remains a challenging task, as subtle changes of local facial muscles are difficult to thoroughly capture. Most existing AU recognition approaches leverage geometry information in a straightforward 2D or 3D manner, which either ignore 3D manifold information or suffer from high computational costs. In this paper, we propose a novel geodesic guided convolution (GeoConv) for AU recognition by embedding 3D manifold information into 2D convolutions. Specifically, the kernel of GeoConv is weighted by our introduced geodesic weights, which are negatively correlated to geodesic distances on a coarsely reconstructed 3D face model. Moreover, based on GeoConv, we further develop an end-to-end trainable framework named GeoCNN for AU recognition. Extensive experiments on BP4D and DISFA benchmarks show that our approach significantly outperforms the state-of-the-art AU recognition methods.
Recognizing human emotion/expressions automatically is quite an expected ability for intelligent robotics, as it can promote better communication and cooperation with humans. Current deep-learning-based algorithms may achieve impressive performance in some lab-controlled environments, but they always fail to recognize the expressions accurately for the uncontrolled in-the-wild situation. Fortunately, facial action units (AU) describe subtle facial behaviors, and they can help distinguish uncertain and ambiguous expressions. In this work, we explore the correlations among the action units and facial expressions, and devise an AU-Expression Knowledge Constrained Representation Learning (AUE-CRL) framework to learn the AU representations without AU annotations and adaptively use representations to facilitate facial expression recognition. Specifically, it leverages AU-expression correlations to guide the learning of the AU classifiers, and thus it can obtain AU representations without incurring any AU annotations. Then, it introduces a knowledge-guided attention mechanism that mines useful AU representations under the constraint of AU-expression correlations. In this way, the framework can capture local discriminative and complementary features to enhance facial representation for facial expression recognition. We conduct experiments on the challenging uncontrolled datasets to demonstrate the superiority of the proposed framework over current state-of-the-art methods.
A person's face discloses important information about their affective state. Although there has been extensive research on recognition of facial expressions, the performance of existing approaches is challenged by facial occlusions. Facial occlusions are often treated as noise and discarded in recognition of affective states. However, hand over face occlusions can provide additional information for recognition of some affective states such as curiosity, frustration and boredom. One of the reasons that this problem has not gained attention is the lack of naturalistic occluded faces that contain hand over face occlusions as well as other types of occlusions. Traditional approaches for obtaining affective data are time demanding and expensive, which limits researchers in affective computing to work on small datasets. This limitation affects the generalizability of models and deprives researchers from taking advantage of recent advances in deep learning that have shown great success in many fields but require large volumes of data. In this paper, we first introduce a novel framework for synthesizing naturalistic facial occlusions from an initial dataset of non-occluded faces and separate images of hands, reducing the costly process of data collection and annotation. We then propose a model for facial occlusion type recognition to differentiate between hand over face occlusions and other types of occlusions such as scarves, hair, glasses and objects. Finally, we present a model to localize hand over face occlusions and identify the occluded regions of the face.
Extensive efforts have been devoted to recognizing facial action units (AUs). However, it is still challenging to recognize AUs from spontaneous facial displays especially when they are accompanied with speech. Different from all prior work that utilized visual observations for facial AU recognition, this paper presents a novel approach that recognizes speech-related AUs exclusively from audio signals based on the fact that facial activities are highly correlated with voice during speech. Specifically, dynamic and physiological relationships between AUs and phonemes are modeled through a continuous time Bayesian network (CTBN); then AU recognition is performed by probabilistic inference via the CTBN model. A pilot audiovisual AU-coded database has been constructed to evaluate the proposed audio-based AU recognition framework. The database consists of a "clean" subset with frontal and neutral faces and a challenging subset collected with large head movements and occlusions. Experimental results on this database show that the proposed CTBN model achieves promising recognition performance for 7 speech-related AUs and outperforms the state-of-the-art visual-based methods especially for those AUs that are activated at low intensities or "hardly visible" in the visual channel. Furthermore, the CTBN model yields more impressive recognition performance on the challenging subset, where the visual-based approaches suffer significantly.
The representation used for Facial Expression Recognition (FER) usually contain expression information along with other variations such as identity and illumination. In this paper, we propose a novel Disentangled Expression learning-Generative Adversarial Network (DE-GAN) to explicitly disentangle facial expression representation from identity information. In this learning by reconstruction method, facial expression representation is learned by reconstructing an expression image employing an encoder-decoder based generator. This expression representation is disentangled from identity component by explicitly providing the identity code to the decoder part of DE-GAN. The process of expression image reconstruction and disentangled expression representation learning is improved by performing expression and identity classification in the discriminator of DE-GAN. The disentangled facial expression representation is then used for facial expression recognition employing simple classifiers like SVM or MLP. The experiments are performed on publicly available and widely used face expression databases (CK+, MMI, Oulu-CASIA). The experimental results show that the proposed technique produces comparable results with state-of-the-art methods.
Face recognition system is one of the esteemed research areas in pattern recognition and computer vision as long as its major challenges. A few challenges in recognizing faces are blur, illumination, and varied expressions. Blur is natural while taking photographs using cameras, mobile phones, etc. Blur can be uniform and non-uniform. Usually non-uniform blur happens in images taken using handheld image devices. Distinguishing or handling a blurred image in a face recognition system is generally tough. Under varying lighting conditions, it is challenging to identify the person correctly. Diversified facial expressions such as happiness, sad, surprise, fear, anger changes or deforms the faces from normal images. Identifying faces with facial expressions is also a challenging task, due to the deformation caused by the facial expressions. To solve these issues, a pre-processing step was carried out after which Blur and Illumination-Robust Face recognition (BIRFR) algorithm was performed. The test image and training images with facial expression are transformed to neutral face using Facial expression removal (FER) peration. Every training image is transformed based on the optimal Transformation Spread Function (TSF), and illumination coefficients. Local Binary Pattern (LBP) features extracted from test image and transformed training image is used for classification.
Explainable face recognition is the problem of explaining why a facial matcher matches faces. In this paper, we provide the first comprehensive benchmark and baseline evaluation for explainable face recognition. We define a new evaluation protocol called the ``inpainting game'', which is a curated set of 3648 triplets (probe, mate, nonmate) of 95 subjects, which differ by synthetically inpainting a chosen facial characteristic like the nose, eyebrows or mouth creating an inpainted nonmate. An explainable face matcher is tasked with generating a network attention map which best explains which regions in a probe image match with a mated image, and not with an inpainted nonmate for each triplet. This provides ground truth for quantifying what image regions contribute to face matching. Furthermore, we provide a comprehensive benchmark on this dataset comparing five state of the art methods for network attention in face recognition on three facial matchers. This benchmark includes two new algorithms for network attention called subtree EBP and Density-based Input Sampling for Explanation (DISE) which outperform the state of the art by a wide margin. Finally, we show qualitative visualization of these network attention techniques on novel images, and explore how these explainable face recognition models can improve transparency and trust for facial matchers.
Feature description is one of the most frequently studied areas in the expert systems and machine learning. Effective encoding of the images is an essential requirement for accurate matching. These encoding schemes play a significant role in recognition and retrieval systems. Facial recognition systems should be effective enough to accurately recognize individuals under intrinsic and extrinsic variations of the system. The templates or descriptors used in these systems encode spatial relationships of the pixels in the local neighbourhood of an image. Features encoded using these hand crafted descriptors should be robust against variations such as; illumination, background, poses, and expressions. In this paper a novel hand crafted cascaded asymmetric local pattern (CALP) is proposed for retrieval and recognition facial image. The proposed descriptor uniquely encodes relationship amongst the neighbouring pixels in horizontal and vertical directions. The proposed encoding scheme has optimum feature length and shows significant improvement in accuracy under environmental and physiological changes in a facial image. State of the art hand crafted descriptors namely; LBP, LDGP, CSLBP, SLBP and CSLTP are compared with the proposed descriptor on most challenging datasets namely; Caltech-face, LFW, and CASIA-face-v5. Result analysis shows that, the proposed descriptor outperforms state of the art under uncontrolled variations in expressions, background, pose and illumination.