Model pruning can enable the deployment of neural networks in resource-constrained environments. While pruning may have only a small effect on the overall performance of a model, it can exacerbate existing biases in the model such that subsets of samples see significantly degraded performance. In this paper, we introduce the performance weighted loss function, a simple modified cross-entropy loss function that can be used to limit the introduction of biases during pruning. Experiments using biased classifiers for facial classification and skin-lesion classification tasks demonstrate that the proposed method is a simple and effective tool that can enable existing pruning methods to be used in fairness-sensitive contexts.
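As a rough illustration of the idea of a modified cross-entropy loss, the sketch below scales each sample's cross-entropy term by a per-sample weight. The abstract does not give the weighting formula, so `performance_weights` and its `boost` parameter are hypothetical: they assume that samples the unpruned reference model handles poorly should count more during pruning. This is a sketch under that assumption, not the paper's method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def performance_weights(reference_probs, labels, boost=2.0):
    """Hypothetical weighting (an assumption, not the paper's formula).

    Samples on which the unpruned reference model assigns a low
    probability to the true class receive a larger weight.
    """
    return [1.0 + boost * (1.0 - probs[label])
            for probs, label in zip(reference_probs, labels)]


def weighted_cross_entropy(batch_logits, labels, weights):
    """Mean over the batch of weight * cross-entropy for each sample."""
    losses = []
    for logits, label, weight in zip(batch_logits, labels, weights):
        prob_true = softmax(logits)[label]
        losses.append(-weight * math.log(prob_true))
    return sum(losses) / len(losses)
```

With all weights set to 1.0, `weighted_cross_entropy` reduces to the ordinary mean cross-entropy, so the weights are the only place where the loss departs from the standard one.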
Seeing what is not in the image is one of the broader missions of computer vision. Image inpainting technology has made significant progress with the advent of deep learning. This paper proposes a method to tackle occlusion specific to human faces. Virtual presence is a promising direction for future communication and recreation. However, Virtual Reality (VR) headsets occlude a significant portion of the face, hindering the photo-realistic appearance of the face in the virtual world. State-of-the-art image inpainting methods for de-occluding the eye region do not give usable results. To this end, we propose a working solution to this problem that gives usable results, enabling the use of a real-time, photo-realistic de-occluded face of the user in VR settings.
In recent years, face recognition technologies have shown impressive recognition performance, mainly due to recent developments in deep convolutional neural networks. Notwithstanding those improvements, several challenges which affect the performance of face recognition systems remain. In this work, we investigate the impact that facial tattoos and paintings have on current face recognition systems. To this end, we first collected an appropriate database containing image pairs of individuals with and without facial tattoos or paintings. The assembled database was used to evaluate how facial tattoos and paintings affect the detection, quality estimation, and feature extraction and comparison modules of a face recognition system. The impact on these modules was evaluated using state-of-the-art open-source and commercial systems. The obtained results show that facial tattoos and paintings affect all the tested modules, especially for images where a large area of the face is covered with tattoos or paintings. Our work is an initial case study and indicates a need to design algorithms which are robust to the visual changes caused by facial tattoos and paintings.
Sarcasm is a form of irony that involves saying or writing the opposite of what one really means, often in a humorous or mocking way. It is often used to mock someone or something, or to be humorous or amusing. Sarcasm is usually conveyed through tone of voice, facial expressions, or other forms of nonverbal communication, but it can also be indicated by the use of certain words or phrases that are typically associated with irony or humor. Sarcasm detection is difficult because it relies on context and non-verbal cues. It can also be culturally specific, subjective and ambiguous. In this work, we fine-tune the RoBERTa-based sarcasm detection model presented in Abaskohi et al. [2022] to get to within 0.02 F1 of the state-of-the-art (Hercog et al. [2022]) on the iSarcasm dataset (Oprea and Magdy [2019]). This performance is achieved by augmenting iSarcasm with a pruned version of the Self Annotated Reddit Corpus (SARC) (Khodak et al. [2017]). Our pruned version is 100 times smaller than the subset of SARC used to train the state-of-the-art model.
Human head pose estimation is an essential problem in facial analysis, with many computer vision applications such as gaze estimation, virtual reality, and driver assistance. Because of the importance of the head pose estimation problem, it is necessary to design a compact model for this task in order to reduce the computational cost of deployment in facial analysis-based applications, such as large camera surveillance systems and AI cameras, while maintaining accuracy. In this work, we propose a lightweight model that effectively addresses the head pose estimation problem. Our approach has two main steps. 1) We first train many teacher models on the synthetic dataset 300W-LPA to obtain head pose pseudo labels. 2) We design an architecture with a ResNet18 backbone and train our proposed model on the ensemble of these pseudo labels via knowledge distillation. To evaluate the effectiveness of our model, we use AFLW-2000 and BIWI, two real-world head pose datasets. Experimental results show that our proposed model significantly improves accuracy in comparison with state-of-the-art head pose estimation methods. Furthermore, our model runs in real time at $\sim$300 FPS when performing inference on a Tesla V100.
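The two steps described above can be sketched in a framework-free form. The abstract does not say how the teachers' outputs are combined or which loss is used, so the sketch assumes a plain average of (yaw, pitch, roll) predictions and a mean squared error between student outputs and pseudo labels; the function names are hypothetical.

```python
def ensemble_pseudo_labels(teacher_predictions):
    """Average the (yaw, pitch, roll) predictions of several teachers.

    teacher_predictions[t][i] is teacher t's angle triple, in degrees,
    for sample i. Plain averaging is an assumption and ignores angle
    wrap-around, which is acceptable only for moderate head rotations.
    """
    n_teachers = len(teacher_predictions)
    n_samples = len(teacher_predictions[0])
    pseudo_labels = []
    for i in range(n_samples):
        averaged = tuple(
            sum(teacher_predictions[t][i][k] for t in range(n_teachers)) / n_teachers
            for k in range(3)
        )
        pseudo_labels.append(averaged)
    return pseudo_labels


def distillation_loss(student_predictions, pseudo_labels):
    """Mean squared error between student outputs and pseudo labels."""
    squared_errors = [
        (s - p) ** 2
        for student, pseudo in zip(student_predictions, pseudo_labels)
        for s, p in zip(student, pseudo)
    ]
    return sum(squared_errors) / len(squared_errors)
```

In a real training loop the student would be the ResNet18-based model and the loss would be minimized by gradient descent; here the functions only show how the pseudo labels and the training signal relate.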
Facial features deform according to the intended facial expression. Specific facial features are associated with specific facial expressions; e.g., a happy expression involves deformation of the mouth. This paper presents a study of facial feature deformation for each facial expression, using an optical flow algorithm with the face segmented into three different regions of interest. The deformation of facial features shows the relation between facial features and facial expression. Based on the experiments, the deformations of the eyes and mouth are significant in all expressions except happy. For the happy expression, the cheeks and mouth are the significant regions. This work also suggests that the intensity of different facial features varies in how it contributes to the recognition of different facial expression intensities. The maximum magnitude across all expressions is shown by the mouth for the surprise expression, at $9 \times 10^{-4}$, while the minimum magnitude is shown by the mouth for the angry expression, at $0.4 \times 10^{-4}$.
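A minimal sketch of the per-region measurement described above: given a dense optical flow field that has already been computed, average the flow magnitude inside each region of interest and rank the regions. The flow layout, the box format, and the region names are assumptions made for illustration; the abstract does not specify how the regions are delimited or how the magnitudes are aggregated.

```python
import math


def mean_flow_magnitude(flow, region):
    """Mean optical flow magnitude inside a rectangular region.

    flow[y][x] is a (dx, dy) displacement; region is
    (top, left, bottom, right) with bottom and right exclusive.
    """
    top, left, bottom, right = region
    magnitudes = [
        math.hypot(flow[y][x][0], flow[y][x][1])
        for y in range(top, bottom)
        for x in range(left, right)
    ]
    return sum(magnitudes) / len(magnitudes)


def rank_regions(flow, regions):
    """Return (name, mean magnitude) pairs, largest deformation first."""
    scores = {name: mean_flow_magnitude(flow, box) for name, box in regions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For example, calling `rank_regions(flow, {"eyes": (0, 0, 2, 4), "mouth": (2, 0, 4, 4)})` on a 4x4 flow field returns the two hypothetical regions ordered by how much they moved.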
Fetal alcohol syndrome (FAS) caused by prenatal alcohol exposure can result in a series of cranio-facial anomalies, as well as behavioral and neurocognitive problems. Current diagnosis of FAS is typically done by identifying a set of facial characteristics, which are often obtained by manual examination. Anatomical landmark detection, which provides rich geometric information, is important for detecting the presence of FAS-associated facial anomalies. This imaging application is characterized by large variations in data appearance and limited availability of labeled data. Current deep learning-based heatmap regression methods designed for facial landmark detection in natural images assume the availability of large datasets and are therefore not well suited for this application. To address this restriction, we develop a new regularized transfer learning approach that exploits the knowledge of a network learned on large facial recognition datasets. In contrast to standard transfer learning, which focuses on adjusting the pre-trained weights, the proposed learning approach regularizes the model behavior. It explicitly reuses the rich visual semantics of a domain-similar source model on the target task data as an additional supervisory signal for regularizing landmark detection optimization. Specifically, we develop four regularization constraints for the proposed transfer learning, including constraining the feature outputs from classification and intermediate layers, as well as matching activation attention maps at both the spatial and channel levels. Experimental evaluation on a collected clinical imaging dataset demonstrates that the proposed approach can effectively improve model generalizability under limited training samples, and compares favorably with other approaches in the literature.
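One of the four constraints, matching spatial attention maps between the target model and the source model, might look roughly like the sketch below. The abstract does not define the attention map or the distance used, so this assumes a common formulation from the attention transfer literature: sum squared activations over channels, L2-normalize, and penalize the squared difference. The function names are hypothetical.

```python
import math


def spatial_attention(feature_map):
    """Flattened, L2-normalized spatial attention map.

    feature_map[c][y][x] holds activations; the attention at each pixel
    is the sum of squared activations over channels (an assumption).
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    attention = [
        sum(feature_map[c][y][x] ** 2 for c in range(channels))
        for y in range(height)
        for x in range(width)
    ]
    norm = math.sqrt(sum(a * a for a in attention)) or 1.0
    return [a / norm for a in attention]


def attention_matching_loss(target_features, source_features):
    """Squared L2 distance between the two normalized attention maps."""
    target_map = spatial_attention(target_features)
    source_map = spatial_attention(source_features)
    return sum((t - s) ** 2 for t, s in zip(target_map, source_map))
```

Because the maps are normalized, the penalty depends on where the activations concentrate rather than on their overall scale; a term like this would be added to the landmark detection loss with a weighting coefficient.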
Facial editing is an important task in vision and graphics with numerous applications. However, existing works are unable to deliver a continuous and fine-grained editing mode (e.g., editing a slightly smiling face into a big laughing one) with natural interactions with users. In this work, we propose Talk-to-Edit, an interactive facial editing framework that performs fine-grained attribute manipulation through dialog between the user and the system. Our key insight is to model a continual "semantic field" in the GAN latent space. 1) Unlike previous works that regard editing as traversing straight lines in the latent space, here fine-grained editing is formulated as finding a curving trajectory that respects the fine-grained attribute landscape on the semantic field. 2) The curvature at each step is location-specific and determined by the input image as well as the user's language requests. 3) To engage users in a meaningful dialog, our system generates language feedback by considering both the user request and the current state of the semantic field. We also contribute CelebA-Dialog, a visual-language facial editing dataset to facilitate large-scale study. Specifically, each image has manually annotated fine-grained attribute labels as well as template-based textual descriptions in natural language. Extensive quantitative and qualitative experiments demonstrate the superiority of our framework in terms of 1) the smoothness of fine-grained editing, 2) identity/attribute preservation, and 3) visual photorealism and dialog fluency. Notably, a user study validates that our overall system is consistently favored by around 80% of the participants. Our project page is https://www.mmlab-ntu.com/project/talkedit/.
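The contrast between straight-line editing and a location-specific curving trajectory can be illustrated on a toy problem. The sketch below uses a made-up two-dimensional latent space and a made-up attribute score; it assumes the editing direction at each point is the normalized gradient of that score, which is only one plausible reading of a semantic field. Nothing here reproduces the actual Talk-to-Edit model or its dependence on language requests.

```python
import math


def attribute_score(z):
    """Toy attribute score on a 2-D latent code (purely illustrative)."""
    return z[0] - 0.5 * z[1] ** 2


def score_gradient(z):
    """Analytic gradient of the toy attribute score."""
    return (1.0, -z[1])


def unit(v):
    """Scale a 2-D vector to unit length (zero vector stays zero)."""
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm) if norm > 0 else (0.0, 0.0)


def straight_edit(z, n_steps, step_size):
    """Move along one fixed direction, chosen once at the start point."""
    d = unit(score_gradient(z))
    return (z[0] + n_steps * step_size * d[0],
            z[1] + n_steps * step_size * d[1])


def curved_edit(z, n_steps, step_size):
    """Re-evaluate the direction at every step, so the path can curve."""
    for _ in range(n_steps):
        d = unit(score_gradient(z))
        z = (z[0] + step_size * d[0], z[1] + step_size * d[1])
    return z
```

Starting from `(0.0, 2.0)` with ten steps of size 0.5, both paths have the same length, but the curved path ends at a higher toy score because it keeps following the local direction instead of overshooting along the initial one.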
Monocular 3D human performance capture is indispensable for many applications in computer graphics and vision for enabling immersive experiences. However, detailed capture of humans requires tracking of multiple aspects, including the skeletal pose, the dynamic surface, which includes clothing, hand gestures as well as facial expressions. No existing monocular method allows joint tracking of all these components. To this end, we propose HiFECap, a new neural human performance capture approach, which simultaneously captures human pose, clothing, facial expression, and hands just from a single RGB video. We demonstrate that our proposed network architecture, the carefully designed training strategy, and the tight integration of parametric face and hand models to a template mesh enable the capture of all these individual aspects. Importantly, our method also captures high-frequency details, such as deforming wrinkles on the clothes, better than the previous works. Furthermore, we show that HiFECap outperforms the state-of-the-art human performance capture approaches qualitatively and quantitatively while for the first time capturing all aspects of the human.
One of the key research areas in computer vision, addressed by a vast number of publications, is the processing and understanding of images containing human faces. The most often addressed tasks include face detection, facial landmark localization, face recognition and facial expression analysis. Other, more specialized tasks such as affective computing, the extraction of vital signs from videos or the analysis of social interaction usually require one or several of the aforementioned tasks to be performed. In our work, we show that a large number of facial image processing tasks in thermal infrared images that are currently solved using specialized rule-based methods, or not solved at all, can be addressed with modern learning-based approaches. We used the USTC-NVIE database to train a number of machine learning algorithms for facial landmark localization.