Facial recognition is an AI-based technique for identifying or confirming an individual's identity using their face. It maps facial features from an image or video and then compares the information with a collection of known faces to find a match.




Multimodal emotion recognition plays a key role in many domains, including mental health monitoring, educational interaction, and human-computer interaction. However, existing methods often face three major challenges: unbalanced category distribution, the complexity of dynamic facial action unit time modeling, and the difficulty of feature fusion due to modal heterogeneity. With the explosive growth of multimodal data in social media scenarios, the need for building an efficient cross-modal fusion framework for emotion recognition is becoming increasingly urgent. To this end, this paper proposes Multimodal Cross-Attention Network and Contrastive Learning (MCN-CL) for multimodal emotion recognition. It uses a triple query mechanism and hard negative mining strategy to remove feature redundancy while preserving important emotional cues, effectively addressing the issues of modal heterogeneity and category imbalance. Experiment results on the IEMOCAP and MELD datasets show that our proposed method outperforms state-of-the-art approaches, with Weighted F1 scores improving by 3.42% and 5.73%, respectively.
We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. The system integrates a commercial facial expression recognition service to capture users' emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control; instead, it unobtrusively augments textual input with affective information for conversational and smoothness alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules. We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users' emotional signals are critical yet often opaque in verbal exchanges.
Compared to 2D data, the scale of point cloud data in different domains available for training, is quite limited. Researchers have been trying to combine these data of different domains for masked autoencoder (MAE) pre-training to leverage such a data scarcity issue. However, the prior knowledge learned from mixed domains may not align well with the downstream 3D point cloud analysis tasks, leading to degraded performance. To address such an issue, we propose the Domain-Adaptive Point Cloud Masked Autoencoder (DAP-MAE), an MAE pre-training method, to adaptively integrate the knowledge of cross-domain datasets for general point cloud analysis. In DAP-MAE, we design a heterogeneous domain adapter that utilizes an adaptation mode during pre-training, enabling the model to comprehensively learn information from point clouds across different domains, while employing a fusion mode in the fine-tuning to enhance point cloud features. Meanwhile, DAP-MAE incorporates a domain feature generator to guide the adaptation of point cloud features to various downstream tasks. With only one pre-training, DAP-MAE achieves excellent performance across four different point cloud analysis tasks, reaching 95.18% in object classification on ScanObjectNN and 88.45% in facial expression recognition on Bosphorus.




Micro expression recognition (MER) is crucial for inferring genuine emotion. Applying a multimodal large language model (MLLM) to this task enables spatio-temporal analysis of facial motion and provides interpretable descriptions. However, there are still two core challenges: (1) The entanglement of static appearance and dynamic motion cues prevents the model from focusing on subtle motion; (2) Textual labels in existing MER datasets do not fully correspond to underlying facial muscle movements, creating a semantic gap between text supervision and physical motion. To address these issues, we propose DEFT-LLM, which achieves motion semantic alignment by multi-expert disentanglement. We first introduce Uni-MER, a motion-driven instruction dataset designed to align text with local facial motion. Its construction leverages dual constraints from optical flow and Action Unit (AU) labels to ensure spatio-temporal consistency and reasonable correspondence to the movements. We then design an architecture with three experts to decouple facial dynamics into independent and interpretable representations (structure, dynamic textures, and motion-semantics). By integrating the instruction-aligned knowledge from Uni-MER into DEFT-LLM, our method injects effective physical priors for micro expressions while also leveraging the cross modal reasoning ability of large language models, thus enabling precise capture of subtle emotional cues. Experiments on multiple challenging MER benchmarks demonstrate state-of-the-art performance, as well as a particular advantage in interpretable modeling of local facial motion.




Facial micro-expressions, characterized by their subtle and brief nature, are valuable indicators of genuine emotions. Despite their significance in psychology, security, and behavioral analysis, micro-expression recognition remains challenging due to the difficulty of capturing subtle facial movements. Optical flow has been widely employed as an input modality for this task due to its effectiveness. However, most existing methods compute optical flow only between the onset and apex frames, thereby overlooking essential motion information in the apex-to-offset phase. To address this limitation, we first introduce a comprehensive motion representation, termed Magnitude-Modulated Combined Optical Flow (MM-COF), which integrates motion dynamics from both micro-expression phases into a unified descriptor suitable for direct use in recognition networks. Building upon this principle, we then propose FMANet, a novel end-to-end neural network architecture that internalizes the dual-phase analysis and magnitude modulation into learnable modules. This allows the network to adaptively fuse motion cues and focus on salient facial regions for classification. Experimental evaluations on the MMEW, SMIC, CASME-II, and SAMM datasets, widely recognized as standard benchmarks, demonstrate that our proposed MM-COF representation and FMANet outperforms existing methods, underscoring the potential of a learnable, dual-phase framework in advancing micro-expression recognition.
We introduce a novel method for Photo Dating which estimates the year a photograph was taken by leveraging information from the faces of people present in the image. To facilitate this research, we publicly release CSFD-1.6M, a new dataset containing over 1.6 million annotated faces, primarily from movie stills, with identity and birth year annotations. Uniquely, our dataset provides annotations for multiple individuals within a single image, enabling the study of multi-face information aggregation. We propose a probabilistic framework that formally combines visual evidence from modern face recognition and age estimation models, and career-based temporal priors to infer the photo capture year. Our experiments demonstrate that aggregating evidence from multiple faces consistently improves the performance and the approach significantly outperforms strong, scene-based baselines, particularly for images containing several identifiable individuals.
Face recognition systems are increasingly deployed across a wide range of applications, including smartphone authentication, access control, and border security. However, these systems remain vulnerable to presentation attacks (PAs), which can significantly compromise their reliability. In this work, we introduce a new dataset focused on a novel and realistic presentation attack instrument called Nylon Face Masks (NFMs), designed to simulate advanced 3D spoofing scenarios. NFMs are particularly concerning due to their elastic structure and photorealistic appearance, which enable them to closely mimic the victim's facial geometry when worn by an attacker. To reflect real-world smartphone-based usage conditions, we collected the dataset using an iPhone 11 Pro, capturing 3,760 bona fide samples from 100 subjects and 51,281 NFM attack samples across four distinct presentation scenarios involving both humans and mannequins. We benchmark the dataset using five state-of-the-art PAD methods to evaluate their robustness under unseen attack conditions. The results demonstrate significant performance variability across methods, highlighting the challenges posed by NFMs and underscoring the importance of developing PAD techniques that generalise effectively to emerging spoofing threats.
An efficient and data-driven encoding scheme is proposed to enhance the performance of variational quantum classifiers. This encoding is specially designed for complex datasets like images and seeks to help the classification task by producing input states that form well-separated clusters in the Hilbert space according to their classification labels. The encoding circuit is trained using a triplet loss function inspired by classical facial recognition algorithms, and class separability is measured via average trace distances between the encoded density matrices. Benchmark tests performed on various binary classification tasks on MNIST and MedMNIST datasets demonstrate considerable improvement over amplitude encoding with the same VQC structure while requiring a much lower circuit depth.




Digital beautification through social media filters has become increasingly popular, raising concerns about the reliability of facial images and videos and the effectiveness of automated face analysis. This issue is particularly critical for digital manipulation detectors, systems aiming at distinguishing between genuine and manipulated data, especially in cases involving deepfakes and morphing attacks designed to deceive humans and automated facial recognition. This study examines whether beauty filters impact the performance of deepfake and morphing attack detectors. We perform a comprehensive analysis, evaluating multiple state-of-the-art detectors on benchmark datasets before and after applying various smoothing filters. Our findings reveal performance degradation, highlighting vulnerabilities introduced by facial enhancements and underscoring the need for robust detection models resilient to such alterations.
Facial expression classification remains a challenging task due to the high dimensionality and inherent complexity of facial image data. This paper presents Hy-Facial, a hybrid feature extraction framework that integrates both deep learning and traditional image processing techniques, complemented by a systematic investigation of dimensionality reduction strategies. The proposed method fuses deep features extracted from the Visual Geometry Group 19-layer network (VGG19) with handcrafted local descriptors and the scale-invariant feature transform (SIFT) and Oriented FAST and Rotated BRIEF (ORB) algorithms, to obtain rich and diverse image representations. To mitigate feature redundancy and reduce computational complexity, we conduct a comprehensive evaluation of dimensionality reduction techniques and feature extraction. Among these, UMAP is identified as the most effective, preserving both local and global structures of the high-dimensional feature space. The Hy-Facial pipeline integrated VGG19, SIFT, and ORB for feature extraction, followed by K-means clustering and UMAP for dimensionality reduction, resulting in a classification accuracy of 83. 3\% in the facial expression recognition (FER) dataset. These findings underscore the pivotal role of dimensionality reduction not only as a pre-processing step but as an essential component in improving feature quality and overall classification performance.