Facial recognition is an AI-based technique for identifying or confirming an individual's identity using their face. It maps facial features from an image or video and then compares the information with a collection of known faces to find a match.
Adversarial examples have revealed the vulnerability of deep learning models and raised serious concerns about information security. The transfer-based attack is a hot topic in black-box attacks that are practical to real-world scenarios where the training datasets, parameters, and structure of the target model are unknown to the attacker. However, few methods consider the particularity of class-specific deep models for fine-grained vision tasks, such as face recognition (FR), giving rise to unsatisfactory attacking performance. In this work, we first investigate what in a face exactly contributes to the embedding learning of FR models and find that both decisive and auxiliary facial features are specific to each FR model, which is quite different from the biological mechanism of human visual system. Accordingly we then propose a novel attack method named Attention-aggregated Attack (AAA) to enhance the transferability of adversarial examples against FR, which is inspired by the attention divergence and aims to destroy the facial features that are critical for the decision-making of other FR models by imitating their attentions on the clean face images. Extensive experiments conducted on various FR models validate the superiority and robust effectiveness of the proposed method over existing methods.
Electroencephalography (EEG) signals provide a promising and involuntary reflection of brain activity related to emotional states, offering significant advantages over behavioral cues like facial expressions. However, EEG signals are often noisy, affected by artifacts, and vary across individuals, complicating emotion recognition. While multimodal approaches have used Peripheral Physiological Signals (PPS) like GSR to complement EEG, they often overlook the dynamic synchronization and consistent semantics between the modalities. Additionally, the temporal dynamics of emotional fluctuations across different time resolutions in PPS remain underexplored. To address these challenges, we propose PhysioSync, a novel pre-training framework leveraging temporal and cross-modal contrastive learning, inspired by physiological synchronization phenomena. PhysioSync incorporates Cross-Modal Consistency Alignment (CM-CA) to model dynamic relationships between EEG and complementary PPS, enabling emotion-related synchronizations across modalities. Besides, it introduces Long- and Short-Term Temporal Contrastive Learning (LS-TCL) to capture emotional synchronization at different temporal resolutions within modalities. After pre-training, cross-resolution and cross-modal features are hierarchically fused and fine-tuned to enhance emotion recognition. Experiments on DEAP and DREAMER datasets demonstrate PhysioSync's advanced performance under uni-modal and cross-modal conditions, highlighting its effectiveness for EEG-centered emotion recognition.




Facial landmark tracking plays a vital role in applications such as facial recognition, expression analysis, and medical diagnostics. In this paper, we consider the performance of the Extended Kalman Filter (EKF) and Unscented Kalman Filter (UKF) in tracking 3D facial motion in both deterministic and stochastic settings. We first analyze a noise-free environment where the state transition is purely deterministic, demonstrating that UKF outperforms EKF by achieving lower mean squared error (MSE) due to its ability to capture higher-order nonlinearities. However, when stochastic noise is introduced, EKF exhibits superior robustness, maintaining lower mean square error (MSE) compared to UKF, which becomes more sensitive to measurement noise and occlusions. Our results highlight that UKF is preferable for high-precision applications in controlled environments, whereas EKF is better suited for real-world scenarios with unpredictable noise. These findings provide practical insights for selecting the appropriate filtering technique in 3D facial tracking applications, such as motion capture and facial recognition.
Understanding and predicting human behavior has emerged as a core capability in various AI application domains such as autonomous driving, smart healthcare, surveillance systems, and social robotics. This paper defines the technical framework of Artificial Behavior Intelligence (ABI), which comprehensively analyzes and interprets human posture, facial expressions, emotions, behavioral sequences, and contextual cues. It details the essential components of ABI, including pose estimation, face and emotion recognition, sequential behavior analysis, and context-aware modeling. Furthermore, we highlight the transformative potential of recent advances in large-scale pretrained models, such as large language models (LLMs), vision foundation models, and multimodal integration models, in significantly improving the accuracy and interpretability of behavior recognition. Our research team has a strong interest in the ABI domain and is actively conducting research, particularly focusing on the development of intelligent lightweight models capable of efficiently inferring complex human behaviors. This paper identifies several technical challenges that must be addressed to deploy ABI in real-world applications including learning behavioral intelligence from limited data, quantifying uncertainty in complex behavior prediction, and optimizing model structures for low-power, real-time inference. To tackle these challenges, our team is exploring various optimization strategies including lightweight transformers, graph-based recognition architectures, energy-aware loss functions, and multimodal knowledge distillation, while validating their applicability in real-time environments.
Unlike spoken languages where the use of prosodic features to convey emotion is well studied, indicators of emotion in sign language remain poorly understood, creating communication barriers in critical settings. Sign languages present unique challenges as facial expressions and hand movements simultaneously serve both grammatical and emotional functions. To address this gap, we introduce EmoSign, the first sign video dataset containing sentiment and emotion labels for 200 American Sign Language (ASL) videos. We also collect open-ended descriptions of emotion cues. Annotations were done by 3 Deaf ASL signers with professional interpretation experience. Alongside the annotations, we include baseline models for sentiment and emotion classification. This dataset not only addresses a critical gap in existing sign language research but also establishes a new benchmark for understanding model capabilities in multimodal emotion recognition for sign languages. The dataset is made available at https://huggingface.co/datasets/catfang/emosign.

This paper explores the use of partially homomorphic encryption (PHE) for encrypted vector similarity search, with a focus on facial recognition and broader applications like reverse image search, recommendation engines, and large language models (LLMs). While fully homomorphic encryption (FHE) exists, we demonstrate that encrypted cosine similarity can be computed using PHE, offering a more practical alternative. Since PHE does not directly support cosine similarity, we propose a method that normalizes vectors in advance, enabling dot product calculations as a proxy. We also apply min-max normalization to handle negative dimension values. Experiments on the Labeled Faces in the Wild (LFW) dataset use DeepFace's FaceNet128d, FaceNet512d, and VGG-Face (4096d) models in a two-tower setup. Pre-encrypted embeddings are stored in one tower, while an edge device captures images, computes embeddings, and performs encrypted-plaintext dot products via additively homomorphic encryption. We implement this with LightPHE, evaluating Paillier, Damgard-Jurik, and Okamoto-Uchiyama schemes, excluding others due to performance or decryption complexity. Tests at 80-bit and 112-bit security (NIST-secure until 2030) compare PHE against FHE (via TenSEAL), analyzing encryption, decryption, operation time, cosine similarity loss, key/ciphertext sizes. Results show PHE is less computationally intensive, faster, and produces smaller ciphertexts/keys, making it well-suited for memory-constrained environments and real-world privacy-preserving encrypted similarity search.
Facial appearance editing is crucial for digital avatars, AR/VR, and personalized content creation, driving realistic user experiences. However, preserving identity with generative models is challenging, especially in scenarios with limited data availability. Traditional methods often require multiple images and still struggle with unnatural face shifts, inconsistent hair alignment, or excessive smoothing effects. To overcome these challenges, we introduce a novel diffusion-based framework, InstaFace, to generate realistic images while preserving identity using only a single image. Central to InstaFace, we introduce an efficient guidance network that harnesses 3D perspectives by integrating multiple 3DMM-based conditionals without introducing additional trainable parameters. Moreover, to ensure maximum identity retention as well as preservation of background, hair, and other contextual features like accessories, we introduce a novel module that utilizes feature embeddings from a facial recognition model and a pre-trained vision-language model. Quantitative evaluations demonstrate that our method outperforms several state-of-the-art approaches in terms of identity preservation, photorealism, and effective control of pose, expression, and lighting.




Facial expressions play a crucial role in human communication serving as a powerful and impactful means to express a wide range of emotions. With advancements in artificial intelligence and computer vision, deep neural networks have emerged as effective tools for facial emotion recognition. In this paper, we propose EmoNeXt, a novel deep learning framework for facial expression recognition based on an adapted ConvNeXt architecture network. We integrate a Spatial Transformer Network (STN) to focus on feature-rich regions of the face and Squeeze-and-Excitation blocks to capture channel-wise dependencies. Moreover, we introduce a self-attention regularization term, encouraging the model to generate compact feature vectors. We demonstrate the superiority of our model over existing state-of-the-art deep learning models on the FER2013 dataset regarding emotion classification accuracy.
Sign language is a fundamental means of communication for the deaf and hard-of-hearing (DHH) community, enabling nuanced expression through gestures, facial expressions, and body movements. Despite its critical role in facilitating interaction within the DHH population, significant barriers persist due to the limited fluency in sign language among the hearing population. Overcoming this communication gap through automatic sign language recognition (SLR) remains a challenge, particularly at a dynamic word-level, where temporal and spatial dependencies must be effectively recognized. While Convolutional Neural Networks have shown potential in SLR, they are computationally intensive and have difficulties in capturing global temporal dependencies between video sequences. To address these limitations, we propose a Video Vision Transformer (ViViT) model for word-level American Sign Language (ASL) recognition. Transformer models make use of self-attention mechanisms to effectively capture global relationships across spatial and temporal dimensions, which makes them suitable for complex gesture recognition tasks. The VideoMAE model achieves a Top-1 accuracy of 75.58% on the WLASL100 dataset, highlighting its strong performance compared to traditional CNNs with 65.89%. Our study demonstrates that transformer-based architectures have great potential to advance SLR, overcome communication barriers and promote the inclusion of DHH individuals.




Visual emotion analysis or recognition has gained considerable attention due to the growing interest in understanding how images can convey rich semantics and evoke emotions in human perception. However, visual emotion analysis poses distinctive challenges compared to traditional vision tasks, especially due to the intricate relationship between general visual features and the different affective states they evoke, known as the affective gap. Researchers have used deep representation learning methods to address this challenge of extracting generalized features from entire images. However, most existing methods overlook the importance of specific emotional attributes such as brightness, colorfulness, scene understanding, and facial expressions. Through this paper, we introduce A4Net, a deep representation network to bridge the affective gap by leveraging four key attributes: brightness (Attribute 1), colorfulness (Attribute 2), scene context (Attribute 3), and facial expressions (Attribute 4). By fusing and jointly training all aspects of attribute recognition and visual emotion analysis, A4Net aims to provide a better insight into emotional content in images. Experimental results show the effectiveness of A4Net, showcasing competitive performance compared to state-of-the-art methods across diverse visual emotion datasets. Furthermore, visualizations of activation maps generated by A4Net offer insights into its ability to generalize across different visual emotion datasets.