What is facial recognition? Facial recognition is an AI-based technique for identifying or confirming an individual's identity from their face. It maps facial features from an image or video and compares them against a collection of known faces to find a match.
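As a rough illustration of the matching step, the sketch below compares a probe face embedding against a gallery of known embeddings using cosine similarity. The embedding extractor itself is assumed (any pre-trained face encoder would do), and the threshold value is hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe: np.ndarray, gallery: dict, threshold: float = 0.6):
    """Return the best-matching identity, or None if no score clears the
    (illustrative) threshold. `probe` and the values of `gallery` are face
    embeddings produced by some pre-trained face encoder (not shown here)."""
    best_name, best_score = None, threshold
    for name, reference in gallery.items():
        score = cosine_similarity(probe, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```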
Papers and Code
Mar 31, 2025
Abstract: Multimodal emotion recognition in conversation (MERC), the task of identifying the emotion label for each utterance in a conversation, is vital for developing empathetic machines. Current MLLM-based MERC studies focus mainly on capturing the speaker's textual or vocal characteristics, but ignore the significance of video-derived behavior information. Unlike text and audio inputs, video, which is rich in facial expressions, body language, and posture, provides emotion-trigger signals that help the model make more accurate emotion predictions. In this paper, we propose a novel behavior-aware MLLM-based framework (BeMERC) that incorporates the speaker's behaviors, including subtle facial micro-expressions, body language, and posture, into a vanilla MLLM-based MERC model, thereby facilitating the modeling of emotional dynamics during a conversation. Furthermore, BeMERC adopts a two-stage instruction-tuning strategy to extend the model to the conversation scenario for end-to-end training of a MERC predictor. Experiments demonstrate that BeMERC outperforms state-of-the-art methods on two benchmark datasets, and we provide a detailed discussion of the significance of video-derived behavior information in MERC.
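The abstract does not spell out the architecture, so the following PyTorch module is only a loose sketch of the general idea: fusing video-derived behavior features with text and audio features before emotion classification. All layer sizes, names, and the concatenation scheme are assumptions, not the BeMERC design.

```python
import torch
import torch.nn as nn

class BehaviorAwareFusion(nn.Module):
    """Illustrative fusion of text, audio, and behavior (video) features for
    utterance-level emotion classification. Not the BeMERC model; all
    dimensions and structure are assumptions for demonstration."""

    def __init__(self, dim: int = 256, num_emotions: int = 7):
        super().__init__()
        self.proj_text = nn.Linear(768, dim)      # assumed text encoder width
        self.proj_audio = nn.Linear(512, dim)     # assumed audio encoder width
        self.proj_behavior = nn.Linear(512, dim)  # assumed video encoder width
        self.classifier = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, num_emotions))

    def forward(self, text, audio, behavior):
        fused = torch.cat([self.proj_text(text),
                           self.proj_audio(audio),
                           self.proj_behavior(behavior)], dim=-1)
        return self.classifier(fused)  # per-utterance emotion logits
```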

Mar 14, 2025
Abstract: Compound Expression Recognition (CER) is crucial for understanding human emotions and improving human-computer interaction. However, CER faces challenges due to the complexity of facial expressions and the difficulty of capturing subtle emotional cues. To address these issues, we propose a novel approach leveraging Large Vision-Language Models (LVLMs). Our method employs a two-stage fine-tuning process: first, pre-trained LVLMs are fine-tuned on basic facial expressions to establish foundational patterns; second, the model is further optimized on a compound-expression dataset to refine visual-language feature interactions. Our approach achieves high accuracy on the RAF-DB dataset and demonstrates strong zero-shot generalization on the C-EXPR-DB dataset, showcasing its potential for real-world applications in emotion analysis and human-computer interaction.
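The two-stage schedule (basic expressions first, then compound ones) can be sketched as two ordinary supervised fine-tuning passes, as below. The model, data loaders, epoch counts, and learning rates are placeholders, not the paper's settings.

```python
import torch

def fine_tune(model, loader, epochs: int, lr: float, device: str = "cuda"):
    """One fine-tuning stage: plain supervised training on labeled images."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()

# Stage 1: adapt on basic expressions; Stage 2: refine on compound ones.
# `model`, `basic_loader`, and `compound_loader` are assumed to be defined.
# fine_tune(model, basic_loader, epochs=3, lr=1e-4)
# fine_tune(model, compound_loader, epochs=3, lr=1e-5)
```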

Jan 25, 2025
Abstract: In human-centric scenes, the ability to simultaneously understand visual and auditory information is crucial. While recent omni models can process multiple modalities, they generally lack effectiveness in human-centric scenes due to the absence of large-scale specialized datasets and the use of non-targeted architectures. In this work, we developed HumanOmni, the industry's first human-centric Omni-multimodal large language model. We constructed a dataset containing over 2.4 million human-centric video clips with detailed captions and more than 14 million instructions, facilitating the understanding of diverse human-centric scenes. HumanOmni includes three specialized branches for understanding different types of scenes. It adaptively fuses features from these branches based on user instructions, significantly enhancing visual understanding in scenes centered around individuals. Moreover, HumanOmni integrates audio features to ensure a comprehensive understanding of environments and individuals. Our experiments validate HumanOmni's advanced capabilities in handling human-centric scenes across a variety of tasks, including emotion recognition, facial expression description, and action understanding. Our model will be open-sourced to facilitate further development and collaboration within both academia and industry.
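One common way to fuse multiple branches conditioned on an instruction is a softmax gate over branch features. The sketch below shows that general pattern only; it is not the HumanOmni architecture, and every name and dimension is an assumption.

```python
import torch
import torch.nn as nn

class InstructionGatedFusion(nn.Module):
    """Illustrative instruction-conditioned fusion of three vision branches.
    Not HumanOmni itself; sizes and structure are assumptions."""

    def __init__(self, instr_dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(instr_dim, 3)  # one mixing weight per branch

    def forward(self, instr_emb, face_feat, body_feat, scene_feat):
        # Softmax weights decide how much each branch contributes.
        weights = torch.softmax(self.gate(instr_emb), dim=-1)          # (B, 3)
        branches = torch.stack([face_feat, body_feat, scene_feat], 1)  # (B, 3, D)
        return (weights.unsqueeze(-1) * branches).sum(dim=1)           # (B, D)
```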

Jan 30, 2025
Abstract: Under Display Camera (UDC) is an advanced imaging system that places a digital camera lens underneath a display panel, effectively concealing the camera. However, the display panel significantly degrades captured images or videos, introducing low transmittance, blur, noise, and flare. Tackling these issues is challenging because of the complex degradation of UDCs, including diverse flare patterns. Despite extensive research on UDC images and their restoration models, studies on videos have yet to be significantly explored. While two UDC video datasets exist, they primarily focus on unrealistic or synthetic UDC degradation rather than real-world UDC degradation. In this paper, we propose a real-world UDC video dataset called UDC-VIT. Unlike existing datasets, UDC-VIT exclusively includes human motions intended for facial recognition. We propose a video-capturing system that simultaneously acquires non-degraded and UDC-degraded videos of the same scene, and we align each pair of captured videos frame by frame using the discrete Fourier transform (DFT). We compare UDC-VIT with six representative UDC still-image datasets and two existing UDC video datasets. Using six deep-learning models, we compare UDC-VIT and an existing synthetic UDC video dataset. The results indicate the ineffectiveness of models trained on earlier synthetic UDC video datasets, as those datasets do not reflect the actual characteristics of UDC-degraded videos. We also demonstrate the importance of effective UDC restoration by evaluating face recognition accuracy with respect to PSNR, SSIM, and LPIPS scores. UDC-VIT enables further exploration of UDC video restoration and offers better insight into the challenge. UDC-VIT is available at our project site.
* Main body (10 pages, 9 Figures, 3 Tables), References (4 pages), Appendix (15 pages, 11 Figures, 6 Tables)
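The abstract mentions DFT-based frame alignment; a standard way to do this is phase correlation, sketched below for estimating a pure translation between two grayscale frames. This is the generic technique, not necessarily the authors' exact procedure.

```python
import numpy as np

def phase_correlation_shift(ref: np.ndarray, target: np.ndarray):
    """Estimate the (row, col) translation of `target` relative to `ref`
    using DFT phase correlation. Inputs are 2-D grayscale frames of
    equal shape."""
    cross_power = np.fft.fft2(ref) * np.conj(np.fft.fft2(target))
    cross_power /= np.abs(cross_power) + 1e-12  # keep phase information only
    corr = np.fft.ifft2(cross_power).real
    peak = list(np.unravel_index(np.argmax(corr), corr.shape))
    # Shifts beyond half the frame size wrap around to negative offsets.
    for axis in (0, 1):
        if peak[axis] > corr.shape[axis] // 2:
            peak[axis] -= corr.shape[axis]
    return tuple(peak)
```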

Mar 27, 2025
Abstract: Micro-expressions are short bursts of emotion that are difficult to hide. Detecting them in children is an important cue that can assist psychotherapists in conducting better therapy. However, existing research on micro-expression detection has focused on adults, whose expressions differ in their characteristics from those of children. This gap is a direct consequence of the lack of a child-based micro-expression dataset, as children's facial expressions are much more challenging to capture owing to their limited predictability and controllability. This study compiles a dataset of spontaneous child micro-expression videos, the first of its kind to the best of the authors' knowledge. The dataset is captured in the wild using video-conferencing software. It enables us to explore key features and differences between adult and child micro-expressions. This study also establishes a baseline for the automated spotting and recognition of micro-expressions in children using three approaches comprising hand-crafted and learning-based methods.
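A common hand-crafted baseline for micro-expression spotting is to flag frames where facial motion briefly spikes. The sketch below uses OpenCV's Farneback optical flow on grayscale face crops; the threshold value is hypothetical, and this is not necessarily the paper's baseline.

```python
import cv2
import numpy as np

def spot_candidate_frames(gray_frames, threshold: float = 1.5):
    """Illustrative spotting baseline: flag frames whose mean optical-flow
    magnitude exceeds `threshold` (a hypothetical value). `gray_frames`
    is a list of aligned grayscale face crops from one video."""
    candidates = []
    for i in range(1, len(gray_frames)):
        flow = cv2.calcOpticalFlowFarneback(
            gray_frames[i - 1], gray_frames[i], None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        magnitude = np.linalg.norm(flow, axis=-1).mean()
        if magnitude > threshold:
            candidates.append(i)
    return candidates
```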

Jan 15, 2025
Abstract: Facial recognition models are increasingly employed by commercial enterprises, government agencies, and cloud service providers for identity verification, consumer services, and surveillance. These models are often trained using vast amounts of facial data processed and stored in cloud-based platforms, raising significant privacy concerns. Users' facial images may be exploited without their consent, leading to potential data breaches and misuse. This survey presents a comprehensive review of current methods aimed at preserving facial image privacy in cloud-based services. We categorize these methods into two primary approaches: image obfuscation-based protection and adversarial perturbation-based protection. We provide an in-depth analysis of both categories, offering qualitative and quantitative comparisons of their effectiveness. Additionally, we highlight unresolved challenges and propose future research directions to improve privacy preservation in cloud computing environments.
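To make the two protection families concrete, the sketch below shows a minimal obfuscation (coarse pixelation) and a bounded-noise stand-in for a perturbation. Real adversarial-perturbation methods optimize the noise against a specific recognition model; the random version here is illustrative only.

```python
import cv2
import numpy as np

def obfuscate_pixelate(img: np.ndarray, block: int = 16) -> np.ndarray:
    """Image-obfuscation protection: coarse pixelation of a face image."""
    h, w = img.shape[:2]
    small = cv2.resize(img, (w // block, h // block),
                       interpolation=cv2.INTER_LINEAR)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)

def perturb_random(img: np.ndarray, epsilon: float = 8.0) -> np.ndarray:
    """Stand-in for adversarial perturbation: bounded random noise.
    Real methods optimize the perturbation against a recognition model."""
    noise = np.random.uniform(-epsilon, epsilon, img.shape)
    return np.clip(img.astype(float) + noise, 0, 255).astype(np.uint8)
```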

Jan 21, 2025
Abstract: Low-light image enhancement (LLE) aims to improve the visual quality of images captured in poorly lit conditions, which often suffer from low brightness, low contrast, noise, and color distortions. These issues hinder the performance of computer vision tasks such as object detection, facial recognition, and autonomous driving. Traditional enhancement techniques, such as multi-scale fusion and histogram equalization, fail to preserve fine details and often struggle with maintaining the natural appearance of enhanced images under complex lighting conditions. Although the Retinex theory provides a foundation for image decomposition, it often amplifies noise, leading to suboptimal image quality. In this paper, we propose the Dual Light Enhance Network (DLEN), a novel architecture that incorporates two distinct attention mechanisms, considering both spatial and frequency domains. Our model introduces a learnable wavelet transform module in the illumination estimation phase, preserving high- and low-frequency components to enhance edge and texture details. Additionally, we design a dual-branch structure that leverages the power of the Transformer architecture to enhance both the illumination and structural components of the image. Through extensive experiments, our model outperforms state-of-the-art methods on standard benchmarks. Code is available here: https://github.com/LaLaLoXX/DLEN
* 10 pages, 6 figures
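The paper's wavelet module is learnable; as a fixed-basis stand-in, a single-level Haar DWT with PyWavelets splits an illumination map into the low- and high-frequency sub-bands the abstract refers to.

```python
import numpy as np
import pywt

def haar_decompose(illumination: np.ndarray):
    """Single-level 2-D Haar wavelet transform of an illumination map.
    Returns the low-frequency approximation and three high-frequency
    detail sub-bands (horizontal, vertical, diagonal). A fixed-basis
    stand-in for DLEN's learnable wavelet module."""
    cA, (cH, cV, cD) = pywt.dwt2(illumination, "haar")
    return cA, (cH, cV, cD)

# Inverse transform (e.g., after enhancing the sub-bands separately):
# restored = pywt.idwt2((cA, (cH, cV, cD)), "haar")
```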

Jan 30, 2025
Abstract: Facial Emotion Recognition has become increasingly pivotal in the domain of User Experience, notably within modern usability testing, as it facilitates a deeper comprehension of user satisfaction and engagement. This study extends the ResEmoteNet model by employing a knowledge distillation framework to develop Mini-ResEmoteNet models, lightweight student models tailored for usability testing. Experiments were conducted on the FER2013 and RAF-DB datasets to assess the efficacy of three student model architectures: Student Model A, Student Model B, and Student Model C. Their development involves reducing the number of feature channels in each layer of the teacher model by approximately 50%, 75%, and 87.5%, respectively. Demonstrating exceptional performance on the FER2013 dataset, Student Model A (E1) achieved a test accuracy of 76.33%, a 0.21% absolute improvement over EmoNeXt. Moreover, the results show improvements in inference speed and memory usage compared to the ResEmoteNet model. The findings indicate that the proposed methods surpass other state-of-the-art approaches.
* 5 pages with 4 figures
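The standard knowledge-distillation objective, a temperature-softened KL term between teacher and student logits plus a hard-label cross-entropy term, is sketched below. The temperature and weighting are illustrative hyperparameters, not the paper's values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.7):
    """Hinton-style distillation loss. `T` and `alpha` are illustrative
    hyperparameters, not the values used in the paper."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean") * (T * T)       # rescale gradient magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```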

Jan 21, 2025
Abstract: Video deblurring is an essential task for autonomous driving, facial recognition, and security surveillance. Traditional methods directly estimate motion blur kernels, often introducing artifacts and leading to poor results. Recent approaches utilize the detection of sharp frames within video sequences to enhance deblurring. However, existing datasets rely on a fixed number of sharp frames, which may be too restrictive for some applications and may introduce a bias during model training. To address these limitations and enhance domain adaptability, this work first introduces GoPro Random Sharp (GoProRS), a new dataset in which the frequency of sharp frames within the sequence is customizable, allowing more diverse training and testing scenarios. Furthermore, it presents a novel video deblurring model, called SPEINet, that integrates sharp-frame features into blurry-frame reconstruction through an attention-based encoder-decoder architecture, a lightweight yet robust sharp-frame detector, and an edge-extraction phase. Extensive experimental results demonstrate that SPEINet outperforms state-of-the-art methods across multiple datasets, achieving an average PSNR improvement of +3.2% over recent techniques. Given such promising results, we believe that both the proposed model and dataset pave the way for future advancements in video deblurring based on the detection of sharp frames.
* Under review in Pattern Recognition
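A common lightweight sharp-frame detector is the variance of the Laplacian, sketched below as an illustration of the kind of detection the abstract mentions, not SPEINet's actual detector. The threshold is a hypothetical value that would need tuning per dataset.

```python
import cv2
import numpy as np

def sharpness_score(frame_bgr: np.ndarray) -> float:
    """Variance of the Laplacian: higher values indicate sharper frames."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def pick_sharp_frames(frames, threshold: float = 100.0):
    """Return indices of frames whose sharpness exceeds a (hypothetical)
    threshold; these could seed a sharp-frame-guided deblurring model."""
    return [i for i, f in enumerate(frames) if sharpness_score(f) > threshold]
```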

Mar 05, 2025
Abstract: Manual attendance tracking at large-scale events, such as marriage functions or conferences, is often inefficient and prone to human error. To address this challenge, we propose an automated, cloud-based attendance tracking system that uses cameras mounted at the entrance and exit gates. The mounted cameras continuously capture video and send the video data to cloud services that perform real-time face detection and recognition. Unlike existing solutions, our system accurately identifies attendees even when they are not looking directly at the camera, allowing natural movements such as looking around or talking while walking. To the best of our knowledge, this is the first system to achieve high recognition rates under such dynamic conditions. Our system achieves an overall accuracy of 90%, with each video frame processed within 5 seconds, ensuring real-time operation without frame loss. In addition, notifications are sent to security personnel within the same latency. The system achieves 100% accuracy for individuals without facial obstructions and successfully recognizes all attendees appearing within the camera's field of view, providing a robust solution for attendee recognition at large-scale social events.
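The paper does not publish its pipeline at code level; as a minimal sketch of the per-frame detect-encode-match loop such a system implies, the example below uses the open-source face_recognition library. The tolerance value is illustrative, and enrollment (building `known_encodings`) is assumed to have happened beforehand.

```python
import face_recognition  # open-source library, not the paper's system

def match_attendees(frame_rgb, known_encodings, known_names, tolerance=0.6):
    """Detect faces in one video frame and return recognized attendee names.
    `known_encodings` and `known_names` come from enrollment photos; the
    tolerance value is illustrative, not the paper's setting."""
    present = []
    locations = face_recognition.face_locations(frame_rgb)
    encodings = face_recognition.face_encodings(frame_rgb, locations)
    for encoding in encodings:
        matches = face_recognition.compare_faces(known_encodings, encoding,
                                                 tolerance=tolerance)
        if True in matches:
            present.append(known_names[matches.index(True)])
    return present
```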
