Abstract: The increasing prevalence of computer vision applications necessitates handling vast amounts of visual data, often containing personal information. While this technology offers significant benefits, it should not compromise privacy. Data privacy regulations emphasize the need for individual consent when processing personal data, hindering researchers' ability to collect high-quality datasets containing individuals' faces. This paper presents a deep learning-based face anonymization pipeline to overcome this challenge. Unlike most existing methods, ours leverages recent advancements in diffusion-based inpainting models, eliminating the need to train Generative Adversarial Networks. The pipeline employs a three-stage approach: face detection with RetinaNet, feature extraction with VGG-Face, and realistic face generation using the state-of-the-art BrushNet diffusion model. BrushNet utilizes the entire image, face masks, and text prompts specifying desired facial attributes such as age, ethnicity, gender, and expression. This enables the generation of natural-looking images with unrecognizable individuals, facilitating the creation of privacy-compliant datasets for computer vision research.
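As a rough illustration of how such a detection-plus-inpainting pipeline could be wired together (not the authors' implementation), the sketch below uses a hypothetical `detect_faces` helper in place of RetinaNet, omits the VGG-Face feature-extraction stage, and substitutes a generic Stable Diffusion inpainting checkpoint for BrushNet:

```python
# Minimal sketch of the detect-then-inpaint idea; all model choices here are stand-ins.
from PIL import Image, ImageDraw
import torch
from diffusers import StableDiffusionInpaintPipeline

def detect_faces(image: Image.Image):
    """Hypothetical detector returning face boxes as (left, top, right, bottom)."""
    raise NotImplementedError("plug in RetinaNet/RetinaFace or any face detector")

def anonymize(image_path: str, prompt: str) -> Image.Image:
    image = Image.open(image_path).convert("RGB")
    # Stage 1: detect faces and build a binary mask over the face regions.
    mask = Image.new("L", image.size, 0)
    draw = ImageDraw.Draw(mask)
    for box in detect_faces(image):
        draw.rectangle(box, fill=255)
    # Stage 3: inpaint the masked regions, conditioned on a text prompt describing
    # the desired attributes (age, ethnicity, gender, expression). A generic public
    # inpainting checkpoint stands in for BrushNet here.
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(prompt=prompt, image=image, mask_image=mask).images[0]
```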
Abstract: The increasing demand for large-scale visual data, coupled with strict privacy regulations, has driven research into anonymization methods that hide personal identities without seriously degrading data quality. In this paper, we explore the potential of face swapping methods to preserve privacy in video data. Through extensive evaluations focusing on temporal consistency, anonymity strength, and visual fidelity, we find that face swapping techniques can produce consistent facial transitions and effectively hide identities. These results underscore the suitability of face swapping for privacy-preserving video applications and lay the groundwork for future advancements in anonymization-focused face-swapping models.
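As a rough illustration of two of these evaluation axes, the sketch below assumes a hypothetical `embed_face` function returning L2-normalized face-recognition embeddings; it is not the evaluation protocol used in the paper:

```python
import numpy as np

def anonymity_strength(orig_frames, swapped_frames, embed_face):
    """Mean cosine distance between original and swapped identities (higher = more anonymous)."""
    dists = [1.0 - float(np.dot(embed_face(o), embed_face(s)))
             for o, s in zip(orig_frames, swapped_frames)]
    return float(np.mean(dists))

def temporal_consistency(swapped_frames, embed_face):
    """Mean cosine similarity between consecutive swapped frames (higher = more stable identity)."""
    embs = [embed_face(f) for f in swapped_frames]
    sims = [float(np.dot(a, b)) for a, b in zip(embs[:-1], embs[1:])]
    return float(np.mean(sims))
```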
Abstract: Object counting has recently shifted towards class-agnostic counting (CAC), which addresses the challenge of counting objects across arbitrary categories, tackling a critical need in versatile counting systems. While humans effortlessly identify and count objects from diverse categories without prior knowledge, most counting methods remain restricted to enumerating instances of known classes, requiring extensive labeled datasets for training and struggling under open-vocabulary settings. Conversely, CAC aims to count objects belonging to classes never seen during training, typically operating in a few-shot setting. In this paper, for the first time, we review advancements in CAC methodologies, categorizing them into three paradigms based on how target object classes can be specified: reference-based, reference-less, and open-world text-guided. Reference-based approaches have set performance benchmarks using exemplar-guided mechanisms. Reference-less methods eliminate exemplar dependency by leveraging inherent image patterns. Finally, open-world text-guided methods utilize vision-language models, enabling object class descriptions through textual prompts, representing a flexible and appealing solution. We analyze state-of-the-art techniques and report their results on existing gold-standard benchmarks, comparing their performance and discussing their strengths and limitations. Persistent challenges -- such as annotation dependency, scalability, and generalization -- are discussed, alongside future directions. We believe this survey provides a valuable resource for researchers to follow the progressive development of CAC and its current state of the art, and offers insights into future directions and open challenges.
Abstract: Face alignment is a crucial step in preparing face images for feature extraction in facial analysis tasks. For applications such as face recognition, facial expression recognition, and facial attribute classification, alignment is widely utilized during both training and inference to standardize the positions of key landmarks in the face. It is well known that the application and method of face alignment significantly affect the performance of facial analysis models. However, the impact of alignment on face image quality has not been thoroughly investigated. Current face image quality assessment (FIQA) studies often assume alignment as a prerequisite but do not explicitly evaluate how alignment affects quality metrics, especially with the advent of modern deep learning-based detectors that integrate detection and landmark localization. To address this gap, our study examines the impact of face alignment on face image quality scores. We conducted experiments on the LFW, IJB-B, and SCFace datasets, employing MTCNN and RetinaFace models for face detection and alignment. To evaluate face image quality, we utilized several assessment methods, including SER-FIQ, FaceQAN, DifFIQA, and SDD-FIQA. Our analysis included examining quality score distributions for the LFW and IJB-B datasets and average quality scores at varying distances in the SCFace dataset. Our findings reveal that face image quality assessment methods are sensitive to alignment. Moreover, this sensitivity increases under challenging real-life conditions, highlighting the importance of evaluating alignment's role in quality assessment.
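A minimal sketch of this kind of comparison, assuming hypothetical `crop_face`, `align_face`, and `quality_score` helpers standing in for the detectors and FIQA methods listed above:

```python
import numpy as np

def quality_shift(images, crop_face, align_face, quality_score):
    """Compare FIQA score distributions with and without landmark-based alignment."""
    unaligned = np.array([quality_score(crop_face(img)) for img in images])
    aligned = np.array([quality_score(align_face(img)) for img in images])
    return {
        "mean_unaligned": float(unaligned.mean()),
        "mean_aligned": float(aligned.mean()),
        "mean_shift": float((aligned - unaligned).mean()),
    }
```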
Abstract: Maritime obstacle detection aims to detect possible obstacles for autonomous driving of unmanned surface vehicles. In this context, the water surface can act like a mirror under certain circumstances, causing reflections in imagery. Previous works have indicated surface reflections as a source of false positives for object detectors in maritime obstacle detection tasks. In this work, we show that surface reflections indeed adversely affect detector performance. We measure the effect of reflections by testing on two custom datasets, which we make publicly available. The first contains imagery with reflections, while in the second the reflections are inpainted. We show that reflections reduce mAP by 1.2 to 9.6 points across various detectors. To remove false positives caused by reflections, we propose a novel filtering approach named Heatmap Based Sliding Filter. We show that the proposed method reduces the total number of false positives by 34.64% while minimally affecting true positives. We also conduct qualitative analysis and show that the proposed method indeed removes false positives on reflections. The datasets can be found at https://github.com/SamedYalcin/MRAD.
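The abstract does not detail the Heatmap Based Sliding Filter itself; the sketch below only illustrates the general idea of suppressing detections that fall on likely reflection regions, using a hypothetical `reflection_heatmap` function, and is not the paper's algorithm:

```python
import numpy as np

def filter_reflection_fps(image, boxes, scores, reflection_heatmap, thr=0.5):
    """Drop boxes whose area is dominated by likely surface reflections."""
    heat = reflection_heatmap(image)  # hypothetical HxW map of reflection likelihood in [0, 1]
    keep_boxes, keep_scores = [], []
    for (x1, y1, x2, y2), s in zip(boxes, scores):
        region = heat[int(y1):int(y2), int(x1):int(x2)]
        if region.size == 0 or region.mean() < thr:
            keep_boxes.append((x1, y1, x2, y2))
            keep_scores.append(s)
    return keep_boxes, keep_scores
```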
Abstract: A face recognition model is typically trained on large datasets of images that may be collected from controlled environments. This results in performance discrepancies when applied to real-world scenarios due to the domain gap between clean and in-the-wild images. Therefore, some researchers have investigated the robustness of these models by analyzing synthetic degradations. Yet, existing studies have mostly focused on single degradation factors, which may not fully capture the complexity of real-world degradations. This work addresses this problem by analyzing the impact of both single and combined degradations using a real-world degradation pipeline extended with under/over-exposure conditions. We use the LFW dataset for our experiments and assess the model's performance based on verification accuracy. Results reveal that the model behaves differently under single and combined degradations: combining degradations can significantly lower performance even when each individual degradation has a negligible effect. This work emphasizes the importance of accounting for real-world complexity to assess the robustness of face recognition models in real-world settings. The code is publicly available at https://github.com/ThEnded32/AnalyzingCombinedDegradations.
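A minimal sketch of applying single versus combined synthetic degradations with Pillow, assuming the paper's actual degradation pipeline is more elaborate; verification accuracy would then be measured on image pairs degraded this way:

```python
import io
from PIL import Image, ImageEnhance, ImageFilter

def blur(img, radius=2.0):
    return img.filter(ImageFilter.GaussianBlur(radius))

def jpeg(img, quality=20):
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def exposure(img, gain=1.8):  # gain > 1 over-exposes, gain < 1 under-exposes
    return ImageEnhance.Brightness(img).enhance(gain)

def degrade(img, combined=True):
    """Apply a single degradation or a combination, as compared in the study."""
    if not combined:
        return blur(img)
    return jpeg(exposure(blur(img)))
```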
Abstract: Advancements such as Generative Adversarial Networks have drawn researchers' attention to face image synthesis and to generating ever more realistic images. Consequently, the need for evaluation criteria to assess the realism of generated images has become apparent. While FID computed with InceptionV3 features is one of the primary choices for benchmarking, concerns about InceptionV3's limitations for face images have emerged. This study investigates the behavior of diverse feature extractors -- InceptionV3, CLIP, DINOv2, and ArcFace -- across a variety of metrics -- FID, KID, and Precision & Recall. The FFHQ dataset is used as the target domain, while the CelebA-HQ dataset and synthetic datasets generated with StyleGAN2 and Projected FastGAN serve as source domains. Experiments include an in-depth analysis of the features: $L_2$ normalization, model attention during extraction, and domain distributions in the feature space. We aim to provide valuable insights into the behavior of feature extractors for evaluating face image synthesis methodologies. The code is publicly available at https://github.com/ThEnded32/AnalyzingFeatureExtractors.
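For reference, FID itself is straightforward to compute once features are extracted; the sketch below takes (N, D) feature arrays produced by any of the backbones above (feature extraction code omitted) and is a generic implementation, not the study's exact evaluation code:

```python
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to real and generated feature sets."""
    mu_r, mu_f = real_feats.mean(0), fake_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop numerical imaginary residue
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```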
Abstract: In the task of talking face generation, the objective is to generate a face video with lips synchronized to the corresponding audio while preserving visual details and identity information. Current methods face the challenges of learning accurate lip synchronization without degrading visual quality and of robustly evaluating such synchronization. To tackle these problems, we propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training. Moreover, leveraging AV-HuBERT's features, we introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance. Experimental results, along with a detailed ablation study, demonstrate the effectiveness of our approach and the utility of the proposed evaluation metrics.
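A hedged sketch of a feature-space lip-sync loss of this kind, assuming audio and visual feature tensors of shape (B, T, D) have already been extracted with an audio-visual speech model such as AV-HuBERT (whose exact API is omitted); this is an illustration of the idea, not the paper's loss:

```python
import torch
import torch.nn.functional as F

def lip_sync_loss(audio_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
    """1 - mean cosine similarity between temporally aligned audio and visual features."""
    cos = F.cosine_similarity(audio_feats, video_feats, dim=-1)  # (B, T)
    return 1.0 - cos.mean()
```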
Abstract: Convolutional Neural Networks (CNNs) have become widely adopted for medical image segmentation tasks, demonstrating promising performance. However, the inherent inductive biases in convolutional architectures limit their ability to model long-range dependencies and spatial correlations. While recent transformer-based architectures address these limitations by leveraging self-attention mechanisms to encode long-range dependencies and learn expressive representations, they often struggle to extract low-level features and are highly dependent on data availability. This motivated us to develop GLIMS, a data-efficient attention-guided hybrid volumetric segmentation network. GLIMS utilizes Dilated Feature Aggregator Convolutional Blocks (DACB) to capture local-global feature correlations efficiently. Furthermore, the incorporated Swin Transformer-based bottleneck bridges the local and global features to improve the robustness of the model. Additionally, GLIMS employs an attention-guided segmentation approach through Channel and Spatial-Wise Attention Blocks (CSAB) to localize expressive features for fine-grained border segmentation. Quantitative and qualitative results on glioblastoma and multi-organ CT segmentation tasks demonstrate GLIMS' effectiveness in terms of complexity and accuracy. GLIMS demonstrated outstanding performance on the BraTS2021 and BTCV datasets, surpassing the performance of Swin UNETR. Notably, GLIMS achieved this high performance with a significantly reduced number of trainable parameters: GLIMS has 47.16M trainable parameters and 72.30G FLOPs, while Swin UNETR has 61.98M trainable parameters and 394.84G FLOPs. The code is publicly available at https://github.com/yaziciz/GLIMS.
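A minimal sketch of the dilated-aggregation idea (not the authors' exact DACB block), shown as parallel dilated 3D convolutions fused by a 1x1x1 convolution:

```python
import torch
import torch.nn as nn

class DilatedAggregator3D(nn.Module):
    """Illustrative block: parallel dilations capture multi-scale context, then fuse."""
    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.fuse = nn.Conv3d(out_ch * len(dilations), out_ch, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the dilated branches along channels and fuse them.
        return self.act(self.fuse(torch.cat([b(x) for b in self.branches], dim=1)))
```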
Abstract: Glioblastoma is a highly aggressive and malignant brain tumor type that requires early diagnosis and prompt intervention. Due to its heterogeneity in appearance, developing automated detection approaches is challenging. To address this challenge, Artificial Intelligence (AI)-driven approaches in healthcare have generated interest in efficiently diagnosing and evaluating brain tumors. The Brain Tumor Segmentation Challenge (BraTS) is a platform for developing and assessing automated techniques for tumor analysis using high-quality, clinically acquired MRI data. In our approach, we utilized a multi-scale, attention-guided, hybrid U-Net-shaped model -- GLIMS -- to perform 3D brain tumor segmentation in three regions: Enhancing Tumor (ET), Tumor Core (TC), and Whole Tumor (WT). The multi-scale feature extraction provides better contextual feature aggregation at high resolutions, and the Swin Transformer blocks improve global feature extraction at deeper levels of the model. The segmentation mask generation in the decoder branch is guided by the attention-refined features gathered from the encoder branch to enhance the important attributes. Moreover, hierarchical supervision is used to train the model efficiently. On the validation set, our model achieved Dice Scores of 92.19, 87.75, and 83.18 and Lesion-wise Dice Scores of 89.09, 84.67, and 82.15 for WT, TC, and ET, respectively. The code is publicly available at https://github.com/yaziciz/GLIMS.
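For reference, a standard (non lesion-wise) Dice score over a binary region mask can be computed as follows; this is a generic sketch, not the challenge's official evaluation code:

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice overlap between binary masks for a region such as WT, TC, or ET."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))
```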