Abstract:The rapid proliferation of AI-generated images, powered by generative adversarial networks (GANs), diffusion models, and other synthesis techniques, has raised serious concerns about misinformation, copyright violations, and digital security. However, detecting such images in a generalized and robust manner remains a major challenge due to the vast diversity of generative models and data distributions. In this work, we present \textbf{Diversity Matters}, a novel framework that emphasizes data diversity and feature domain complementarity for AI-generated image detection. The proposed method introduces a feature-domain similarity filtering mechanism that discards redundant or highly similar samples across both inter-class and intra-class distributions, ensuring a more diverse and representative training set. Furthermore, we propose a dual-branch network that combines CLIP features from the pixel domain and the frequency domain to jointly capture semantic and structural cues, leading to improved generalization against unseen generative models and adversarial conditions. Extensive experiments on benchmark datasets demonstrate that the proposed approach significantly improves cross-model and cross-dataset performance compared to existing methods. \textbf{Diversity Matters} highlights the critical role of data and feature diversity in building reliable and robust detectors against the rapidly evolving landscape of synthetic content.
Abstract:Multimedia data, particularly images and videos, is integral to various applications, including surveillance, visual interaction, biometrics, evidence gathering, and advertising. However, amateur or skilled counterfeiters can simulate them to create deepfakes, often for slanderous motives. To address this challenge, several forensic methods have been developed to ensure the authenticity of the content. The effectiveness of these methods depends on their focus, with challenges arising from the diverse nature of manipulations. In this article, we analyze existing forensic methods and observe that each method has unique strengths in detecting deepfake traces by focusing on specific facial regions, such as the frame, face, lips, eyes, or nose. Considering these insights, we propose a novel hybrid approach called Face2Parts based on hierarchical feature representation ($HFR$) that takes advantage of coarse-to-fine information to improve deepfake detection. The proposed method involves extracting features from the frame, face, and key facial regions (i.e., lips, eyes, and nose) separately to explore the coarse-to-fine relationships. This approach enables us to capture inter-dependencies among facial regions using a channel-attention mechanism and deep triplet learning. We evaluated the proposed method on benchmark deepfake datasets in both intra-, inter-dataset, and inter-manipulation settings. The proposed method achieves an average AUC of 98.42\% on FF++, 79.80\% on CDF1, 85.34\% on CDF2, 89.41\% on DFD, 84.07\% on DFDC, 95.62\% on DTIM, 80.76\% on PDD, and 100\% on WLDR, respectively. The results demonstrate that our approach generalizes effectively and achieves promising performance to outperform the existing methods.