Abstract: Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
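As a concrete illustration of the joint embedding described above, the sketch below assumes a PyTorch implementation; the class, dimensions, and toy features are illustrative placeholders rather than the authors' code. It fuses image, text, and tabular features with learned modality-specific embeddings in a small joint transformer, and combines a symmetric image-text contrastive loss with a DR-severity classification loss (the reconstruction objectives are omitted for brevity).

```python
# Minimal sketch (assumed PyTorch implementation; names, dimensions, and the
# random features standing in for encoder outputs are illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointMultimodalEncoder(nn.Module):
    def __init__(self, img_dim=768, txt_dim=768, tab_dim=32,
                 d_model=256, n_heads=4, n_layers=2, n_classes=5):
        super().__init__()
        # Project each modality (e.g. ViT-B/16, Bio-ClinicalBERT, tabular MLP
        # outputs) into a shared embedding space.
        self.img_proj = nn.Linear(img_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        self.tab_proj = nn.Sequential(nn.Linear(tab_dim, d_model), nn.ReLU(),
                                      nn.Linear(d_model, d_model))
        # Learned modality-specific embeddings added to each token before fusion.
        self.modality_emb = nn.Parameter(torch.zeros(3, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, n_classes)  # e.g. severity grades
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, img_feat, txt_feat, tab_feat):
        tokens = torch.stack([self.img_proj(img_feat),
                              self.txt_proj(txt_feat),
                              self.tab_proj(tab_feat)], dim=1)  # (B, 3, d_model)
        tokens = tokens + self.modality_emb          # add modality embeddings
        fused = self.fusion(tokens)                  # joint transformer fusion
        img_z = F.normalize(fused[:, 0], dim=-1)
        txt_z = F.normalize(fused[:, 1], dim=-1)
        logits_cls = self.classifier(fused.mean(dim=1))  # DR severity grading
        return img_z, txt_z, logits_cls


def contrastive_loss(img_z, txt_z, scale):
    # Symmetric InfoNCE-style loss between image and text embeddings.
    logits = scale.exp() * img_z @ txt_z.t()
    targets = torch.arange(img_z.size(0), device=img_z.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage with random tensors standing in for encoder outputs.
model = JointMultimodalEncoder()
img = torch.randn(8, 768)   # pooled image features (assumed)
txt = torch.randn(8, 768)   # pooled clinical-text features (assumed)
tab = torch.randn(8, 32)    # structured demographic/clinical features (assumed)
img_z, txt_z, logits = model(img, txt, tab)
loss = contrastive_loss(img_z, txt_z, model.logit_scale) + \
       F.cross_entropy(logits, torch.randint(0, 5, (8,)))
```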




Abstract: The rapid advancement of deepfake technologies for synthesizing highly realistic facial images and video has attracted considerable interest across fields including forensic analysis, cybersecurity, and digital character creation. Leveraging recent advances in deep learning, such as Generative Adversarial Networks, Variational Autoencoders, few-shot learning strategies, and Transformers, deepfake generation has reached a remarkable level of realism. In parallel, detection techniques continue to evolve to counter the misuse of deepfakes, addressing concerns that range from political manipulation to the spread of fake news and cyberbullying. This review surveys the most recent developments in deepfake generation and detection, covering around 400 publications. We first describe our systematic literature review methodology, then examine the technical details of the main generation techniques, including the challenges they face, potential solutions, and their manipulation formulations. We then benchmark leading approaches on prominent datasets and assess the contributions that have most influenced both generation and detection. Finally, we discuss open challenges and directions for continued progress in this rapidly evolving field.