Abstract:Deepfake detection is widely framed as a machine learning problem, yet how humans and AI detectors compare under realistic conditions remains poorly understood. We evaluate 200 human participants and 95 state-of-the-art AI detectors across two datasets: DF40, a standard benchmark, and CharadesDF, a novel dataset of videos of everyday activities. CharadesDF was recorded using mobile phones leading to low/moderate quality videos compared to the more professionally captured DF40. Humans outperform AI detectors on both datasets, with the gap widening in the case of CharadesDF where AI accuracy collapses to near chance (0.537) while humans maintain robust performance (0.784). Human and AI errors are complementary: humans miss high-quality deepfakes while AI detectors flag authentic videos as fake, and hybrid human-AI ensembles reduce high-confidence errors. These findings suggest that effective real-world deepfake detection, especially in non-professionally produced videos, requires human-AI collaboration rather than AI algorithms alone.
Abstract:Current medical vision-language models (VLMs) process volumetric brain MRI using 2D slice-based approximations, fragmenting the spatial context required for accurate neuroradiological interpretation. We developed \textbf{Brain3D}, a staged vision-language framework for automated radiology report generation from 3D brain tumor MRI. Our approach inflates a pretrained 2D medical encoder into a native 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Unlike generalist 3D medical VLMs, \textbf{Brain3D} is tailored to neuroradiology, where hemispheric laterality, tumor infiltration patterns, and anatomical localization are critical. Evaluated on 468 subjects (BraTS pathological cases plus healthy controls), our model achieves a Clinical Pathology F1 of 0.951 versus 0.413 for a strong 2D baseline while maintaining perfect specificity on healthy scans. The staged alignment proves essential: contrastive grounding establishes visual-textual correspondence, projector warmup stabilizes conditioning, and LoRA adaptation shifts output from verbose captions to structured clinical reports\footnote{Our code is publicly available for transparency and reproducibility
Abstract:Humans use context to assess the veracity of information. However, current audio deepfake detectors only analyze the audio file without considering either context or transcripts. We create and analyze a Journalist-provided Deepfake Dataset (JDD) of 255 public deepfakes which were primarily contributed by over 70 journalists since early 2024. We also generate a synthetic audio dataset (SYN) of dead public figures and propose a novel Context-based Audio Deepfake Detector (CADD) architecture. In addition, we evaluate performance on two large-scale datasets: ITW and P$^2$V. We show that sufficient context and/or the transcript can significantly improve the efficacy of audio deepfake detectors. Performance (measured via F1 score, AUC, and EER) of multiple baseline audio deepfake detectors and traditional classifiers can be improved by 5%-37.58% in F1-score, 3.77%-42.79% in AUC, and 6.17%-47.83% in EER. We additionally show that CADD, via its use of context and/or transcripts, is more robust to 5 adversarial evasion strategies, limiting performance degradation to an average of just -0.71% across all experiments. Code, models, and datasets are available at our project page: https://sites.northwestern.edu/nsail/cadd-context-based-audio-deepfake-detection (access restricted during review).



Abstract:Medical question answering systems face deployment challenges including hallucinations, bias, computational demands, privacy concerns, and the need for specialized expertise across diverse domains. Here, we present SOLVE-Med, a multi-agent architecture combining domain-specialized small language models for complex medical queries. The system employs a Router Agent for dynamic specialist selection, ten specialized models (1B parameters each) fine-tuned on specific medical domains, and an Orchestrator Agent that synthesizes responses. Evaluated on Italian medical forum data across ten specialties, SOLVE-Med achieves superior performance with ROUGE-1 of 0.301 and BERTScore F1 of 0.697, outperforming standalone models up to 14B parameters while enabling local deployment. Our code is publicly available on GitHub: https://github.com/PRAISELab-PicusLab/SOLVE-Med.
Abstract:Misinformation in healthcare, from vaccine hesitancy to unproven treatments, poses risks to public health and trust in medical systems. While machine learning and natural language processing have advanced automated fact-checking, validating biomedical claims remains uniquely challenging due to complex terminology, the need for domain expertise, and the critical importance of grounding in scientific evidence. We introduce CER (Combining Evidence and Reasoning), a novel framework for biomedical fact-checking that integrates scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction. By integrating the text-generation capabilities of large language models with advanced retrieval techniques for high-quality biomedical scientific evidence, CER effectively mitigates the risk of hallucinations, ensuring that generated outputs are grounded in verifiable, evidence-based sources. Evaluations on expert-annotated datasets (HealthFC, BioASQ-7b, SciFact) demonstrate state-of-the-art performance and promising cross-dataset generalization. Code and data are released for transparency and reproducibility: https: //github.com/PRAISELab-PicusLab/CER.
Abstract:Misinformation in healthcare, from vaccine hesitancy to unproven treatments, poses risks to public health and trust in medical systems. While machine learning and natural language processing have advanced automated fact-checking, validating biomedical claims remains uniquely challenging due to complex terminology, the need for domain expertise, and the critical importance of grounding in scientific evidence. We introduce CER (Combining Evidence and Reasoning), a novel framework for biomedical fact-checking that integrates scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction. By integrating the text-generation capabilities of large language models with advanced retrieval techniques for high-quality biomedical scientific evidence, CER effectively mitigates the risk of hallucinations, ensuring that generated outputs are grounded in verifiable, evidence-based sources. Evaluations on expert-annotated datasets (HealthFC, BioASQ-7b, SciFact) demonstrate state-of-the-art performance and promising cross-dataset generalization. Code and data are released for transparency and reproducibility: https://github.com/PRAISELab-PicusLab/CER



Abstract:Despite the huge and continuous advances in computational linguistics, the lack of annotated data for Named Entity Recognition (NER) is still a challenging issue, especially in low-resource languages and when domain knowledge is required for high-quality annotations. Recent findings in NLP show the effectiveness of cloze-style questions in enabling language models to leverage the knowledge they acquired during the pre-training phase. In our work, we propose a simple and intuitive adaptation of Pattern-Exploiting Training (PET), a recent approach which combines the cloze-questions mechanism and fine-tuning for few-shot learning: the key idea is to rephrase the NER task with patterns. Our approach achieves considerably better performance than standard fine-tuning and comparable or improved results with respect to other few-shot baselines without relying on manually annotated data or distant supervision on three benchmark datasets: NCBI-disease, BC2GM and a private Italian biomedical corpus.