Abstract:Multimodal out-of-context news is a type of misinformation in which the image is used outside of its original context. Many existing works have leveraged multimodal large language models (MLLMs) for detecting out-of-context news. However, observing the limited zero-shot performance of smaller MLLMs, they generally require label-rich fine-tuning and/or expensive API calls to GPT models to improve the performance, which is impractical in low-resource scenarios. In contrast, we aim to improve the performance of small MLLMs in a more label-efficient and cost-effective manner. To this end, we first prompt multiple teacher MLLMs to generate both label predictions and corresponding rationales, which collectively serve as the teachers' knowledge. We then introduce a two-stage knowledge distillation framework to transfer this knowledge to a student MLLM. In Stage 1, we apply LoRA fine-tuning to the student model using all training data. In Stage 2, we further fine-tune the student model using both LoRA fine-tuning and DPO on the data points where teachers' predictions conflict. This two-stage strategy reduces annotation costs and helps the student model uncover subtle patterns in more challenging cases. Experimental results demonstrate that our approach achieves state-of-the-art performance using less than 10% labeled data.
Abstract:Multimodal out-of-context news is a common type of misinformation on online media platforms. This involves posting a caption, alongside an invalid out-of-context news image. Reflecting its importance, researchers have developed models to detect such misinformation. However, a common limitation of these models is that they only consider the scenario where pre-labeled data is available for each domain, failing to address the out-of-context news detection on unlabeled domains (e.g., unverified news on new topics or agencies). In this work, we therefore focus on domain adaptive out-of-context news detection. In order to effectively adapt the detection model to unlabeled news topics or agencies, we propose ConDA-TTA (Contrastive Domain Adaptation with Test-Time Adaptation) which applies contrastive learning and maximum mean discrepancy (MMD) to learn the domain-invariant feature. In addition, it leverages target domain statistics during test-time to further assist domain adaptation. Experimental results show that our approach outperforms baselines in 5 out of 7 domain adaptation settings on two public datasets, by as much as 2.93% in F1 and 2.08% in accuracy.