Abstract: To mitigate the threat of misinformation, multimodal manipulation localization has garnered growing attention. However, current methods rely on costly and time-consuming fine-grained annotations, such as patch/token-level annotations. This paper proposes a novel framework named Coupling Implicit and Explicit Cues (CIEC), which achieves multimodal weakly-supervised manipulation localization for image-text pairs using only coarse-grained image/sentence-level annotations. It comprises two branches: image-based and text-based weakly-supervised localization. For the former, we devise the Textual-guidance Refine Patch Selection (TRPS) module, which integrates forgery cues from both visual and textual perspectives to lock onto suspicious regions with the aid of spatial priors; it is followed by background silencing and spatial contrast constraints that suppress interference from irrelevant areas. For the latter, we devise the Visual-deviation Calibrated Token Grounding (VCTG) module, which focuses on meaningful content words and leverages relative visual bias to assist token localization; it is followed by asymmetric sparse and semantic consistency constraints that mitigate label noise and ensure reliability. Extensive experiments demonstrate the effectiveness of our CIEC, yielding results comparable to fully supervised methods on several evaluation metrics.
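To make the weakly-supervised setting concrete, the sketch below shows, under stated assumptions, how text-guided patch scoring can be trained with only an image-level manipulated/pristine label (MIL-style top-k pooling). All class names, dimensions, the gating scheme, and the top-k choice are illustrative assumptions, not the authors' TRPS implementation.

```python
# Hypothetical sketch: text-guided patch scoring supervised only by a
# coarse image-level label (weak supervision), not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedPatchSelector(nn.Module):
    def __init__(self, dim=256, topk=8):
        super().__init__()
        self.topk = topk
        self.vis_proj = nn.Linear(dim, dim)   # project patch features
        self.txt_proj = nn.Linear(dim, dim)   # project sentence feature
        self.score_head = nn.Linear(dim, 1)   # per-patch forgery score

    def forward(self, patch_feats, text_feat):
        # patch_feats: (B, N, D) visual patch tokens; text_feat: (B, D) sentence embedding
        v = self.vis_proj(patch_feats)                      # (B, N, D)
        t = self.txt_proj(text_feat).unsqueeze(1)           # (B, 1, D)
        fused = v * torch.sigmoid(t)                        # text-gated patch cues (assumed form)
        scores = self.score_head(fused).squeeze(-1)         # (B, N) patch-level evidence
        # Aggregate the most suspicious patches into an image-level prediction,
        # so only a coarse image-level label is needed for training.
        topk_scores, _ = scores.topk(self.topk, dim=1)
        image_logit = topk_scores.mean(dim=1)               # (B,)
        return image_logit, scores

# Toy usage with random features and image-level labels
if __name__ == "__main__":
    model = TextGuidedPatchSelector()
    patches = torch.randn(2, 196, 256)
    sentence = torch.randn(2, 256)
    logit, patch_scores = model(patches, sentence)
    label = torch.tensor([1.0, 0.0])                        # 1 = manipulated image
    loss = F.binary_cross_entropy_with_logits(logit, label)
    loss.backward()
```

At inference, the per-patch scores serve as the localization map, which is the sense in which coarse labels can still yield region-level predictions.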
Abstract: Multimodal fake news detection aims to automatically identify real or fake news, thereby mitigating the adverse effects caused by such misinformation. Although prevailing approaches have demonstrated their effectiveness, challenges persist in cross-modal feature fusion and refinement for classification. To address this, we present a residual-aware compensation network with multi-granularity constraints (RaCMC) for fake news detection, which aims to sufficiently interact and fuse cross-modal features while amplifying the differences between real and fake news. First, a multiscale residual-aware compensation module is designed to interact with and fuse features at different scales, ensuring both the consistency and exclusivity of feature interaction and thus acquiring high-quality features. Second, a multi-granularity constraints module is implemented to limit the distributions of both the news as a whole and the image-text pairs within the news, thus amplifying the differences between real and fake news at the news and feature levels. Finally, a dominant feature fusion reasoning module is developed to comprehensively evaluate news authenticity from the perspectives of both consistency and inconsistency. Experiments on three public datasets, Weibo17, Politifact, and GossipCop, demonstrate the superiority of the proposed method.
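The following minimal sketch illustrates one plausible reading of residual-aware cross-modal compensation: each modality is corrected with the residual (difference) to the other modality before fusion and classification. Layer names, the two "scales" realized as parallel projections, and all dimensions are assumptions for illustration, not the RaCMC implementation.

```python
# Hypothetical sketch of residual-aware cross-modal compensation fusion;
# an illustrative reading of the idea, not the paper's actual architecture.
import torch
import torch.nn as nn

class ResidualCompensationFusion(nn.Module):
    def __init__(self, dim=256, num_scales=2):
        super().__init__()
        self.img_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_scales)])
        self.txt_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_scales)])
        self.classifier = nn.Linear(2 * dim * num_scales, 2)  # real / fake logits

    def forward(self, img_feat, txt_feat):
        # img_feat, txt_feat: (B, D) global features from image / text encoders
        fused = []
        for ip, tp in zip(self.img_proj, self.txt_proj):
            i, t = ip(img_feat), tp(txt_feat)
            # Residual-aware compensation (assumed form): each modality absorbs
            # a bounded correction from its cross-modal residual.
            i_comp = i + (t - i).tanh()
            t_comp = t + (i - t).tanh()
            fused.extend([i_comp, t_comp])
        return self.classifier(torch.cat(fused, dim=-1))     # (B, 2)

# Toy usage
if __name__ == "__main__":
    model = ResidualCompensationFusion()
    logits = model(torch.randn(4, 256), torch.randn(4, 256))
    print(logits.shape)  # torch.Size([4, 2])
```

The multi-granularity constraints described in the abstract would be added as extra loss terms over the compensated features and the final news representation; they are omitted here to keep the sketch self-contained.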