Title: COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

Dec 02, 2024

Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata

Abstract: Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.
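The two kinds of loss the abstract mentions can be illustrated with a minimal sketch. The abstract does not give the exact formulation used by COSMOS, so the code below is an assumption throughout: it shows a generic CLIP-style symmetric contrastive loss over global image and text embeddings, and a generic teacher-student self-distillation loss of the kind common in self-supervised learning. The function names, the temperature values, and the toy inputs are all hypothetical, and the cross-attention module and text-cropping strategy are not modeled.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired embeddings.

    Pair k is (image_embs[k], text_embs[k]); every other combination in the
    batch serves as a negative. The temperature is an illustrative value.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same idea over the columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a teacher and a student distribution.

    Assumed setup (not taken from the abstract): the teacher scores one view
    (e.g. a global view), the student scores another (e.g. a local crop), and
    the student is trained to match the sharper teacher distribution.
    """
    t_logp = log_softmax([x / teacher_temp for x in teacher_logits])
    s_logp = log_softmax([x / student_temp for x in student_logits])
    return -sum(math.exp(t) * s for t, s in zip(t_logp, s_logp))


# Toy usage: matched pairs should score a lower contrastive loss than
# deliberately mismatched ones.
embs = [[1.0, 0.0], [0.0, 1.0]]
matched = contrastive_loss(embs, embs)
mismatched = contrastive_loss(embs, [embs[1], embs[0]])
```

In this sketch the contrastive term only compares one global embedding per image and per caption, which is the "global nature" the abstract identifies as a limitation; the distillation term is where global and local views of either modality could be related to one another.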
