Title: A Survey of Vision-Language Pre-Trained Models

Feb 18, 2022

Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao

Abstract: As the Transformer has evolved, pre-trained models have advanced at a breakneck pace in recent years and now dominate the mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to Vision-and-Language (V-L) learning and improve performance on downstream tasks has become a focus of multimodal learning. In this paper, we review recent progress in Vision-Language Pre-Trained Models (VL-PTMs). As the core content, we first briefly introduce several ways to encode raw images and texts into single-modal embeddings before pre-training. Then, we dive into the mainstream architectures that VL-PTMs use to model the interaction between text and image representations. We further present widely used pre-training tasks, after which we introduce some common downstream tasks. We finally conclude the paper and present some promising research directions. Our survey aims to provide multimodal researchers with a synthesis of, and pointers to, related research.

* Under review
