A Survey of Vision-Language Pre-Trained Models

Feb 18, 2022
Yifan Du, Zikang Liu, Junyi Li, Wayne Xin Zhao

As the Transformer architecture has evolved, pre-trained models have advanced at a breakneck pace in recent years and now dominate mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to Vision-and-Language (V-L) learning and improve performance on downstream tasks has become a focus of multimodal learning. In this paper, we review recent progress in Vision-Language Pre-Trained Models (VL-PTMs). As the core content, we first briefly introduce several ways to encode raw images and texts into single-modal embeddings before pre-training. Then, we dive into the mainstream architectures of VL-PTMs for modeling the interaction between text and image representations. We further present widely used pre-training tasks, after which we introduce some common downstream tasks. We finally conclude the paper and present some promising research directions. Our survey aims to provide multimodal researchers with a synthesis of, and pointers to, related research.

* Under review  
