Zhe Gan

Prompting GPT-3 To Be Reliable

Oct 17, 2022

An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling

Sep 04, 2022

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

Jul 20, 2022

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Jun 15, 2022

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Jun 14, 2022

GIT: A Generative Image-to-text Transformer for Vision and Language

May 31, 2022

K-LITE: Learning Transferable Visual Models with External Knowledge

Apr 20, 2022

Injecting Semantic Concepts into End-to-End Image Captioning

Dec 09, 2021

MLP Architectures for Vision-and-Language Modeling: An Empirical Study

Dec 08, 2021

SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning

Nov 25, 2021