Zhicheng Huang

PixelLM: Pixel Reasoning with Large Multimodal Model

Dec 04, 2023

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

May 22, 2023

CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition

Jan 15, 2023

Contrastive Masked Autoencoders are Stronger Vision Learners

Jul 27, 2022

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Apr 08, 2021

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

Apr 02, 2020

Learning Rich Image Region Representation for Visual Question Answering

Oct 29, 2019