Ronghang Hu

SAM 2: Segment Anything in Images and Videos
Aug 01, 2024

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
Jan 02, 2023

UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding
Dec 01, 2022

Scaling Language-Image Pre-training via Masking
Dec 01, 2022

Exploring Long-Sequence Masked Autoencoders
Oct 13, 2022

FLAVA: A Foundational Language And Vision Alignment Model
Dec 08, 2021

Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer
Feb 22, 2021

Worldsheet: Wrapping the World in a 3D Sheet for View Synthesis from a Single Image
Dec 17, 2020

TextCaps: a Dataset for Image Captioning with Reading Comprehension
Mar 24, 2020

Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
Dec 05, 2019