Xizhou Zhu

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Aug 05, 2024
Via arXiv

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

Jul 22, 2024
Via arXiv

TVR-Ranking: A Dataset for Ranked Video Moment Retrieval with Imprecise Queries

Jul 09, 2024
Via arXiv

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Jun 13, 2024
Via arXiv

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Jun 12, 2024
Via arXiv

OmniCorpus: An Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Jun 12, 2024
Via arXiv

Needle In A Multimodal Haystack

Jun 11, 2024
Via arXiv

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Jun 11, 2024
Via arXiv

Learning 1D Causal Visual Representation with De-focus Attention Networks

Jun 06, 2024
Via arXiv

Parameter-Inverted Image Pyramid Networks

Jun 06, 2024
Via arXiv