Zhe Gan

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Jul 02, 2024

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Jul 01, 2024

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models

Apr 11, 2024

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Apr 08, 2024

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Mar 22, 2024

How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts

Feb 20, 2024

InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models

Dec 21, 2023

Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

Nov 27, 2023

From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions

Oct 11, 2023

Ferret: Refer and Ground Anything Anywhere at Any Granularity

Oct 11, 2023