Xuehai He

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Jun 12, 2024

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

May 30, 2024

FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

May 08, 2024

Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

Oct 14, 2023

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

Oct 05, 2023

LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

May 24, 2023

Discriminative Diffusion Models as Few-shot Vision and Language Learners

May 18, 2023

Multimodal Graph Transformer for Multimodal Question Answering

Apr 30, 2023

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Dec 09, 2022

ComCLIP: Training-Free Compositional Image and Text Matching

Nov 25, 2022