Xin Eric Wang

NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

Jul 17, 2024
Via arXiv

Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding

Jun 27, 2024
Via arXiv

VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing

Jun 18, 2024
Via arXiv

Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation

Jun 13, 2024
Via arXiv

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Jun 12, 2024
Via arXiv

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

May 30, 2024
Via arXiv

FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

May 08, 2024
Via arXiv

SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing

Apr 08, 2024
Via arXiv

Muffin or Chihuahua? Challenging Large Vision-Language Models with Multipanel VQA

Jan 29, 2024
Via arXiv

ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models

Oct 09, 2023
Via arXiv