Zi-Yi Dou

VaPR -- Vision-language Preference alignment for Reasoning

Oct 02, 2025
Via arXiv

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Sep 30, 2025
Via arXiv

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Sep 19, 2025
Via arXiv

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

Oct 10, 2024
Via arXiv

Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning

Aug 07, 2024
Via arXiv

Reflection-Reinforced Self-Training for Language Agents

Jun 03, 2024
Via arXiv

Matryoshka Query Transformer for Large Vision-Language Models

May 29, 2024
Via arXiv

Medical Vision-Language Pre-Training for Brain Abnormalities

Apr 27, 2024
Via arXiv

VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models

Apr 22, 2024
Via arXiv

ACQUIRED: A Dataset for Answering Counterfactual Questions In Real-Life Videos

Nov 02, 2023
Via arXiv