Yatai Ji

Grounded 3D-Aware Spatial Vision-Language Modeling

May 28, 2026

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

Apr 27, 2026

How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

Apr 09, 2026

Towards Autonomous UAV Visual Object Search in City Space: Benchmark and Agentic Methodology

May 14, 2025

DLW-CI: A Dynamic Likelihood-Weighted Cooperative Infotaxis Approach for Multi-Source Search in Urban Environments Using Consumer Drone Networks

Apr 19, 2025

CityEQA: A Hierarchical LLM Agent on Embodied Question Answering Benchmark in City Space

Feb 20, 2025

OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization

Dec 19, 2024

Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Dec 19, 2024

IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

Jul 10, 2024

PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

Jun 20, 2024