Yukang Chen

3D Aware Region Prompted Vision Language Model

Sep 16, 2025

Scaling RL to Long Videos

Jul 10, 2025

MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO

May 19, 2025

TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance

Apr 23, 2025

WorldModelBench: Judging Video Generation Models As World Models

Feb 28, 2025

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Dec 12, 2024

NVILA: Efficient Frontier Visual Language Models

Dec 05, 2024

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Dec 05, 2024

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Aug 21, 2024

SEED-Story: Multimodal Long Story Generation with Large Language Model

Jul 11, 2024