Hongxu Yin

Celine

FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

Dec 11, 2025

NVIDIA Nemotron Nano V2 VL

Nov 07, 2025

3D Aware Region Prompted Vision Language Model

Sep 16, 2025

Test-Time Scaling Strategies for Generative Retrieval in Multimodal Conversational Recommendations

Aug 25, 2025

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

Jul 16, 2025

Scaling RL to Long Videos

Jul 10, 2025

CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Apr 17, 2025

Scaling Vision Pre-Training to 4K Resolution

Mar 25, 2025

Token-Efficient Long Video Understanding for Multimodal LLMs

Mar 06, 2025

WorldModelBench: Judging Video Generation Models As World Models

Feb 28, 2025