Xinlong Wang

CI-VID: A Coherent Interleaved Text-Video Dataset

Jul 02, 2025

Unified Vision-Language-Action Model

Jun 24, 2025

OmniGen2: Exploration to Advanced Multimodal Generation

Jun 23, 2025

MorphSAM: Learning the Morphological Prompts from Atlases for Spine Image Segmentation

Jun 16, 2025

Audio-Sync Video Generation with Multi-Stream Temporal Control

Jun 09, 2025

End-to-End Vision Tokenizer Tuning

May 15, 2025

Image Difference Grounding with Natural Language

Apr 02, 2025

Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities

Apr 02, 2025

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Feb 10, 2025

Autoregressive Video Generation without Vector Quantization

Dec 18, 2024