Picture for Xinlong Wang

Xinlong Wang

CI-VID: A Coherent Interleaved Text-Video Dataset

Add code
Jul 02, 2025
Viaarxiv icon

Unified Vision-Language-Action Model

Add code
Jun 24, 2025
Viaarxiv icon

OmniGen2: Exploration to Advanced Multimodal Generation

Add code
Jun 23, 2025
Viaarxiv icon

MorphSAM: Learning the Morphological Prompts from Atlases for Spine Image Segmentation

Add code
Jun 16, 2025
Viaarxiv icon

Audio-Sync Video Generation with Multi-Stream Temporal Control

Add code
Jun 09, 2025
Viaarxiv icon

End-to-End Vision Tokenizer Tuning

Add code
May 15, 2025
Viaarxiv icon

Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities

Add code
Apr 02, 2025
Viaarxiv icon

Image Difference Grounding with Natural Language

Add code
Apr 02, 2025
Viaarxiv icon

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Add code
Feb 10, 2025
Figure 1 for EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Figure 2 for EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Figure 3 for EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Figure 4 for EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Viaarxiv icon

Autoregressive Video Generation without Vector Quantization

Add code
Dec 18, 2024
Viaarxiv icon