Xinlong Wang

Emu3.5: Native Multimodal Models are World Learners

Oct 30, 2025
Via arXiv

Thor: Towards Human-Level Whole-Body Reactions for Intense Contact-Rich Environments

Oct 30, 2025
Via arXiv

BrainMCLIP: Brain Image Decoding with Multi-Layer feature Fusion of CLIP

Oct 22, 2025
Via arXiv

CI-VID: A Coherent Interleaved Text-Video Dataset

Jul 02, 2025
Via arXiv

Unified Vision-Language-Action Model

Jun 24, 2025
Via arXiv

OmniGen2: Exploration to Advanced Multimodal Generation

Jun 23, 2025
Via arXiv

MorphSAM: Learning the Morphological Prompts from Atlases for Spine Image Segmentation

Jun 16, 2025
Via arXiv

Audio-Sync Video Generation with Multi-Stream Temporal Control

Jun 09, 2025
Via arXiv

End-to-End Vision Tokenizer Tuning

May 15, 2025
Via arXiv

Image Difference Grounding with Natural Language

Apr 02, 2025
Via arXiv