Shuai Bai

GenMask: Adapting DiT for Segmentation via Direct Mask

Mar 25, 2026
Via arXiv

Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Mar 18, 2026
Via arXiv

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

Feb 15, 2026
Via arXiv

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Jan 08, 2026
Via arXiv

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Jan 06, 2026
Via arXiv

VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference

Dec 18, 2025
Via arXiv

Revisiting Multimodal Positional Encoding in Vision-Language Models

Oct 27, 2025
Via arXiv

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Sep 11, 2025
Via arXiv

Qwen2.5-Omni Technical Report

Mar 26, 2025
Via arXiv

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

Feb 27, 2025
Via arXiv