Shuai Bai

Qwen-Image-2.0 Technical Report

May 11, 2026

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

May 05, 2026

GenMask: Adapting DiT for Segmentation via Direct Mask

Mar 25, 2026

Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Mar 18, 2026

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

Feb 15, 2026

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Jan 08, 2026

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Jan 06, 2026

VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference

Dec 18, 2025

Revisiting Multimodal Positional Encoding in Vision-Language Models

Oct 27, 2025

FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Sep 11, 2025