Xiaoyi Dong

Think Visually, Reason Textually: Vision-Language Synergy in ARC

Nov 19, 2025

Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

Oct 31, 2025

SPARK: Synergistic Policy And Reward Co-Evolving Framework

Sep 26, 2025

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Sep 26, 2025

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning

Sep 26, 2025

CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning

Aug 27, 2025

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience

Aug 06, 2025

Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models

Aug 01, 2025

ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

Jun 24, 2025

Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings

Jun 05, 2025