Xuan Dong

Gene

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

Nov 18, 2025
Via arXiv

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

May 29, 2025
Via arXiv

Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency

May 20, 2025
Via arXiv

Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition

Apr 25, 2025
Via arXiv

Consistent Diffusion: Denoising Diffusion Model with Data-Consistent Training for Image Restoration

Dec 17, 2024
Via arXiv

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

Dec 12, 2024
Via arXiv

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Jun 11, 2024
Via arXiv

ReWiTe: Realistic Wide-angle and Telephoto Dual Camera Fusion Dataset via Beam Splitter Camera Rig

Apr 16, 2024
Via arXiv

View Transition based Dual Camera Image Fusion

Dec 18, 2023
Via arXiv

A Pyramid Recurrent Network for Predicting Crowdsourced Speech-Quality Ratings of Real-World Signals

Jul 31, 2020
Via arXiv