Visual Understanding

OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

Feb 05, 2026
Via arXiv

SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs

Feb 05, 2026
Via arXiv

VLN-Pilot: Large Vision-Language Model as an Autonomous Indoor Drone Operator

Feb 05, 2026
Via arXiv

Diffusion-aided Extreme Video Compression with Lightweight Semantics Guidance

Feb 05, 2026
Via arXiv

SDR-CIR: Semantic Debias Retrieval Framework for Training-Free Zero-Shot Composed Image Retrieval

Feb 05, 2026
Via arXiv

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

Feb 04, 2026
Via arXiv

VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation

Feb 05, 2026
Via arXiv

UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos

Feb 05, 2026
Via arXiv

JSynFlow: Japanese Synthesised Flowchart Visual Question Answering Dataset built with Large Language Models

Feb 05, 2026
Via arXiv

Visualizing the loss landscapes of physics-informed neural networks

Feb 05, 2026
Via arXiv