Visual Transcription


MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

Mar 23, 2026
Via arXiv

From Plausibility to Verifiability: Risk-Controlled Generative OCR for Vision-Language Models

Mar 20, 2026
Via arXiv

Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation

Mar 16, 2026
Via arXiv

Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data

Mar 09, 2026
Via arXiv

GLM-OCR Technical Report

Mar 11, 2026
Via arXiv

Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG

Mar 04, 2026
Via arXiv

The USTC-NERCSLIP Systems for the CHiME-9 MCoRec Challenge

Mar 02, 2026
Via arXiv

An Effective Data Augmentation Method by Asking Questions about Scene Text Images

Mar 03, 2026
Via arXiv

Therapist-Robot-Patient Physical Interaction is Worth a Thousand Words: Enabling Intuitive Therapist Guidance via Remote Haptic Control

Feb 25, 2026
Via arXiv

Musical Metamerism with Time-Frequency Scattering

Feb 12, 2026
Via arXiv