Jian Luan

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

Apr 16, 2026

Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

Apr 15, 2026

Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

Mar 31, 2026

ACAVCaps: Enabling large-scale training for fine-grained and diverse audio understanding

Mar 25, 2026

The Interspeech 2026 Audio Encoder Capability Challenge for Large Audio Language Models

Mar 24, 2026

Borderless Long Speech Synthesis

Mar 20, 2026

ExPosST: Explicit Positioning with Adaptive Masking for LLM-Based Simultaneous Machine Translation

Mar 16, 2026

Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

Mar 12, 2026

IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation

Mar 11, 2026

From Ideal to Real: Stable Video Object Removal under Imperfect Conditions

Mar 10, 2026