Captioning


Knowledge is Power: Advancing Few-shot Action Recognition with Multimodal Semantics from MLLMs

Mar 27, 2026
Via arXiv

Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision

Mar 27, 2026
Via arXiv

The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding

Mar 27, 2026
Via arXiv

MA-Bench: Towards Fine-grained Micro-Action Understanding

Mar 27, 2026
Via arXiv

Label-Free Cross-Task LoRA Merging with Null-Space Compression

Mar 27, 2026
Via arXiv

Generative Score Inference for Multimodal Data

Mar 27, 2026
Via arXiv

Learning to Rank Caption Chains for Video-Text Alignment

Mar 26, 2026
Via arXiv

Unlocking Strong Supervision: A Data-Centric Study of General-Purpose Audio Pre-Training Methods

Mar 26, 2026
Via arXiv

BFMD: A Full-Match Badminton Dense Dataset for Dense Shot Captioning

Mar 26, 2026
Via arXiv

MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning

Mar 26, 2026
Via arXiv