Dense Video Captioning


Dense video captioning is the process of generating textual descriptions for multiple events in a video.

TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos

Add code
May 26, 2025
Viaarxiv icon

Pixel Motion as Universal Representation for Robot Control

Add code
May 12, 2025
Viaarxiv icon

TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

Add code
May 02, 2025
Viaarxiv icon

TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation

Add code
Apr 24, 2025
Viaarxiv icon

LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

Add code
Apr 22, 2025
Viaarxiv icon

MR. Video: "MapReduce" is the Principle for Long Video Understanding

Add code
Apr 22, 2025
Viaarxiv icon

Perception Encoder: The best visual embeddings are not at the output of the network

Add code
Apr 17, 2025
Viaarxiv icon

VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models

Add code
Apr 10, 2025
Viaarxiv icon

Any2Caption:Interpreting Any Condition to Caption for Controllable Video Generation

Add code
Mar 31, 2025
Viaarxiv icon

Watch and Learn: Leveraging Expert Knowledge and Language for Surgical Video Understanding

Add code
Mar 14, 2025
Viaarxiv icon