Video Understanding


UniSurg: A Video-Native Foundation Model for Universal Understanding of Surgical Videos

Feb 05, 2026
Via arXiv

Diffusion-aided Extreme Video Compression with Lightweight Semantics Guidance

Feb 05, 2026
Via arXiv

RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference

Feb 05, 2026
Via arXiv

Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization

Feb 05, 2026
Via arXiv

VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding

Feb 04, 2026
Via arXiv

VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents

Feb 04, 2026
Via arXiv

Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection

Feb 05, 2026
Via arXiv

OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Feb 04, 2026
Via arXiv

E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching

Feb 05, 2026
Via arXiv

SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM

Feb 03, 2026
Via arXiv