Abstract:Streaming video understanding requires processing unbounded video streams with limited memory and computation, posing two key challenges. First, continuously constructing new and evicting old key-value(KV) caches is required for unbounded streams. Secondly, due to the high cost of collecting and training on unbounded streams, models must learn from short sequences while generalizing to long streams. Existing streaming VideoVLLMs fail to scale to unbounded video streams or focus on cache reuse strategies, leaving the impact of cache construction underexplored. In this paper, we propose Decoupled Streaming Cache(DSCache), a training-free cache construction mechanism that adapts pretrained offline models to streaming settings. DSCache maintains a cumulative past KV cache while constructing a separate instant cache on-demand, decoupled from past caches to preserve the informativeness of recent inputs. To enable position extrapolation beyond the training length, DSCache further incorporates a position-agnostic encoding strategy, ensuring KV caches to support unseen positions and preventing position overflow. Experiments on Streaming Video QA benchmarks demonstrate DSCache's state-of-the-art performance, with an average 2.5% accuracy gains over prior methods.
Abstract:Streaming video models should respond the moment an event unfolds, not after the moment has passed. Yet existing online VideoQA benchmarks remain largely retrospective. They pause the video at fixed timestamps, pose questions about current or past events, and score models only at those moments. This protocol leaves streaming predictions untested. To close this gap, we introduce SPOT-Bench, featuring multi-turn proactive queries that evaluate general streaming perception and assistive capabilities required by an always-on, real-time assistant. SPOT-Bench comes with Timeliness-F1, a consolidated metric that measures streaming predictions by their temporal precision and balanced coverage across the entire video. Our benchmark reveals: (i) offline models detect events reliably but spam predictions unprompted; (ii) post-training for silence reduces spamming but induces unresponsiveness; (iii) half of the streaming video expects no response, which we term dead-time - compute spent here does not affect response latency. These findings motivate AsynKV, a training-free streaming adaptation of offline models, that retains their event perception while improving their streaming behavior. AsynKV features a long-short term memory, utilized efficiently by scaling compute during dead-time. It serves as a strong baseline on SPOT-Bench, outperforming existing streaming models, and achieves state-of-the-art on retrospective benchmarks.
Abstract:Multimodal Large Language Models (MLLMs) have advanced open-world action understanding and can be adapted as generative classifiers for closed-set settings by autoregressively generating action labels as text. However, this approach is inefficient, and shared subwords across action labels introduce semantic overlap, leading to ambiguity in generation. In contrast, discriminative classifiers learn task-specific representations with clear decision boundaries, enabling efficient one-step classification without autoregressive decoding. We first compare generative and discriminative classifiers with MLLMs for closed-set action understanding, revealing the superior accuracy and efficiency of the latter. To bridge the performance gap, we design strategies that elevate generative classifiers toward performance comparable with discriminative ones. Furthermore, we show that generative modeling can complement discriminative classifiers, leading to better performance while preserving efficiency. To this end, we propose Generation-Assisted Discriminative~(GAD) classifier for closed-set action understanding. GAD operates only during fine-tuning, preserving full compatibility with MLLM pretraining. Extensive experiments on temporal action understanding benchmarks demonstrate that GAD improves both accuracy and efficiency over generative methods, achieving state-of-the-art results on four tasks across five datasets, including an average 2.5% accuracy gain and 3x faster inference on our largest COIN benchmark.




Abstract:Procedural activity videos often exhibit a long-tailed action distribution due to varying action frequencies and durations. However, state-of-the-art temporal action segmentation methods overlook the long tail and fail to recognize tail actions. Existing long-tail methods make class-independent assumptions and struggle to identify tail classes when applied to temporal segmentation frameworks. This work proposes a novel group-wise temporal logit adjustment~(G-TLA) framework that combines a group-wise softmax formulation while leveraging activity information and action ordering for logit adjustment. The proposed framework significantly improves in segmenting tail actions without any performance loss on head actions.