This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing limitations of existing methods in annotation scalability, data efficiency, long-form modeling, and temporal reasoning, the work makes five contributions: (1) an automatic annotation framework that uses large vision-language models together with a noise-robust contrastive learning objective based on a subtractive angular margin; (2) a parameter-efficient fine-tuning strategy built on "recurrent adapters" that captures temporal dynamics in low-data regimes; (3) the integration of State Space Layers (SSL) for efficient long-form video modeling, supported by two new long-form benchmarks covering egocentric and feature-length content; (4) a contrastive learning framework that explicitly models fine-grained relations between motions and video moments; and (5) a comprehensive empirical study of Large Vision-Language Models (LVLMs) that identifies the visual-language interface as a bottleneck for temporal reasoning and motivates a new "temporal-oriented recipe" for scaling up video understanding. Collectively, these contributions demonstrate that explicit temporal modeling substantially improves a model's ability to represent and reason about the temporally evolving nature of video content.
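To make the first contribution concrete, the sketch below shows one way a subtractive angular margin can be folded into a standard InfoNCE-style contrastive objective for video-text pairs. The function name, margin, and temperature values are illustrative assumptions, not the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def subtractive_margin_contrastive_loss(video_emb, text_emb, margin=0.1, temperature=0.07):
    """Illustrative noise-robust contrastive loss: a subtractive angular margin
    relaxes the positive-pair constraint so weakly aligned (possibly noisy)
    pairs are penalized less harshly than under plain InfoNCE."""
    # L2-normalise both modalities so dot products are cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities; diagonal entries are the positive pairs.
    cos_sim = v @ t.T                                            # (B, B)
    theta = torch.acos(cos_sim.clamp(-1 + 1e-7, 1 - 1e-7))

    # Subtract the margin from the positive-pair angle only, inflating the
    # positive logit relative to an additive-margin (ArcFace-style) penalty.
    pos_mask = torch.eye(v.size(0), dtype=torch.bool, device=v.device)
    logits = torch.where(pos_mask, torch.cos(theta - margin), cos_sim) / temperature

    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric video-to-text and text-to-video InfoNCE terms.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```

Because the margin inflates the positive logit, poorly aligned positives (often the automatically annotated, noisy ones) contribute a smaller gradient than they would under a plain or additive-margin objective, which is the usual intuition behind this kind of noise-robust design.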