Abstract:As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.
Abstract:Music profoundly enhances video production by improving quality, engagement, and emotional resonance, sparking growing interest in video-to-music generation. Despite recent advances, existing approaches remain limited in specific scenarios or undervalue the visual dynamics. To address these limitations, we focus on tackling the complexity of dynamics and resolving temporal misalignment between video and music representations. To this end, we propose DyViM, a novel framework to enhance dynamics modeling for video-to-music generation. Specifically, we extract frame-wise dynamics features via a simplified motion encoder inherited from optical flow methods, followed by a self-attention module for aggregation within frames. These dynamic features are then incorporated to extend existing music tokens for temporal alignment. Additionally, high-level semantics are conveyed through a cross-attention mechanism, and an annealing tuning strategy benefits to fine-tune well-trained music decoders efficiently, therefore facilitating seamless adaptation. Extensive experiments demonstrate DyViM's superiority over state-of-the-art (SOTA) methods.




Abstract:Despite the recent emergence of video captioning models, how to generate vivid, fine-grained video descriptions based on the background knowledge (i.e., long and informative commentary about the domain-specific scenes with appropriate reasoning) is still far from being solved, which however has great applications such as automatic sports narrative. In this paper, we present GOAL, a benchmark of over 8.9k soccer video clips, 22k sentences, and 42k knowledge triples for proposing a challenging new task setting as Knowledge-grounded Video Captioning (KGVC). Moreover, we conduct experimental adaption of existing methods to show the difficulty and potential directions for solving this valuable and applicable task.