Mutian Tong

PointAction: 3D Points as Universal Action Representations for Robot Control

Jun 02, 2026

Mutian Tong, Han Jiang, Qiao Feng, Lingjie Liu, Jiatao Gu

Abstract: Video-Action Models (VAMs) leverage the broad visual dynamics captured by pre-trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB-only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point-based 4D modeling. PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task-relevant scene geometry. These point dynamics serve as a structured, embodiment-agnostic action interface, which a diffusion-based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB-only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.

* Project page: https://oriontmt.github.io/pointaction/

Spatiotemporally Consistent Indoor Lighting Estimation with Diffusion Priors

Aug 11, 2025

Mutian Tong, Rundi Wu, Changxi Zheng

Abstract: Indoor lighting estimation from a single image or video remains a challenge due to its highly ill-posed nature, especially when the lighting condition of the scene varies spatially and temporally. We propose a method that estimates, from an input video, a continuous light field describing the spatiotemporally varying lighting of the scene. We leverage 2D diffusion priors to optimize this light field, which is represented as an MLP. To enable zero-shot generalization to in-the-wild scenes, we fine-tune a pre-trained image diffusion model to predict lighting at multiple locations by jointly inpainting multiple chrome balls as light probes. We evaluate our method on indoor lighting estimation from a single image or video and show superior performance over the compared baselines. Most importantly, we highlight results on spatiotemporally consistent lighting estimation from in-the-wild videos, which is rarely demonstrated in previous work.

* SIGGRAPH '25: ACM SIGGRAPH 2025 Conference Papers, Article 107, pages 1-11, July 2025
* 11 pages. Accepted to SIGGRAPH 2025 as a Conference Paper
