Title: CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics

Mar 31, 2026

Andrew Jeong, Jaemin Kim, Sebin Lee, Sung-Eui Yoon

Abstract: Robotic manipulation involves kinematic and semantic transitions that are inherently coupled through the underlying actions. However, existing approaches plan within either semantic or latent space without explicitly aligning these cross-modal transitions. To address this, we propose CLaD, a framework that models how proprioceptive and semantic states jointly evolve under actions, using asymmetric cross-attention that allows kinematic transitions to query semantic ones. CLaD predicts grounded latent foresights via self-supervised objectives with EMA target encoders and auxiliary reconstruction losses, which prevent representation collapse while anchoring predictions to observable states. The predicted foresights are modulated with observations to condition a diffusion policy for action generation. On the LIBERO-LONG benchmark, CLaD achieves a 94.7% success rate, competitive with large VLAs while using significantly fewer parameters.
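The two mechanisms named in the abstract, asymmetric cross-attention and an EMA target encoder, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head formulation, the token counts, the feature width, and the decay value are all assumptions chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_cross_attention(kin, sem, w_q, w_k, w_v):
    """Single-head cross-attention in one direction only (hypothetical helper).

    Kinematic tokens form the queries; semantic tokens supply keys and values,
    so information flows from the semantic stream into the kinematic one.
    """
    q = kin @ w_q                               # (n_kin, d)
    k = sem @ w_k                               # (n_sem, d)
    v = sem @ w_v                               # (n_sem, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (n_kin, n_sem)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ v, weights

def ema_update(target, online, decay=0.99):
    """Move target parameters toward the online ones (decay is an assumed value)."""
    return {name: decay * target[name] + (1.0 - decay) * online[name]
            for name in target}

# Toy inputs: 4 kinematic tokens and 6 semantic tokens of width 8 (assumed sizes).
rng = np.random.default_rng(0)
d = 8
kin = rng.normal(size=(4, d))
sem = rng.normal(size=(6, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
fused, weights = asymmetric_cross_attention(kin, sem, w_q, w_k, w_v)

# Toy EMA step on a single parameter tensor.
online = {"w": np.ones((2, 2))}
target = {"w": np.zeros((2, 2))}
target = ema_update(target, online, decay=0.9)
```

In this sketch the attention is asymmetric because only the kinematic stream issues queries; the semantic stream is read but never updated by it. The EMA step keeps the target parameters as a slowly moving average of the online ones, a common way to provide stable prediction targets in self-supervised training.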

* Project page: https://andrewwwj.github.io/clad
