Title: VISTA: Generative Visual Imagination for Vision-and-Language Navigation

May 17, 2025

Yanjia Huang, Mingyang Wu, Renjie Li, Zhengzhong Tu

Abstract: Vision-and-Language Navigation (VLN) tasks agents with locating specific objects in unseen environments using natural language instructions and visual cues. Many existing VLN approaches follow an 'observe-and-reason' schema: agents observe the environment and choose the next action based on visual observations of their surroundings. These approaches often struggle in long-horizon scenarios because of the limits of immediate observation and the gap between the vision and language modalities. To overcome this, we present VISTA, a novel framework that employs an 'imagine-and-align' navigation strategy. Specifically, we leverage the generative prior of pre-trained diffusion models for dynamic visual imagination conditioned on both local observations and high-level language instructions. A Perceptual Alignment Filter module then grounds these goal imaginations against current observations, guiding an interpretable and structured reasoning process for action selection. Experiments show that VISTA sets new state-of-the-art results on the Room-to-Room (R2R) and RoboTHOR benchmarks, e.g., a +3.6% increase in Success Rate on R2R. Extensive ablation analysis underscores the value of integrating forward-looking imagination, perceptual alignment, and structured reasoning for robust navigation in long-horizon environments.

* 13 pages, 5 figures
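
The 'imagine-and-align' idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the Perceptual Alignment Filter scores or filters candidates, so the sketch assumes that the imagined goal and each candidate view have already been encoded as plain feature vectors, that alignment is measured by cosine similarity, and that poorly aligned candidates are dropped by a fixed threshold. The names `cosine_similarity`, `select_action`, and `min_similarity` are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_action(
    imagined_goal: Sequence[float],
    candidate_views: Dict[str, Sequence[float]],
    min_similarity: float = 0.5,  # assumed threshold, not taken from the paper
) -> Optional[str]:
    """Pick the action whose observed view best matches the imagined goal.

    imagined_goal stands in for an encoded, diffusion-generated goal image;
    candidate_views maps each available action to the encoded view it leads to.
    Returns None when no candidate passes the alignment threshold.
    """
    # Score every candidate view against the imagined goal.
    scores = {
        action: cosine_similarity(imagined_goal, features)
        for action, features in candidate_views.items()
    }
    # Assumed filtering step: discard candidates that align poorly.
    kept = {action: s for action, s in scores.items() if s >= min_similarity}
    if not kept:
        return None
    # Choose the best-aligned remaining candidate.
    return max(kept, key=kept.get)


if __name__ == "__main__":
    imagined = [1.0, 0.0, 1.0]
    views = {
        "forward": [0.9, 0.1, 0.8],
        "left": [0.0, 1.0, 0.0],
        "right": [-1.0, 0.0, -1.0],
    }
    print(select_action(imagined, views))
```

In this toy setup only the "forward" view is close to the imagined goal, so it is the one selected. A real system would replace the hand-written vectors with learned image features and would likely combine the alignment score with the structured reasoning step the abstract mentions.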
