Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yaoming Zhou

PM-Nav: Priori-Map Guided Embodied Navigation in Functional Buildings

Mar 10, 2026

Jiang Gao, Xiangyu Dong, Haozhou Li, Haoran Zhao, Yaoming Zhou, Xiaoguang Ma

Abstract:Existing language-driven embodied navigation paradigms face challenges in functional buildings (FBs) with highly similar features, as they lack the ability to effectively utilize priori spatial knowledge. To tackle this issue, we propose a Priori-Map Guided Embodied Navigation (PM-Nav), wherein environmental maps are transformed into navigation-friendly semantic priori-maps, a hierarchical chain-of-thought prompt template with an annotation priori-map is designed to enable precise path planning, and a multi-model collaborative action output mechanism is built to accomplish positioning decisions and execution control for navigation planning. Comprehensive tests using a home-made FB dataset show that the PM-Nav obtains average improvements of 511\% and 1175\%, and 650\% and 400\% over the SG-Nav and the InstructNav in simulation and real-world, respectively. These tremendous boosts elucidate the great potential of using the PM-Nav as a backbone navigation framework for FBs.

* 6 pages, 4 figures

Via

Access Paper or Ask Questions

CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval

Mar 09, 2026

Haozhou Li, Xiangyu Dong, Huiyan Jiang, Yaoming Zhou, Xiaoguang Ma

Abstract:Although large language models (LLMs) are introduced into vision-and-language navigation (VLN) to improve instruction comprehension and generalization, existing LLM- based VLN lacks the ability to selectively recall and use relevant priori experiences to help navigation tasks, limiting their performance in long-horizon and unfamiliar scenarios. In this work, we propose CMMR-VLN (Continual Multimodal Memory Retrieval based VLN), a VLN framework that endows LLM agents with structured memory and reflection capabilities. Specifically, the CMMR-VLN constructs a multimodal experi- ence memory indexed by panoramic visual images and salient landmarks to retrieve relevant experiences during navigation, introduces a retrieved-augmented generation pipeline to mimick how experienced human navigators leverage priori knowledge, and incorporates a reflection-based memory update strategy that selectively stores complete successful paths and the key initial mistake in failure cases. Comprehensive tests illustrate average success rate improvements of 52.9%, 20.9% and 20.9%, and 200%, 50% and 50% over the NavGPT, the MapGPT, and the DiscussNav in simulation and real tests, respectively eluci- dating the great potential of the CMMR-VLN as a backbone VLN framework.

Via

Access Paper or Ask Questions

ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation

Mar 09, 2026

Haoyu Tong, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, Yaoming Zhou, Chenghao Lin

Abstract:Existing aerial Vision-Language Navigation (VLN) methods predominantly adopt a detection-and-planning pipeline, which converts open-vocabulary detections into discrete textual scene graphs. These approaches are plagued by inadequate spatial reasoning capabilities and inherent linguistic ambiguities. To address these bottlenecks, we propose a Visual-Spatial Reasoning (ViSA) enhanced framework for aerial VLN. Specifically, a triple-phase collaborative architecture is designed to leverage structured visual prompting, enabling Vision-Language Models (VLMs) to perform direct reasoning on image planes without the need for additional training or complex intermediate representations. Comprehensive evaluations on the CityNav benchmark demonstrate that the ViSA-enhanced VLN achieves a 70.3\% improvement in success rate compared to the fully trained state-of-the-art (SOTA) method, elucidating its great potential as a backbone for aerial VLN systems.

* 8 pages

Via

Access Paper or Ask Questions

SE-VLN: A Self-Evolving Vision-Language Navigation Framework Based on Multimodal Large Language Models

Jul 17, 2025

Xiangyu Dong, Haoran Zhao, Jiang Gao, Haozhou Li, Xiaoguang Ma, Yaoming Zhou, Fuhai Chen, Juan Liu

Abstract:Recent advances in vision-language navigation (VLN) were mainly attributed to emerging large language models (LLMs). These methods exhibited excellent generalization capabilities in instruction understanding and task reasoning. However, they were constrained by the fixed knowledge bases and reasoning abilities of LLMs, preventing fully incorporating experiential knowledge and thus resulting in a lack of efficient evolutionary capacity. To address this, we drew inspiration from the evolution capabilities of natural agents, and proposed a self-evolving VLN framework (SE-VLN) to endow VLN agents with the ability to continuously evolve during testing. To the best of our knowledge, it was the first time that an multimodal LLM-powered self-evolving VLN framework was proposed. Specifically, SE-VLN comprised three core modules, i.e., a hierarchical memory module to transfer successful and failure cases into reusable knowledge, a retrieval-augmented thought-based reasoning module to retrieve experience and enable multi-step decision-making, and a reflection module to realize continual evolution. Comprehensive tests illustrated that the SE-VLN achieved navigation success rates of 57% and 35.2% in unseen environments, representing absolute performance improvements of 23.9% and 15.0% over current state-of-the-art methods on R2R and REVERSE datasets, respectively. Moreover, the SE-VLN showed performance improvement with increasing experience repository, elucidating its great potential as a self-evolving agent framework for VLN.

Via

Access Paper or Ask Questions

Agent as Cerebrum, Controller as Cerebellum: Implementing an Embodied LMM-based Agent on Drones

Nov 25, 2023

Haoran Zhao, Fengxing Pan, Huqiuyue Ping, Yaoming Zhou

Figure 1 for Agent as Cerebrum, Controller as Cerebellum: Implementing an Embodied LMM-based Agent on Drones

Figure 2 for Agent as Cerebrum, Controller as Cerebellum: Implementing an Embodied LMM-based Agent on Drones

Figure 3 for Agent as Cerebrum, Controller as Cerebellum: Implementing an Embodied LMM-based Agent on Drones

Figure 4 for Agent as Cerebrum, Controller as Cerebellum: Implementing an Embodied LMM-based Agent on Drones

Abstract:In this study, we present a novel paradigm for industrial robotic embodied agents, encapsulating an 'agent as cerebrum, controller as cerebellum' architecture. Our approach harnesses the power of Large Multimodal Models (LMMs) within an agent framework known as AeroAgent, tailored for drone technology in industrial settings. To facilitate seamless integration with robotic systems, we introduce ROSchain, a bespoke linkage framework connecting LMM-based agents to the Robot Operating System (ROS). We report findings from extensive empirical research, including simulated experiments on the Airgen and real-world case study, particularly in individual search and rescue operations. The results demonstrate AeroAgent's superior performance in comparison to existing Deep Reinforcement Learning (DRL)-based agents, highlighting the advantages of the embodied LMM in complex, real-world scenarios.

* 17 pages, 12 figures

Via

Access Paper or Ask Questions