Hanqing Wang

DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation

Aug 14, 2023
Hanqing Wang, Wei Liang, Luc Van Gool, Wenguan Wang

VLN-CE is a recently released embodied task in which AI agents must navigate a freely traversable environment to reach a distant target location, given language instructions. It poses great challenges due to the huge space of possible strategies. Driven by the belief that the ability to anticipate the consequences of future actions is crucial for the emergence of intelligent and interpretable planning behavior, we propose DREAMWALKER, a world-model-based VLN-CE agent. The world model is built to summarize the visual, topological, and dynamic properties of the complicated continuous environment into a discrete, structured, and compact representation. DREAMWALKER can simulate and evaluate possible plans entirely in this internal abstract world before executing costly actions. As opposed to existing model-free VLN-CE agents, which simply make greedy decisions in the real world and thus easily fall into shortsighted behaviors, DREAMWALKER is able to plan strategically through large numbers of "mental experiments." Moreover, the imagined future scenarios reflect our agent's intention, making its decision-making process more transparent. Extensive experiments and ablation studies on the VLN-CE dataset confirm the effectiveness of the proposed approach and outline fruitful directions for future work.
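To make the planning idea concrete, below is a minimal, hypothetical sketch of mental planning with a discrete world model: candidate plans are rolled out entirely in an abstract state space and scored before any real action is taken. The ToyWorldModel, AbstractState, and mental_planning names, the random rollout policy, and the scoring function are illustrative assumptions, not DREAMWALKER's actual implementation.

```python
# A minimal, hypothetical sketch of "mental planning" with a discrete world model:
# candidate plans are rolled out entirely in an abstract state space and scored
# before any real action is taken. All names and logic here are illustrative.
import random
from dataclasses import dataclass

@dataclass
class AbstractState:
    node_id: int          # discrete node in the imagined topological map
    progress: float       # estimated instruction progress in [0, 1]

class ToyWorldModel:
    """Stands in for the learned world model: predicts the next abstract state
    and scores how well a rolled-out plan matches the instruction."""
    def step(self, state: AbstractState, action: int) -> AbstractState:
        return AbstractState(node_id=state.node_id * 4 + action,
                             progress=min(1.0, state.progress + random.uniform(0.0, 0.2)))

    def score(self, state: AbstractState) -> float:
        return state.progress

def mental_planning(model, start, num_plans=50, horizon=5, num_actions=4):
    """Simulate candidate plans in the imagined world; return the best first action."""
    best_action, best_value = None, float("-inf")
    for _ in range(num_plans):
        plan = [random.randrange(num_actions) for _ in range(horizon)]
        state = start
        for a in plan:                      # roll out entirely "in the head"
            state = model.step(state, a)
        value = model.score(state)
        if value > best_value:
            best_value, best_action = value, plan[0]
    return best_action

if __name__ == "__main__":
    action = mental_planning(ToyWorldModel(), AbstractState(node_id=0, progress=0.0))
    print("first real action to execute:", action)
```

In DREAMWALKER, the rolled-out states summarize visual, topological, and dynamic properties of the environment rather than the toy two-field state used here.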

* Accepted at ICCV 2023; Project page: https://github.com/hanqingwangai/Dreamwalker 

ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

Apr 07, 2023
Dong An, Hanqing Wang, Wenguan Wang, Zun Wang, Yan Huang, Keji He, Liang Wang

Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It is becoming increasingly important in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we address a more practical yet challenging counterpart setting: vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. This allows the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then executed by an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over the prior state of the art on the R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.
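As a rough illustration of the two-level decomposition described above, the sketch below builds a topological map online from predicted waypoints, plans over it with a standard shortest-path search, and falls back to a trial-and-error heading adjustment at the low level. The TopoMap class, the Dijkstra planner, and the retry offsets are assumptions for illustration, not ETPNav's actual modules.

```python
# Hypothetical sketch: online topological map + high-level planning + trial-and-error
# low-level control. Names and data structures are illustrative assumptions.
import math
import heapq

class TopoMap:
    def __init__(self):
        self.nodes, self.edges = {}, {}      # id -> (x, y), id -> {neighbor_id: cost}

    def add_waypoint(self, node_id, pos, neighbors=()):
        self.nodes[node_id] = pos
        self.edges.setdefault(node_id, {})
        for n in neighbors:                   # connect to already-mapped nodes
            d = math.dist(pos, self.nodes[n])
            self.edges[node_id][n] = d
            self.edges.setdefault(n, {})[node_id] = d

    def shortest_path(self, src, dst):
        """Dijkstra over the evolving topological graph (assumes dst is reachable)."""
        dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                break
            for v, w in self.edges.get(u, {}).items():
                if d + w < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + w, u
                    heapq.heappush(heap, (d + w, v))
        path, u = [dst], dst
        while u != src:
            u = prev[u]
            path.append(u)
        return path[::-1]

def trial_and_error_step(try_move, heading, offsets=(0.0, 0.3, -0.3, 0.6, -0.6)):
    """Low-level control: if the straight heading is blocked, retry nearby headings."""
    for off in offsets:
        if try_move(heading + off):
            return heading + off
    return None  # all attempts blocked; replan at the high level

if __name__ == "__main__":
    m = TopoMap()
    m.add_waypoint("start", (0.0, 0.0))
    m.add_waypoint("w1", (1.0, 0.0), neighbors=["start"])
    m.add_waypoint("goal", (1.0, 1.0), neighbors=["w1"])
    print(m.shortest_path("start", "goal"))
```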

* Project page: https://github.com/MarSaKi/ETPNav 

Multilingual Sentence Transformer as A Multilingual Word Aligner

Jan 28, 2023
Weikang Wang, Guanhua Chen, Hanqing Wang, Yue Han, Yun Chen

Multilingual pretrained language models (mPLMs) have shown their effectiveness in multilingual word alignment induction. However, these methods usually start from mBERT or XLM-R. In this paper, we investigate whether the multilingual sentence Transformer LaBSE is a strong multilingual word aligner. This idea is non-trivial as LaBSE is trained to learn language-agnostic sentence-level embeddings, while the alignment extraction task requires the more fine-grained word-level embeddings to be language-agnostic. We demonstrate that vanilla LaBSE outperforms other mPLMs currently used for the alignment task, and then propose to finetune LaBSE on parallel corpora for further improvement. Experimental results on seven language pairs show that our best aligner outperforms previous state-of-the-art models of all varieties. In addition, our aligner supports different language pairs in a single model, and even achieves new state-of-the-art results on zero-shot language pairs that do not appear in the finetuning process.
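For readers unfamiliar with embedding-based alignment extraction, the sketch below shows the common recipe of taking a cosine-similarity matrix between token embeddings and keeping mutually best (bidirectional argmax) pairs. The random arrays stand in for token-level LaBSE hidden states; this illustrates the general extraction step, not the paper's code.

```python
# Minimal sketch of word-alignment extraction from contextual embeddings via
# bidirectional argmax over a cosine-similarity matrix. The random embeddings
# below are placeholders for token-level LaBSE hidden states.
import numpy as np

def extract_alignments(src_emb, tgt_emb):
    """src_emb: (m, d), tgt_emb: (n, d) token embeddings -> set of (i, j) links."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                                  # (m, n) cosine similarities
    fwd = sim.argmax(axis=1)                           # best target for each source token
    bwd = sim.argmax(axis=0)                           # best source for each target token
    # keep only mutually-best pairs (intersection of the two directions)
    return {(i, int(j)) for i, j in enumerate(fwd) if bwd[j] == i}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src, tgt = rng.normal(size=(5, 768)), rng.normal(size=(6, 768))
    print(sorted(extract_alignments(src, tgt)))
```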

* Published at Findings of EMNLP 2022 

Towards Versatile Embodied Navigation

Oct 30, 2022
Hanqing Wang, Wei Liang, Luc Van Gool, Wenguan Wang

With the emergence of varied visual navigation tasks (e.g., image-/object-/audio-goal and vision-language navigation) that specify the target in different ways, the community has made appealing advances in training specialized agents that handle individual navigation tasks well. Given the abundance of embodied navigation tasks and task-specific solutions, we address a more fundamental question: can we learn a single powerful agent that masters not one but multiple navigation tasks concurrently? First, we propose VXN, a large-scale 3D dataset that instantiates four classic navigation tasks in standardized, continuous, and audiovisual-rich environments. Second, we propose Vienna, a versatile embodied navigation agent that simultaneously learns to perform the four navigation tasks with one model. Building upon a fully attentive architecture, Vienna formulates various navigation tasks as a unified parse-and-query procedure: the target description, augmented with four task embeddings, is comprehensively interpreted into a set of diversified goal vectors, which are refined as navigation progresses and used as queries to retrieve supportive context from episodic history for decision making. This enables the reuse of knowledge across navigation tasks with varying input domains/modalities. We empirically demonstrate that, compared with learning each visual navigation task individually, our multitask agent achieves comparable or even better performance with reduced complexity.
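A hypothetical sketch of the parse-and-query idea follows: task-conditioned goal queries attend over episodic history to retrieve supportive context, which then drives an action head. All dimensions, module names, and the pooling/action head are illustrative assumptions rather than Vienna's actual architecture.

```python
# Hypothetical sketch of parse-and-query: goal queries conditioned on a task
# embedding attend over episodic history to retrieve context for decision making.
import torch
import torch.nn as nn

class ParseAndQueryPolicy(nn.Module):
    def __init__(self, d_model=256, num_tasks=4, num_goal_queries=4, num_actions=6):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, d_model)       # image/object/audio/VLN
        self.goal_queries = nn.Parameter(torch.randn(num_goal_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.action_head = nn.Linear(d_model, num_actions)

    def forward(self, target_desc, history, task_id):
        # target_desc: (B, T, d) encoded goal description; history: (B, H, d) episodic memory
        task = self.task_embed(task_id).unsqueeze(1)              # (B, 1, d)
        queries = self.goal_queries.unsqueeze(0).expand(history.size(0), -1, -1)
        queries = queries + task + target_desc.mean(dim=1, keepdim=True)
        context, _ = self.cross_attn(queries, history, history)   # retrieve supportive context
        return self.action_head(context.mean(dim=1))              # (B, num_actions) logits

if __name__ == "__main__":
    policy = ParseAndQueryPolicy()
    logits = policy(torch.randn(2, 10, 256), torch.randn(2, 30, 256), torch.tensor([0, 3]))
    print(logits.shape)  # torch.Size([2, 6])
```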

* Accepted to NeurIPS 2022; Code: https://github.com/hanqingwangai/VXN 

Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation

Mar 30, 2022
Hanqing Wang, Wei Liang, Jianbing Shen, Luc Van Gool, Wenguan Wang

Since the rise of vision-language navigation (VLN), great progress has been made in instruction following -- building a follower to navigate environments under the guidance of instructions. However, far less attention has been paid to the inverse task: instruction generation -- learning a speaker to generate grounded descriptions for navigation routes. Existing VLN methods train a speaker independently and often treat it as a data augmentation tool to strengthen the follower, ignoring rich cross-task relations. Here we describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each: the follower judges whether the speaker-created instruction explains the original navigation route correctly, and vice versa. Without the need for aligned instruction-path pairs, such a cycle-consistent learning scheme is complementary to task-specific training targets defined on labeled data, and can also be applied over unlabeled paths (sampled without paired instructions). Another agent, called the creator, is added to generate counterfactual environments: it greatly changes current scenes yet leaves the items that are vital for executing the original instructions unchanged. More informative training scenes are thereby synthesized, and the three agents compose a powerful VLN learning system. Extensive experiments on a standard benchmark show that our approach improves the performance of various follower models and produces accurate navigation instructions.
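The cycle-consistency signal on unlabeled paths can be pictured with the toy sketch below: a speaker describes a path, a follower re-executes the description, and the mismatch between the original and reconstructed paths is minimized. The linear ToySpeaker/ToyFollower modules and the MSE objective are stand-ins for illustration only; the paper's speaker, follower, and creator are full VLN models.

```python
# Toy sketch of cycle-consistent training on unlabeled paths:
# speak (path -> instruction), follow (instruction -> path), minimize the mismatch.
import torch
import torch.nn as nn

class ToySpeaker(nn.Module):           # path features -> "instruction" embedding
    def __init__(self, d=64):
        super().__init__()
        self.net = nn.Linear(d, d)
    def forward(self, path_feat):
        return self.net(path_feat)

class ToyFollower(nn.Module):          # instruction embedding -> predicted path features
    def __init__(self, d=64):
        super().__init__()
        self.net = nn.Linear(d, d)
    def forward(self, instr):
        return self.net(instr)

speaker, follower = ToySpeaker(), ToyFollower()
opt = torch.optim.Adam(list(speaker.parameters()) + list(follower.parameters()), lr=1e-3)

unlabeled_paths = torch.randn(32, 64)          # paths sampled without paired instructions
for step in range(100):
    instr = speaker(unlabeled_paths)           # speak: path -> instruction
    recon = follower(instr)                    # follow: instruction -> path
    cycle_loss = nn.functional.mse_loss(recon, unlabeled_paths)
    opt.zero_grad()
    cycle_loss.backward()
    opt.step()
print("final cycle-consistency loss:", float(cycle_loss))
```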

* Accepted to CVPR 2022 

Distributed Expectation Propagation Detection for Cell-Free Massive MIMO

Aug 21, 2021
Hengtao He, Hanqing Wang, Xianghao Yu, Jun Zhang, S. H. Song, Khaled B. Letaief

In cell-free massive MIMO networks, an efficient distributed detection algorithm is of significant importance. In this paper, we propose a distributed expectation propagation (EP) detector for cell-free massive MIMO. The detector is composed of two modules: a nonlinear module at the central processing unit (CPU) and a linear module at each access point (AP). The turbo principle in iterative decoding is utilized to compute and pass extrinsic information between the modules. An analytical framework is then provided to characterize the asymptotic performance of the proposed EP detector with a large number of antennas. Simulation results show that the proposed method outperforms existing distributed detectors in terms of bit-error rate.
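To show the distributed split between the APs and the CPU, here is a heavily simplified sketch: each AP applies local linear processing and forwards compact statistics, and the CPU fuses them and applies a nonlinear symbol decision. Note that this uses one-shot LMMSE fusion plus hard QPSK slicing rather than the paper's iterative EP message passing; all dimensions and variable names are assumptions.

```python
# Heavily simplified sketch of the distributed structure only: local linear
# processing at each AP, fusion plus a nonlinear symbol decision at the CPU.
# This is plain LMMSE fusion with hard slicing, NOT the full EP detector.
import numpy as np

rng = np.random.default_rng(0)
K, L, N, sigma2 = 4, 3, 8, 0.05          # users, APs, antennas per AP, noise power
qpsk = np.array([1+1j, 1-1j, -1+1j, -1-1j]) / np.sqrt(2)

x = qpsk[rng.integers(0, 4, K)]                                   # transmitted symbols
H = [rng.normal(size=(N, K)) + 1j * rng.normal(size=(N, K)) for _ in range(L)]
y = [H[l] @ x + np.sqrt(sigma2 / 2) * (rng.normal(size=N) + 1j * rng.normal(size=N))
     for l in range(L)]

# Linear module at each AP: local Gram matrix and matched-filter output.
A = sum(H[l].conj().T @ H[l] for l in range(L))
b = sum(H[l].conj().T @ y[l] for l in range(L))

# Fusion + LMMSE estimate at the CPU, then a nonlinear (slicing) step.
x_mmse = np.linalg.solve(A + sigma2 * np.eye(K), b)
x_hat = qpsk[np.argmin(np.abs(x_mmse[:, None] - qpsk[None, :]), axis=1)]
print("symbol errors:", int(np.sum(x_hat != x)))
```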

* 6 pages, 5 figures, 2 tables, Accepted by IEEE Globecom 2021 

PEARL: Parallelized Expert-Assisted Reinforcement Learning for Scene Rearrangement Planning

May 10, 2021
Hanqing Wang, Zan Wang, Wei Liang, Lap-Fai Yu

Scene Rearrangement Planning (SRP) is a recently proposed interior scene task. Previous work defines the action space of this task with handcrafted coarse-grained actions, which are inflexible for transforming scene arrangements and intractable to deploy in practice. Additionally, this new task lacks realistic indoor scene rearrangement data to feed popular data-hungry learning approaches and to support quantitative evaluation. To address these problems, we propose a fine-grained action definition for SRP and introduce a large-scale scene rearrangement dataset. We also propose a novel learning paradigm to efficiently train an agent through self-play, without any prior knowledge. The agent trained via our paradigm achieves superior performance on the introduced dataset compared to the baseline agents. We provide a detailed analysis of the design of our approach in our experiments.

* 7 pages, 4 figures 

Structured Scene Memory for Vision-Language Navigation

Mar 05, 2021
Hanqing Wang, Wenguan Wang, Wei Liang, Caiming Xiong, Jianbing Shen

Recently, numerous algorithms have been developed to tackle the problem of vision-language navigation (VLN), i.e., enabling an agent to navigate 3D environments by following linguistic instructions. However, current VLN agents simply store their past experiences/observations as latent states in recurrent networks, failing to capture environment layouts and to perform long-term planning. To address these limitations, we propose an architecture called Structured Scene Memory (SSM). It is compartmentalized enough to accurately memorize the percepts during navigation. It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment. SSM has a collect-read controller that adaptively collects information to support current decision making and mimics iterative algorithms for long-range reasoning. As SSM provides a complete action space, i.e., all the navigable places on the map, a frontier-exploration-based decision-making strategy is introduced to enable efficient and global planning. Experimental results on two VLN datasets (i.e., R2R and R4R) show that our method achieves state-of-the-art performance on several metrics.
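The frontier-exploration decision strategy can be sketched as follows: every navigable but unvisited node of the graph memory is a candidate action, each candidate is scored, and the agent plans a path to the best one through visited nodes. The graph contents and the placeholder scorer are illustrative assumptions, not SSM's learned components.

```python
# Hypothetical sketch of frontier-based global decision making over a graph memory.
from collections import deque

def frontier_decision(graph, visited, score_fn, start):
    """graph: node -> neighbors; frontier = navigable nodes not yet visited.
    Returns the best frontier node and a path to it through visited nodes."""
    frontier = {n for v in visited for n in graph.get(v, []) if n not in visited}
    if not frontier:
        return None, []
    best = max(frontier, key=score_fn)
    # BFS restricted to visited nodes (plus the chosen frontier node) to get a path.
    queue, parent = deque([start]), {start: None}
    while queue:
        u = queue.popleft()
        if u == best:
            break
        for n in graph.get(u, []):
            if n not in parent and (n in visited or n == best):
                parent[n] = u
                queue.append(n)
    path, u = [], best
    while u is not None:
        path.append(u)
        u = parent[u]
    return best, path[::-1]

if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "E"], "D": ["B"], "E": ["C"]}
    visited = {"A", "B", "C"}
    # placeholder scorer: pretend instruction grounding favors frontier node "E"
    best, path = frontier_decision(graph, visited,
                                   lambda n: {"D": 0.5, "E": 0.9}.get(n, 0.0), "A")
    print(best, path)   # E ['A', 'C', 'E']
```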

* Accepted at CVPR 2021; Implementation will be available at https://github.com/HanqingWangAI/SSM-VLN 

Active Visual Information Gathering for Vision-Language Navigation

Aug 19, 2020
Hanqing Wang, Wenguan Wang, Tianmin Shu, Wei Liang, Jianbing Shen

Vision-language navigation (VLN) is the task in which an agent must carry out navigational instructions inside photo-realistic environments. One of the key challenges in VLN is how to navigate robustly by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment. Agents trained by current approaches typically suffer from this issue and consequently struggle to avoid random and inefficient actions at every step. In contrast, when humans face such a challenge, they can still maintain robust navigation by actively exploring the surroundings to gather more information and thus make more confident navigation decisions. This work draws inspiration from human navigation behavior and endows an agent with an active information-gathering ability for a more intelligent vision-language navigation policy. To achieve this, we propose an end-to-end framework for learning an exploration policy that decides i) when and where to explore, ii) what information is worth gathering during exploration, and iii) how to adjust the navigation decision after the exploration. The experimental results show that promising exploration strategies emerge from training, which leads to a significant boost in navigation performance. On the R2R challenge leaderboard, our agent obtains promising results in all three VLN settings, i.e., single run, pre-exploration, and beam search.
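A toy sketch of the when/where-to-explore loop is given below: if the navigation policy is uncertain about its candidate directions, the agent first gathers extra observations and then re-makes the decision. The confidence threshold, the choice of where to explore, and the feature update are toy assumptions rather than the paper's learned exploration policy.

```python
# Toy sketch of active information gathering: explore when uncertain, fold the
# gathered information back in, then adjust the navigation decision.
import torch
import torch.nn as nn

class ToyNavPolicy(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.score = nn.Linear(d, 1)
    def forward(self, candidate_feats):          # (num_candidates, d) -> action distribution
        return torch.softmax(self.score(candidate_feats).squeeze(-1), dim=0)

def navigate_step(policy, candidate_feats, gather_fn, conf_threshold=0.5):
    probs = policy(candidate_feats)
    if probs.max() < conf_threshold:                       # "when to explore"
        explore_idx = int(probs.argmin())                  # "where to explore" (toy choice)
        extra = gather_fn(explore_idx)                     # gather information there
        candidate_feats = candidate_feats + extra          # fold it back into the features
        probs = policy(candidate_feats)                    # adjust the navigation decision
    return int(probs.argmax())

if __name__ == "__main__":
    torch.manual_seed(0)
    policy = ToyNavPolicy()
    feats = torch.randn(4, 32)
    action = navigate_step(policy, feats, gather_fn=lambda i: torch.randn(4, 32) * 0.1)
    print("chosen candidate:", action)
```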

* Accepted to ECCV 2020 (updated with improved performance in the Pre-Explore and Beam Search settings); website: https://github.com/HanqingWangAI/Active_VLN 