Dongfang Liu

DenserNet: Weakly Supervised Visual Localization Using Multi-scale Feature Aggregation

Dec 31, 2020
Dongfang Liu, Yiming Cui, Liqi Yan, Christos Mousas, Baijian Yang, Yingjie Chen

In this work, we introduce a Denser Feature Network (DenserNet) for visual localization. Our work provides three principal contributions. First, we develop a convolutional neural network (CNN) architecture that aggregates feature maps at different semantic levels for image representation. Using denser feature maps, our method can produce more keypoint features and increase image retrieval accuracy. Second, our model is trained end-to-end without pixel-level annotation other than positive and negative GPS-tagged image pairs. We use a weakly supervised triplet ranking loss to learn discriminative features and encourage keypoint feature repeatability for image representation. Finally, our method is computationally efficient because our architecture shares features and parameters during computation. Our method can perform accurate large-scale localization under challenging conditions while remaining within computational constraints. Extensive experimental results indicate that our method sets a new state of the art on four challenging large-scale localization benchmarks and three image retrieval benchmarks.
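As a rough illustration of the two ideas in this abstract, the sketch below aggregates feature maps from three semantic levels of a generic CNN backbone into a denser representation and scores GPS-tagged triplets with a margin-based ranking loss. The ResNet-18 backbone, projection width, and margin value are illustrative assumptions, not the DenserNet architecture itself.

```python
# Minimal sketch of multi-scale feature aggregation plus a weakly supervised
# triplet ranking loss, loosely following the abstract above. The backbone,
# channel widths, and margin are assumptions, not the authors' released model.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class MultiScaleAggregator(nn.Module):
    """Aggregate feature maps from several semantic levels of a CNN backbone."""
    def __init__(self, out_dim=256):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1, self.layer2, self.layer3 = (backbone.layer1,
                                                 backbone.layer2,
                                                 backbone.layer3)
        # 1x1 convs project each level to a common width before fusion.
        self.proj = nn.ModuleList([nn.Conv2d(c, out_dim, 1)
                                   for c in (64, 128, 256)])

    def forward(self, x):
        f1 = self.layer1(self.stem(x))
        f2 = self.layer2(f1)
        f3 = self.layer3(f2)
        size = f1.shape[-2:]
        # Upsample coarser maps and sum with shallower ones -> denser map.
        fused = sum(F.interpolate(p(f), size=size, mode="bilinear",
                                  align_corners=False)
                    for p, f in zip(self.proj, (f1, f2, f3)))
        # Pool into a global descriptor for retrieval-style localization.
        return F.normalize(fused.mean(dim=(-2, -1)), dim=-1)

def weak_triplet_loss(anchor, positive, negatives, margin=0.1):
    """Triplet ranking loss using only GPS-tagged positive/negative images.

    anchor, positive: (B, D) descriptors; negatives: (B, N, D).
    """
    d_pos = (anchor - positive).pow(2).sum(-1)            # (B,)
    d_neg = (anchor.unsqueeze(1) - negatives).pow(2).sum(-1)  # (B, N)
    return F.relu(d_pos.unsqueeze(1) + margin - d_neg).mean()
```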

* In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21) 

A Vector-based Representation to Enhance Head Pose Estimation

Oct 14, 2020
Zhiwen Cao, Zongcheng Chu, Dongfang Liu, Yingjie Chen

This paper proposes to use the three vectors of a rotation matrix as the representation for head pose estimation and develops a new neural network based on the characteristics of this representation. We address two potential issues in current head pose estimation work: 1. Public datasets for head pose estimation annotate data samples with either Euler angles or quaternions. However, both annotations suffer from discontinuity and can therefore cause performance problems in neural network training. 2. Most works report the Mean Absolute Error (MAE) of Euler angles as the measure of performance. We show that MAE may not reflect actual behavior, especially for profile views. To solve these two problems, we propose a new annotation method that uses three vectors to describe head poses and a new metric, Mean Absolute Error of Vectors (MAEV), to assess performance. We also train a new neural network to predict the three vectors under orthogonality constraints. Our proposed method achieves state-of-the-art results on both the AFLW2000 and BIWI datasets. Experiments show that our vector-based annotation method can effectively reduce prediction errors for large pose angles.
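A minimal sketch of the representation and metric described here: a head pose is encoded as the three column vectors of its rotation matrix, and MAEV is taken as the mean angle between corresponding predicted and ground-truth vectors. The intrinsic ZYX Euler convention and this exact reading of the MAEV formula are assumptions based on the abstract, not the paper's code.

```python
# Hedged sketch: rotation-matrix column vectors as a pose representation,
# and MAEV as the mean angular error between matching vectors.
import numpy as np

def euler_to_vectors(yaw, pitch, roll):
    """Rotation matrix from Euler angles (radians, assumed ZYX order);
    its three columns serve as the pose's vector representation."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    R = Rz @ Ry @ Rx
    return [R[:, i] for i in range(3)]  # three orthonormal pose vectors

def maev(pred_vectors, true_vectors):
    """Mean Absolute Error of Vectors: average angle (degrees) between
    corresponding predicted and ground-truth unit vectors."""
    angles = [np.degrees(np.arccos(np.clip(np.dot(p, t), -1.0, 1.0)))
              for p, t in zip(pred_vectors, true_vectors)]
    return float(np.mean(angles))

# Example: a 5-degree yaw error yields a nonzero MAEV even near profile views,
# where Euler-angle MAE can behave inconsistently.
gt = euler_to_vectors(np.radians(85), 0.0, 0.0)
pred = euler_to_vectors(np.radians(90), 0.0, 0.0)
print(maev(pred, gt))
```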

* 10 pages, 8 figures 

Multimodal Aggregation Approach for Memory Vision-Voice Indoor Navigation with Meta-Learning

Sep 01, 2020
Liqi Yan, Dongfang Liu, Yaoxian Song, Changbin Yu

Vision and voice are two vital channels for agents' interaction and learning. In this paper, we present a novel indoor navigation model called Memory Vision-Voice Indoor Navigation (MVV-IN), which receives voice commands and analyzes multimodal information from visual observations in order to enhance the robot's understanding of its environment. We use single RGB images taken by a first-person-view monocular camera and apply a self-attention mechanism to keep the agent focused on key areas. Memory is important for the agent to avoid unnecessarily repeating certain tasks and to adapt adequately to new scenes; we therefore make use of meta-learning. We have experimented with various functional features extracted from visual observations. Comparative experiments show that our method outperforms state-of-the-art baselines.
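The sketch below shows one common form of spatial self-attention of the kind the abstract mentions for keeping the agent focused on key image regions. The single-head, non-local-style design and the channel-reduction factor are illustrative assumptions, not the MVV-IN implementation.

```python
# Minimal self-attention sketch over a CNN feature map, as one plausible
# realization of the attention mechanism the abstract describes.
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Non-local-style self-attention over a feature map of shape (B, C, H, W)."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C/8)
        k = self.key(x).flatten(2)                     # (B, C/8, HW)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        v = self.value(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        # Residual connection: attended regions reweight the input features.
        return x + self.gamma * out
```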

* 8 pages, 6 figures, 2 tables, accepted at 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2020) 

Visual Localization for Autonomous Driving: Mapping the Accurate Location in the City Maze

Aug 13, 2020
Dongfang Liu, Yiming Cui, Xiaolei Guo, Wei Ding, Baijian Yang, Yingjie Chen

Accurate localization is a foundational capability required for autonomous vehicles to accomplish other tasks such as navigation or path planning. It is common practice for vehicles to use GPS to acquire location information. However, GPS can pose severe challenges when vehicles operate in the inner city, where various kinds of structures may shadow the GPS signal and lead to inaccurate location results. To address the localization challenges of urban settings, we propose a novel feature voting technique for visual localization. Unlike conventional front-view-based methods, our approach employs views from three directions (front, left, and right) and thus significantly improves the robustness of location prediction. In our work, we integrate the proposed feature voting method into three state-of-the-art visual localization networks and modify their architectures appropriately so that they can be applied to vehicular operation. Extensive field test results indicate that our approach can predict locations robustly even in challenging inner-city settings. Our research sheds light on using visual localization to help autonomous vehicles find accurate location information in a city maze, within desirable time constraints.
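The abstract does not spell out the voting rule, so the sketch below is only a plausible reading: each of the three directional views yields a location estimate with a confidence score, estimates within a consistency radius support one another, and the best-supported cluster is averaged into the final fix. The hypothetical vote_location helper, the radius, and the confidence weighting are all illustrative assumptions.

```python
# Illustrative multi-view voting over per-view location estimates; the exact
# voting scheme used in the paper is not given, so this is an assumption.
import numpy as np

def vote_location(estimates, radius=5.0):
    """estimates: list of (xy ndarray, confidence) from per-view localizers.

    Each estimate votes for every candidate within `radius` meters; the
    candidate with the highest total confidence wins, and its supporting
    views are confidence-averaged into the final location.
    """
    best_score, best_xy = -1.0, None
    for xy_i, _ in estimates:
        support = [(xy_j, c_j) for xy_j, c_j in estimates
                   if np.linalg.norm(xy_j - xy_i) <= radius]
        score = sum(c for _, c in support)
        if score > best_score:
            best_score = score
            weights = np.array([c for _, c in support])
            points = np.stack([p for p, _ in support])
            best_xy = (weights[:, None] * points).sum(0) / weights.sum()
    return best_xy

# Example: the left view disagrees; the consistent front/right views win.
front = (np.array([120.0, 45.0]), 0.9)
left  = (np.array([300.0, 10.0]), 0.4)
right = (np.array([122.0, 44.0]), 0.8)
print(vote_location([front, left, right]))  # ~[120.9, 44.5]
```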
