A novel cognition-inspired, agnostic framework is proposed for building maps in mobile robotics that are efficient in terms of image matching and retrieval for solving the Visual Place Recognition (VPR) problem. A dataset, 'ESSEX3IN1', is also presented to demonstrate the significantly enhanced performance of state-of-the-art VPR techniques when combined with the proposed framework.
Convolutional Neural Networks (CNNs) have recently been shown to excel at performing visual place recognition under changing appearance and viewpoint. Previously, place recognition has been improved by intelligently selecting relevant spatial keypoints within a convolutional layer and also by selecting the optimal layer to use. Rather than extracting features out of a particular layer, or a particular set of spatial keypoints within a layer, we propose the extraction of features using a subset of the channel dimensionality within a layer. Each feature map learns to encode a different set of weights that activate for different visual features within the set of training images. We propose a method of calibrating a CNN-based visual place recognition system, which selects the subset of feature maps that best encodes the visual features that are consistent between two different appearances of the same location. Using just 50 calibration images, all collected at the beginning of the current environment, we demonstrate a significant and consistent recognition improvement across multiple layers for two different neural networks. We evaluate our proposal on three datasets with different types of appearance changes - afternoon to morning, winter to summer and night to day. Additionally, the dimensionality reduction approach improves the computational processing speed of the recognition system.
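As a rough illustration of the calibration idea in the abstract above, the sketch below scores each feature map by how little its activations change between two appearances of the same calibration places, and keeps the most consistent channels. The scoring rule (mean absolute difference per channel), the function names `select_feature_maps` and `describe`, and the array shapes are assumptions made for this example; the abstract does not specify the actual selection criterion.

```python
import numpy as np

def select_feature_maps(ref_feats, query_feats, keep):
    """Rank channels by cross-appearance consistency and keep the best ones.

    ref_feats, query_feats: arrays of shape (N, C, H, W) holding one layer's
    activations for N calibration places seen under two appearances.
    Returns the sorted indices of the `keep` most consistent channels.
    """
    ref = np.asarray(ref_feats, dtype=float)
    qry = np.asarray(query_feats, dtype=float)
    # Hypothetical score: mean absolute activation difference per channel.
    diff = np.abs(ref - qry).mean(axis=(0, 2, 3))
    return np.sort(np.argsort(diff)[:keep])

def describe(feats, channels):
    """Flatten only the selected channels of a (C, H, W) activation tensor
    into a single L2-normalised descriptor."""
    vec = np.asarray(feats, dtype=float)[channels].ravel()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```

Because the descriptor is built from only `keep` of the `C` channels, its length shrinks proportionally, which is one way to read the speed-up mentioned in the abstract.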
K-Means clustering still plays an important role in many computer vision problems. While the conventional Lloyd method, which alternates between centroid update and cluster assignment, is primarily used in practice, it may converge to a solution with empty clusters. Furthermore, some applications may require the clusters to satisfy a specific set of constraints, e.g., cluster sizes, must-link/cannot-link. Several methods have been introduced to solve constrained K-Means clustering. Due to the non-convex nature of K-Means, however, existing approaches may result in sub-optimal solutions that poorly approximate the true clusters. In this work, we provide a new perspective to tackle this problem. In particular, we recast constrained K-Means as a binary optimization problem and propose a novel optimization scheme to search for feasible solutions in the binary domain. This approach allows us to solve constrained K-Means where multiple types of constraints can be simultaneously enforced. Experimental results on synthetic and real datasets show that our method provides better clustering accuracy with faster runtime compared to several commonly used techniques.
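To make the binary view of the abstract above concrete, the following sketch runs plain Lloyd iterations while keeping the cluster assignment as a one-hot matrix X, and then checks a cluster-size constraint on that matrix. It is only the unconstrained baseline: the proposed optimization scheme is not reproduced here, and the helper names `lloyd` and `satisfies_sizes` are hypothetical.

```python
import numpy as np

def lloyd(points, centroids, iters=20):
    """Plain Lloyd iterations with a one-hot (binary) assignment matrix X.

    Returns X of shape (N, K), where X[i, j] = 1 iff point i is assigned to
    cluster j, together with the final centroids.
    """
    pts = np.asarray(points, dtype=float)
    cen = np.asarray(centroids, dtype=float).copy()
    k = len(cen)
    X = np.zeros((len(pts), k), dtype=int)
    for _ in range(iters):
        # Assignment step: squared distances between points and centroids.
        d2 = ((pts[:, None, :] - cen[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        X = np.eye(k, dtype=int)[labels]
        sizes = X.sum(axis=0)
        # Update step: an empty cluster simply keeps its old centroid.
        for j in range(k):
            if sizes[j] > 0:
                cen[j] = pts[labels == j].mean(axis=0)
    return X, cen

def satisfies_sizes(X, min_size, max_size):
    """Check a cluster-size constraint on the column sums of X."""
    sizes = X.sum(axis=0)
    return bool(np.all(sizes >= min_size) and np.all(sizes <= max_size))
```

With a badly placed initial centroid, one column of X can sum to zero, which is the empty-cluster failure the abstract mentions. Size constraints become bounds on the column sums of X.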
Robotic and animal mapping systems share many of the same objectives and challenges, but differ in one key aspect: where much of the research in robotic mapping has focused on solving the data association problem, the grid cell neurons underlying maps in the mammalian brain appear to intentionally break data association by encoding many locations with a single grid cell neuron. One potential benefit of this intentional aliasing is that both map storage and computational requirements grow sub-linearly with environment size, which we demonstrated in a previous proof-of-concept study that detected and encoded mutually complementary co-prime pattern frequencies in the visual map data. In this research, we solve several of the key theoretical and practical limitations of that prototype model and achieve significantly better sub-linear storage growth, a factor reduction in storage requirements per map location, scalability to large datasets on standard compute equipment and improved robustness to environments with visually challenging appearance change. These improvements are achieved through several innovations, including a flexible user-driven choice mechanism for the periodic patterns underlying the new encoding method, a parallelized chunking technique that splits the map into sub-sections processed in parallel, and a novel feature selection approach that selects only the image information most relevant to the encoded temporal patterns. We evaluate our techniques on two large benchmark datasets, comparing against the previous state-of-the-art system, and provide a detailed analysis of system performance with respect to parameters such as required precision performance and the number of cyclic patterns encoded.
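The arithmetic behind co-prime periodic encoding can be sketched as follows. This toy example is an assumption-laden illustration, not the system described above: it encodes an integer location index by its residues modulo several pairwise co-prime periods, so each stored cell is shared by many locations, and recovers the index with the Chinese remainder theorem. The function names are hypothetical, and the visual pattern detection used in the actual work is not modelled.

```python
from math import gcd, prod

def are_pairwise_coprime(periods):
    """True if no two periods share a common factor greater than 1."""
    return all(gcd(a, b) == 1
               for i, a in enumerate(periods) for b in periods[i + 1:])

def encode(location, periods):
    """Represent a location by its phase within each cyclic pattern."""
    return tuple(location % p for p in periods)

def decode(code, periods):
    """Recover the location in [0, prod(periods)) via the Chinese
    remainder theorem. Requires pairwise co-prime periods."""
    if not are_pairwise_coprime(periods):
        raise ValueError("periods must be pairwise co-prime")
    total = prod(periods)
    loc = 0
    for residue, p in zip(code, periods):
        m = total // p
        # pow(m, -1, p) is the modular inverse of m modulo p (Python 3.8+).
        loc += residue * m * pow(m, -1, p)
    return loc % total

def storage_cells(periods):
    """Number of distinct (pattern, phase) cells that must be stored."""
    return sum(periods)

def capacity(periods):
    """Number of locations that can be told apart."""
    return prod(periods)
```

With periods (3, 5, 7), 15 stored cells distinguish 105 locations. Adding a further co-prime period multiplies the capacity but only adds to the storage, which is the sub-linear growth the abstract refers to.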
There has been a recent emergence of sampling-based techniques for estimating epistemic uncertainty in deep neural networks. While these methods can be applied to classification or semantic segmentation tasks by simply averaging samples, this is not the case for object detection, where detection sample bounding boxes must be accurately associated and merged. A weak merging strategy can significantly degrade the performance of the detector and yield an unreliable uncertainty measure. This paper provides the first in-depth investigation of the effect of different association and merging strategies. We compare different combinations of three spatial and two semantic affinity measures with four clustering methods for MC Dropout with a Single Shot Multi-Box Detector. Our results show that the correct choice of affinity-clustering combinations can greatly improve the effectiveness of the classification and spatial uncertainty estimation and the resulting object detection performance. We base our evaluation on a new mix of datasets that emulate near open-set conditions (semantically similar unknown classes), distant open-set conditions (semantically dissimilar unknown classes) and the common closed-set conditions (only known classes).
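The association-and-merging step discussed above might look like the sketch below. It uses intersection-over-union (IoU) as a spatial affinity and a simple greedy sequential grouping against each cluster's first member. Both choices, along with the 0.5 threshold and the function names, are assumptions for illustration and are not claimed to be one of the combinations evaluated in the paper.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def merge_samples(boxes, scores, thresh=0.5):
    """Group detections pooled from several stochastic forward passes.

    boxes:  list of [x1, y1, x2, y2]
    scores: list of per-class probability vectors, one per box
    Each detection joins the first cluster whose seed box overlaps it with
    IoU >= thresh; otherwise it starts a new cluster.
    """
    clusters = []
    for i, box in enumerate(boxes):
        for members in clusters:
            if iou(box, boxes[members[0]]) >= thresh:
                members.append(i)
                break
        else:
            clusters.append([i])
    merged = []
    for members in clusters:
        b = np.asarray([boxes[i] for i in members], dtype=float)
        s = np.asarray([scores[i] for i in members], dtype=float)
        merged.append({
            "box": b.mean(axis=0),      # merged box
            "box_var": b.var(axis=0),   # spread as a spatial uncertainty cue
            "probs": s.mean(axis=0),    # averaged class distribution
            "n": len(members),
        })
    return merged
```

A poorly chosen threshold either splits samples of one object into several clusters or fuses neighbouring objects, which illustrates why the choice of affinity and clustering method matters for the resulting uncertainty estimates.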
Current approaches to object-oriented SLAM lack the ability to incorporate prior knowledge of the scene geometry, such as the expected global orientation of objects. We overcome this limitation by proposing a geometric factor that constrains the global orientation of objects in the map, depending on the objects' semantics. This new geometric factor is a first example of how semantics can inform and improve geometry in object-oriented SLAM. We implement the geometric factor for the recently proposed QuadricSLAM that represents landmarks as dual quadrics. The factor probabilistically models the quadrics' major axes to be either perpendicular to or aligned with the direction of gravity, depending on their semantic class. Our experiments on simulated and real-world datasets show that using the proposed factors to incorporate prior knowledge improves both the trajectory and landmark quality.
In this paper, we use 2D object detections from multiple views to simultaneously estimate a 3D quadric surface for each object and localize the camera position. We derive a SLAM formulation that uses dual quadrics as 3D landmark representations, exploiting their ability to compactly represent the size, position and orientation of an object, and show how 2D object detections can directly constrain the quadric parameters via a novel geometric error formulation. We develop a sensor model for object detectors that addresses the challenge of partially visible objects, and demonstrate how to jointly estimate the camera pose and constrained dual quadric parameters in factor graph based SLAM with a general perspective camera.
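The relationship between a dual quadric landmark and a 2D bounding box can be sketched with standard projective geometry: a dual quadric Q* projects to the dual conic C* = P Q* Pᵀ, and the axis-aligned box around that conic follows from its tangent lines. The code below is a minimal illustration under the assumptions that the ellipsoid lies fully in front of the camera and that P is a 3x4 projection matrix. The helper names are hypothetical, and the geometric error formulation and sensor model from the paper are not reproduced.

```python
import numpy as np

def ellipsoid_dual_quadric(centre, radii, rotation=None):
    """Build the 4x4 dual quadric of an ellipsoid from its pose and radii."""
    Z = np.eye(4)
    Z[:3, :3] = np.eye(3) if rotation is None else np.asarray(rotation, float)
    Z[:3, 3] = np.asarray(centre, dtype=float)
    D = np.diag([radii[0] ** 2, radii[1] ** 2, radii[2] ** 2, -1.0])
    return Z @ D @ Z.T

def project_bbox(Q_star, P):
    """Project a dual quadric and return the box [u_min, v_min, u_max, v_max].

    A vertical line x = u is tangent to the conic when l = (1, 0, -u)
    satisfies l^T C l = 0; solving that quadratic gives the two u limits.
    The v limits follow in the same way from horizontal lines.
    """
    C = P @ Q_star @ P.T
    du = np.sqrt(C[0, 2] ** 2 - C[0, 0] * C[2, 2])
    dv = np.sqrt(C[1, 2] ** 2 - C[1, 1] * C[2, 2])
    us = sorted([(C[0, 2] - du) / C[2, 2], (C[0, 2] + du) / C[2, 2]])
    vs = sorted([(C[1, 2] - dv) / C[2, 2], (C[1, 2] + dv) / C[2, 2]])
    return np.array([us[0], vs[0], us[1], vs[1]])
```

Comparing such a predicted box with a detector's box gives one simple residual that ties the quadric parameters and camera pose to 2D detections.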
Model-free reinforcement learning has recently been shown to be effective at learning navigation policies from complex image input. However, these algorithms tend to require large amounts of interaction with the environment, which can be prohibitively costly to obtain on robots in the real world. We present an approach for efficiently learning goal-directed navigation policies on a mobile robot, from only a single coverage traversal of recorded data. The navigation agent learns an effective policy over a diverse action space in a large heterogeneous environment consisting of more than 2 km of travel, through buildings and outdoor regions that collectively exhibit large variations in visual appearance, self-similarity, and connectivity. We compare pretrained visual encoders that enable precomputation of visual embeddings to achieve a throughput of tens of thousands of transitions per second at training time on a commodity desktop computer, allowing agents to learn from millions of trajectories of experience in a matter of hours. We propose multiple forms of computationally efficient stochastic augmentation to enable the learned policy to generalise beyond these precomputed embeddings, and demonstrate successful deployment of the learned policy on the real robot without fine-tuning, despite environmental appearance differences at test time. The dataset and code required to reproduce these results and apply the technique to other datasets and robots are made publicly available at rl-navigation.github.io/deployable.
Various approaches have been proposed to learn visuo-motor policies for real-world robotic applications. One solution is first learning in simulation then transferring to the real world. In the transfer, most existing approaches need real-world images with labels. However, the labelling process is often expensive or even impractical in many robotic applications. In this paper, we propose an adversarial discriminative sim-to-real transfer approach to reduce the cost of labelling real data. The effectiveness of the approach is demonstrated with modular networks in a table-top object reaching task where a 7 DoF arm is controlled in velocity mode to reach a blue cuboid in clutter through visual observations. The adversarial transfer approach reduced the labelled real data requirement by 50%. Policies can be transferred to real environments with only 93 labelled and 186 unlabelled real images. The transferred visuo-motor policies are robust to novel (not seen in training) objects in clutter and even a moving target, achieving a 97.8% success rate and 1.8 cm control accuracy.
Human visual scene understanding is so remarkable that we are able to recognize a revisited place when entering it from the direction opposite to that of the first visit, even in the presence of extreme variations in appearance. This capability is especially apparent during driving: a human driver can recognize where they are when travelling in the reverse direction along a route for the first time, without having to turn back and look. The difficulty of this problem exceeds any addressed in past appearance- and viewpoint-invariant visual place recognition (VPR) research, in part because large parts of the scene are not commonly observable from opposite directions. Consequently, as shown in this paper, the precision-recall performance of current state-of-the-art viewpoint- and appearance-invariant VPR techniques is orders of magnitude below what would be usable in a closed-loop system. Current engineered solutions predominantly rely on panoramic camera or LIDAR sensing setups: an eminently suitable engineering solution, but one that is clearly very different from how humans navigate, which also has implications for how naturally humans could interact and communicate with the navigation system. In this paper we develop a suite of novel semantic- and appearance-based techniques to enable, for the first time, high-performance place recognition in this challenging scenario. We first propose a novel Local Semantic Tensor (LoST) descriptor of images using the convolutional feature maps from a state-of-the-art dense semantic segmentation network. Then, to verify the spatial semantic arrangement of the top matching candidates, we develop a novel approach for mining semantically-salient keypoint correspondences.