Reliable localization is crucial for autonomous robots to navigate efficiently and safely. Some navigation methods can plan paths with high localizability (i.e., a high capability of acquiring reliable localization). By following such paths, the robot obtains sensor streams that allow the localization algorithm to produce more accurate location estimates. However, most of these methods require prior knowledge and struggle to adapt to unseen scenarios or dynamic changes. To overcome these limitations, we propose a novel approach for localizability-enhanced navigation via deep reinforcement learning in dynamic human environments. Our proposed planner automatically extracts geometric features from 2D laser data that are helpful for localization. The planner learns to assign different importance to these geometric features and encourages the robot to navigate through areas that are helpful for laser localization. To facilitate the learning of the planner, we suggest two techniques: (1) an augmented state representation that accounts for dynamic changes and the confidence of the localization results, providing more information and allowing the robot to make better decisions; (2) a reward metric that can offer both sparse and dense feedback on behaviors that affect localization accuracy. Our method exhibits significant improvements in lost rate and arrival rate when tested in previously unseen environments.
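A minimal sketch of how such a reward metric could combine dense and sparse feedback, assuming the localization confidence is summarized by a pose covariance matrix; the weights, thresholds, and function name are illustrative and not the paper's actual formulation:

```python
# Hypothetical sketch of a localizability-aware reward: a dense term from the
# change in pose uncertainty plus sparse terms for terminal outcomes. All
# weights are illustrative, not the paper's actual values.
import numpy as np

def localizability_reward(cov_prev, cov_curr, reached_goal, lost,
                          w_dense=0.5, r_goal=10.0, r_lost=-10.0):
    """Return a scalar reward for one control step."""
    # Dense feedback: reward a decrease in pose uncertainty (covariance trace).
    dense = w_dense * (np.trace(cov_prev) - np.trace(cov_curr))
    # Sparse feedback: terminal bonus/penalty depending on the localization outcome.
    sparse = r_goal if reached_goal else (r_lost if lost else 0.0)
    return dense + sparse
```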
Recently, it has become popular to deploy sensors such as LiDARs on the roadside to monitor passing traffic and assist autonomous vehicle perception. Unlike autonomous vehicle systems, roadside sensors are usually affiliated with different subsystems and lack synchronization in both time and space. Calibration is a key technology that allows the central server to fuse data generated by infrastructure at different locations, which can improve the sensing range and detection robustness. Unfortunately, existing calibration algorithms often assume that the LiDARs overlap significantly or that temporal calibration has already been achieved. Since these assumptions do not always hold in the real world, the calibration results from existing algorithms are often unsatisfactory and require human involvement, which incurs high labor costs. In this paper, we propose TrajMatch -- the first system that can automatically calibrate roadside LiDARs in both time and space. The main idea is to calibrate the sensors based on the results of the detection/tracking task rather than on specially extracted features. Furthermore, we propose a mechanism for evaluating calibration parameters that is consistent with our algorithm and demonstrate its effectiveness experimentally; it can also be used to guide parameter iteration across multiple calibrations. Finally, to evaluate the performance of TrajMatch, we collect two datasets: a simulated dataset, LiDARnet-sim 1.0, and a real-world dataset. Experimental results show that TrajMatch can achieve a spatial calibration error of less than 10 cm and a temporal calibration error of less than 1.5 ms.
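A hedged sketch of the core trajectory-based spatial step, assuming that positions of the same tracked objects have already been time-associated across the two LiDARs; the function names and the pre-associated correspondences are our assumptions, not TrajMatch's interface:

```python
# Estimate the rigid transform aligning LiDAR B to LiDAR A from corresponding
# trajectory points, using the SVD-based Kabsch/Umeyama solution. The residual
# can also score candidate temporal offsets when sweeping the time shift.
import numpy as np

def fit_rigid_transform(traj_a, traj_b):
    """traj_a, traj_b: (N, 3) arrays of corresponding trajectory points."""
    mu_a, mu_b = traj_a.mean(axis=0), traj_b.mean(axis=0)
    H = (traj_b - mu_b).T @ (traj_a - mu_a)          # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                               # rotation B -> A (reflection-safe)
    t = mu_a - R @ mu_b                              # translation B -> A
    return R, t

def alignment_error(traj_a, traj_b, R, t):
    """Mean point-to-point residual of the aligned trajectories."""
    return np.linalg.norm(traj_a - (traj_b @ R.T + t), axis=1).mean()
```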
The recent trend for multi-camera 3D object detection is the unified bird's-eye view (BEV) representation. However, directly transforming features extracted from the image-plane view to BEV inevitably results in feature distortion, especially around the objects of interest, making the objects blur into the background. To this end, we propose OA-BEV, a network that can be plugged into a BEV-based 3D object detection framework to bring out the objects by incorporating object-aware pseudo-3D features and depth features. Such features carry information about the objects' positions and 3D structures. First, we explicitly guide the network to learn the depth distribution with object-level supervision from each 3D object's center. Then, we select foreground pixels with a 2D object detector and project them into 3D space for pseudo-voxel feature encoding. Finally, the object-aware depth features and pseudo-voxel features are incorporated into the BEV representation with a deformable attention mechanism. We conduct extensive experiments on the nuScenes dataset to validate the merits of OA-BEV. Our method achieves consistent improvements over BEV-based baselines in terms of both average precision and the nuScenes detection score. Our code will be released.
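An illustrative, simplified sketch of lifting foreground pixels into 3D for pseudo-voxel encoding; the box format, depth source, and grid choices are our assumptions rather than OA-BEV's actual pipeline:

```python
# Back-project pixels inside 2D detection boxes into camera-frame 3D points
# using a predicted per-pixel depth map and the camera intrinsics. The resulting
# points could then be voxelized for pseudo-voxel feature encoding.
import numpy as np

def lift_foreground_pixels(depth_map, boxes_2d, K):
    """depth_map: (H, W) predicted depth; boxes_2d: list of (x1, y1, x2, y2);
    K: (3, 3) camera intrinsics. Returns an (N, 3) array of camera-frame points."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    points = []
    for x1, y1, x2, y2 in boxes_2d:
        us, vs = np.meshgrid(np.arange(x1, x2), np.arange(y1, y2))
        zs = depth_map[vs, us]
        xs = (us - cx) * zs / fx
        ys = (vs - cy) * zs / fy
        points.append(np.stack([xs, ys, zs], axis=-1).reshape(-1, 3))
    return np.concatenate(points, axis=0) if points else np.zeros((0, 3))
```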
Tensor program tuning is a non-convex objective optimization problem for which search-based approaches have proven effective. At the core of search-based approaches lies the design of the cost model. Although deep learning-based cost models perform significantly better than other methods, they still fall short and suffer from the following problems. First, their feature extraction relies heavily on expert-level domain knowledge of hardware architectures. Even so, the extracted features are often unsatisfactory and require separate treatment for CPUs and GPUs. Second, a cost model trained on one hardware platform usually performs poorly on another, a problem we call cross-hardware unavailability. To address these problems, we propose TLP and MTL-TLP. TLP is a deep learning-based cost model that facilitates tensor program tuning. Instead of extracting features from the tensor program itself, TLP extracts features from the schedule primitives. We treat the schedule primitives as a tensor language, so feature extraction becomes a tensor language processing task (hence the name TLP). In this way, predicting the tensor program latency through the cost model is transformed into a natural language processing (NLP) regression task. MTL-TLP combines multi-task learning and TLP to cope with the cross-hardware unavailability problem. We incorporate these techniques into the Ansor framework and conduct detailed experiments. Results show that TLP can speed up the average search time by 9.1X and 3.0X on CPU and GPU workloads, respectively, compared to the state-of-the-art implementation. MTL-TLP can achieve a speed-up of 4.7X and 2.9X on CPU and GPU workloads, respectively, using only 7% of the target-hardware data.
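A toy sketch of the "schedule primitives as a language" idea: tokenized primitive sequences are fed to a Transformer encoder that regresses latency. The vocabulary size, model width, pooling, and the hypothetical `tokenizer` are our illustrative choices, not TLP's actual architecture:

```python
# Regress tensor program latency from tokenized schedule primitives with a
# small Transformer encoder, analogous to an NLP regression task.
import torch
import torch.nn as nn

class LatencyRegressor(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.encoder(self.embed(token_ids))        # contextual token features
        return self.head(x.mean(dim=1)).squeeze(-1)    # predicted latency per program

# Usage sketch: ids = tokenizer(schedule_primitive_strings)  # hypothetical tokenizer
#               loss = nn.functional.mse_loss(model(ids), measured_latency)
```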
In this letter, we propose MAROAM, a millimeter-wave radar-based SLAM framework that employs a two-step feature selection process to build a globally consistent map. Specifically, we first extract feature points from the raw data based on their local geometric properties, filtering out points that violate the principles of millimeter-wave radar imaging. Then, we apply a second, probabilistic round of feature selection by examining how often and how recently each feature point has been detected in the preceding frames. With this two-step feature selection, we establish a globally consistent map for accurate and robust pose estimation as well as other downstream tasks. Finally, we perform loop closure and graph optimization in the back end, further reducing the accumulated drift error. We evaluate the performance of MAROAM on three datasets: the Oxford Radar RobotCar Dataset, the MulRan Dataset, and the Boreas Dataset, covering a variety of experimental settings with different scenery, weather, and road conditions. The experimental results show that the accuracy of MAROAM is 7.95%, 37.0%, and 8.9% higher than the currently best-performing algorithms on these three datasets, respectively. The ablation results also show that our map-based odometry performs 28.6% better than the commonly used scan-to-frames method. We will open-source the algorithm after the paper is accepted.
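A minimal sketch of the second, probabilistic selection step under our own assumptions: features observed often and recently get a higher score and are kept. The scoring formula, half-life, and threshold are illustrative, not MAROAM's exact scheme:

```python
# Keep map features whose recency-weighted detection history exceeds a threshold.
import math

def select_persistent_features(feature_stats, current_frame,
                               half_life=10.0, min_score=0.6):
    """feature_stats: {feature_id: list of frame indices where it was detected}."""
    kept = []
    for fid, frames in feature_stats.items():
        # Recency-weighted detection count: recent observations count more.
        score = sum(math.exp(-(current_frame - f) / half_life) for f in frames)
        # Squash to (0, 1) as a pseudo-probability before thresholding.
        if score / (1.0 + score) > min_score:
            kept.append(fid)
    return kept
```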
Simultaneous localization and mapping (SLAM) based on laser sensors has been widely adopted by mobile robots and autonomous vehicles. These SLAM systems are required to support accurate localization with limited computational resources. In particular, point cloud registration, i.e., the process of matching and aligning multiple LiDAR scans collected at different locations in a global coordinate framework, has been deemed the bottleneck step in SLAM. In this paper, we propose a feature filtering algorithm, PFilter, that can filter out invalid features and thus greatly alleviate this bottleneck. Meanwhile, the overall registration accuracy also improves thanks to the carefully curated feature points. We integrate PFilter into the well-established scan-to-map LiDAR odometry framework F-LOAM and evaluate its performance on the KITTI dataset. The experimental results show that PFilter removes about 48.4% of the points in the local feature map and reduces the feature points per scan by 19.3% on average, saving 20.9% of the processing time per frame. At the same time, accuracy improves by 9.4%.
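A hedged sketch of one way such a filter could prune the local feature map: each map point tracks how many frames since its insertion actually matched it, and points with a low hit ratio are dropped before registration. The data layout and thresholds are our illustrative choices, not PFilter's implementation:

```python
# Drop local-map feature points that are rarely re-observed once they are old enough.
class MapPoint:
    def __init__(self, xyz, frame_idx):
        self.xyz = xyz
        self.born = frame_idx   # frame in which the point was first inserted
        self.hits = 1           # number of frames that matched this point

def filter_map_points(map_points, frame_idx, min_age=5, min_ratio=0.4):
    kept = []
    for p in map_points:
        age = frame_idx - p.born + 1
        if age < min_age or p.hits / age >= min_ratio:
            kept.append(p)      # keep young points and consistently re-observed points
    return kept
```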
Existing navigation policies for autonomous robots tend to focus on collision avoidance while ignoring human-robot interactions in social life. For instance, robots can pass along a corridor more safely and easily if pedestrians notice them. Sound has been considered an effective way to attract the attention of pedestrians, which can alleviate the freezing-robot problem. In this work, we present a new deep reinforcement learning (DRL) based social navigation approach for autonomous robots to move in pedestrian-rich environments with interaction capability. Most existing DRL-based methods train a single general policy that outputs both navigation actions, i.e., the robot's expected linear and angular velocities, and interaction actions, i.e., the beep action, within the reinforcement learning framework. In contrast, we train the policy via both supervised learning and reinforcement learning. Specifically, we first train an interaction policy with supervised learning, which provides a better understanding of the social situation, and then use this interaction policy to train the navigation policy via multiple reinforcement learning algorithms. We evaluate our approach in various simulation environments and compare it to other methods. The experimental results show that our approach outperforms the others in terms of success rate. We also deploy the trained policy on a real-world robot, which performs well in crowded environments.
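A minimal sketch of the two-stage structure under our own assumptions: a supervised classifier decides whether to beep, and its output augments the state seen by the separately trained RL navigation policy. Network sizes and the observation layout are illustrative:

```python
# Supervised interaction policy (beep / no beep) feeding an RL navigation policy.
import torch
import torch.nn as nn

class InteractionPolicy(nn.Module):        # trained with supervised beep labels
    def __init__(self, obs_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, obs):
        return self.net(obs)               # logits over {no_beep, beep}

class NavigationPolicy(nn.Module):         # trained with RL; consumes the beep decision
    def __init__(self, obs_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + 1, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, obs, beep_prob):
        # Append the interaction policy's beep probability to the observation.
        x = torch.cat([obs, beep_prob.unsqueeze(-1)], dim=-1)
        return self.net(x)                 # linear and angular velocity
```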
The dependency tree of a natural language sentence can capture interactions between words that are relevant to its semantics. However, it is unclear whether methods that exploit such dependency information for semantic parsing can be combined for further improvement, or how these methods relate to one another when combined. In this paper, we examine three methods of incorporating such dependency information into a Transformer-based semantic parser and empirically study their combinations. First, we replace the standard self-attention heads in the encoder with parent-scaled self-attention (PASCAL) heads, i.e., heads that can attend to the dependency parent of each token. Second, we concatenate syntax-aware word representations (SAWRs), i.e., the intermediate hidden representations of a neural dependency parser, with ordinary word embeddings to enhance the encoder. Finally, we insert a constituent attention (CA) module into the encoder, which adds an extra constraint on the attention heads so that they better capture the inherent dependency structure of input sentences. Transductive ensemble learning (TEL) is used for model aggregation, and an ablation study is conducted to show the contribution of each method. Our experiments show that CA is complementary to PASCAL and to SAWRs, and that PASCAL + CA provides state-of-the-art performance among neural approaches on ATIS, GEO, and JOBS.
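A hedged sketch of one common way to realize a parent-scaled attention head: each query token's attention weights are re-weighted by a Gaussian centered at that token's dependency parent position and then renormalized. The tensor shapes and the value of sigma are illustrative assumptions:

```python
# Re-weight softmaxed attention by proximity of each key to the query's dependency parent.
import torch

def parent_scaled_attention(attn, parent_pos, sigma=1.0):
    """attn: (batch, seq, seq) softmaxed attention weights (query x key);
    parent_pos: (batch, seq) index of each token's dependency parent."""
    batch, seq, _ = attn.shape
    key_idx = torch.arange(seq, dtype=torch.float32).view(1, 1, seq)
    centre = parent_pos.float().unsqueeze(-1)                  # (batch, seq, 1)
    gauss = torch.exp(-((key_idx - centre) ** 2) / (2 * sigma ** 2))
    scaled = attn * gauss                                      # emphasize keys near the parent
    return scaled / scaled.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```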
It has been well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantically rich stereo images would benefit 3D object detection. Nevertheless, it is not trivial to explore the inherently unnatural interaction between sparse 3D points and dense 2D pixels. To ease this difficulty, recent proposals generally project the 3D points onto the 2D image plane to sample the image data and then aggregate the data at the points. However, this approach often suffers from the mismatch between the resolution of point clouds and RGB images, leading to sub-optimal performance. Specifically, taking the sparse points as the multi-modal data aggregation locations causes severe information loss for high-resolution images, which in turn undermines the effectiveness of multi-sensor fusion. In this paper, we present VPFNet -- a new architecture that cleverly aligns and aggregates the point cloud and image data at 'virtual' points. With a density lying between that of the 3D points and the 2D pixels, the virtual points can nicely bridge the resolution gap between the two sensors and thus preserve more information for processing. Moreover, we also investigate data augmentation techniques that can be applied to both point clouds and RGB images, as data augmentation has made a non-negligible contribution to 3D object detectors to date. We have conducted extensive experiments on the KITTI dataset and observed good performance compared to state-of-the-art methods. Remarkably, our VPFNet achieves 83.21% moderate 3D AP and 91.86% moderate BEV AP on the KITTI test set, ranking 1st since May 21st, 2021. The network design also takes computational efficiency into consideration -- we achieve 15 FPS on a single NVIDIA RTX 2080Ti GPU. The code will be made available for reproduction and further investigation.
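A heavily simplified sketch of the virtual-point intuition, under our own assumptions: the sparse LiDAR points are densified with extra samples around each real point, and all of them are projected onto the image to gather per-point image features for fusion. The jitter radius, sample count, and projection convention are illustrative, not VPFNet's actual design:

```python
# Densify sparse points with nearby virtual samples, then sample image features at them.
import numpy as np

def make_virtual_points(points, n_virtual=4, radius=0.2, seed=0):
    """points: (N, 3) camera-frame LiDAR points. Returns (N * n_virtual, 3) virtual points."""
    rng = np.random.default_rng(seed)
    offsets = rng.uniform(-radius, radius, size=(points.shape[0], n_virtual, 3))
    return (points[:, None, :] + offsets).reshape(-1, 3)

def sample_image_features(virtual_pts, feat_map, K):
    """feat_map: (C, H, W) image feature map; K: (3, 3) camera intrinsics."""
    uvw = (K @ virtual_pts.T).T
    u = np.clip((uvw[:, 0] / uvw[:, 2]).astype(int), 0, feat_map.shape[2] - 1)
    v = np.clip((uvw[:, 1] / uvw[:, 2]).astype(int), 0, feat_map.shape[1] - 1)
    return feat_map[:, v, u].T            # (N * n_virtual, C) per-point image features
```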