Abstract:Visual Odometry (VO) is vital for the navigation of autonomous systems, providing accurate position and orientation estimates at reasonable costs. While traditional VO methods excel in some conditions, they struggle with challenges like variable lighting and motion blur. Deep learning-based VO, though more adaptable, can face generalization problems in new environments. Addressing these drawbacks, this paper presents a novel hybrid visual odometry (VO) framework that leverages pose-only supervision, offering a balanced solution between robustness and the need for extensive labeling. We propose two cost-effective and innovative designs: a self-supervised homographic pre-training for enhancing optical flow learning from pose-only labels and a random patch-based salient point detection strategy for more accurate optical flow patch extraction. These designs eliminate the need for dense optical flow labels for training and significantly improve the generalization capability of the system in diverse and challenging environments. Our pose-only supervised method achieves competitive performance on standard datasets and greater robustness and generalization ability in extreme and unseen scenarios, even compared to dense optical flow-supervised state-of-the-art methods.
Abstract:Autonomous robot navigation within the dynamic unknown environment is of crucial significance for mobile robotic applications including robot navigation in last-mile delivery and robot-enabled automated supplies in industrial and hospital delivery applications. Current solutions still suffer from limitations, such as the robot cannot recognize unknown objects in real time and cannot navigate freely in a dynamic, narrow, and complex environment. We propose a complete software framework for autonomous robot perception and navigation within very dense obstacles and dense human crowds. First, we propose a framework that accurately detects and segments open-world object categories in a zero-shot manner, which overcomes the over-segmentation limitation of the current SAM model. Second, we proposed the distillation strategy to distill the knowledge to segment the free space of the walkway for robot navigation without the label. In the meantime, we design the trimming strategy that works collaboratively with distillation to enable lightweight inference to deploy the neural network on edge devices such as NVIDIA-TX2 or Xavier NX during autonomous navigation. Integrated into the robot navigation system, extensive experiments demonstrate that our proposed framework has achieved superior performance in terms of both accuracy and efficiency in robot scene perception and autonomous robot navigation.
Abstract:We introduce an integrated precise LiDAR, Inertial, and Visual (LIV) multi-modal sensor fused mapping system that builds on the differentiable surface splatting to improve the mapping fidelity, quality, and structural accuracy. Notably, this is also a novel form of tightly coupled map for LiDAR-visual-inertial sensor fusion. This system leverages the complementary characteristics of LiDAR and visual data to capture the geometric structures of large-scale 3D scenes and restore their visual surface information with high fidelity. The initial poses for surface Gaussian scenes are obtained using a LiDAR-inertial system with size-adaptive voxels. Then, we optimized and refined the Gaussians by visual-derived photometric gradients to optimize the quality and density of LiDAR measurements. Our method is compatible with various types of LiDAR, including solid-state and mechanical LiDAR, supporting both repetitive and non-repetitive scanning modes. bolstering structure construction through LiDAR and facilitating real-time generation of photorealistic renderings across diverse LIV datasets. It showcases notable resilience and versatility in generating real-time photorealistic scenes potentially for digital twins and virtual reality while also holding potential applicability in real-time SLAM and robotics domains. We release our software and hardware and self-collected datasets on Github\footnote[3]{https://github.com/sheng00125/LIV-GaussMap} to benefit the community.
Abstract:The rapid evolution of autonomous vehicles (AVs) has significantly influenced global transportation systems. In this context, we present ``Snow Lion'', an autonomous shuttle meticulously designed to revolutionize on-campus transportation, offering a safer and more efficient mobility solution for students, faculty, and visitors. The primary objective of this research is to enhance campus mobility by providing a reliable, efficient, and eco-friendly transportation solution that seamlessly integrates with existing infrastructure and meets the diverse needs of a university setting. To achieve this goal, we delve into the intricacies of the system design, encompassing sensing, perception, localization, planning, and control aspects. We evaluate the autonomous shuttle's performance in real-world scenarios, involving a 1146-kilometer road haul and the transportation of 442 passengers over a two-month period. These experiments demonstrate the effectiveness of our system and offer valuable insights into the intricate process of integrating an autonomous vehicle within campus shuttle operations. Furthermore, a thorough analysis of the lessons derived from this experience furnishes a valuable real-world case study, accompanied by recommendations for future research and development in the field of autonomous driving.
Abstract:Existing state-of-the-art 3D point cloud understanding methods merely perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework that simultaneously solves the downstream high-level understanding tasks including both segmentation and detection, especially when labels are extremely limited. This work presents a general and simple framework to tackle point cloud understanding when labels are limited. The first contribution is that we have done extensive methodology comparisons of traditional and learned 3D descriptors for the task of weakly supervised 3D scene understanding, and validated that our adapted traditional PFH-based 3D descriptors show excellent generalization ability across different domains. The second contribution is that we proposed a learning-based region merging strategy based on the affinity provided by both the traditional/learned 3D descriptors and learned semantics. The merging process takes both low-level geometric and high-level semantic feature correlations into consideration. Experimental results demonstrate that our framework has the best performance among the three most important weakly supervised point clouds understanding tasks including semantic segmentation, instance segmentation, and object detection even when very limited number of points are labeled. Our method, termed Region Merging 3D (RM3D), has superior performance on ScanNet data-efficient learning online benchmarks and other four large-scale 3D understanding benchmarks under various experimental settings, outperforming current arts by a margin for various 3D understanding tasks without complicated learning strategies such as active learning.
Abstract:Existing state-of-the-art 3D point clouds understanding methods only perform well in a fully supervised manner. To the best of our knowledge, there exists no unified framework which simultaneously solves the downstream high-level understanding tasks, especially when labels are extremely limited. This work presents a general and simple framework to tackle point clouds understanding when labels are limited. We propose a novel unsupervised region expansion based clustering method for generating clusters. More importantly, we innovatively propose to learn to merge the over-divided clusters based on the local low-level geometric property similarities and the learned high-level feature similarities supervised by weak labels. Hence, the true weak labels guide pseudo labels merging taking both geometric and semantic feature correlations into consideration. Finally, the self-supervised reconstruction and data augmentation optimization modules are proposed to guide the propagation of labels among semantically similar points within a scene. Experimental Results demonstrate that our framework has the best performance among the three most important weakly supervised point clouds understanding tasks including semantic segmentation, instance segmentation, and object detection even when limited points are labeled, under the data-efficient settings for the large-scale 3D semantic scene parsing. The developed techniques have postentials to be applied to downstream tasks for better representations in robotic manipulation and robotic autonomous navigation. Codes and models are publicly available at: https://github.com/KangchengLiu.
Abstract:Deep neural network models have achieved remarkable progress in 3D scene understanding while trained in the closed-set setting and with full labels. However, the major bottleneck for current 3D recognition approaches is that they do not have the capacity to recognize any unseen novel classes beyond the training categories in diverse kinds of real-world applications. In the meantime, current state-of-the-art 3D scene understanding approaches primarily require high-quality labels to train neural networks, which merely perform well in a fully supervised manner. This work presents a generalized and simple framework for dealing with 3D scene understanding when the labeled scenes are quite limited. To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy to extract and distill meaningful information from large-scale vision-language models, which helps benefit the open-vocabulary scene understanding tasks. To leverage the boundary information, we propose a novel energy-based loss with boundary awareness benefiting from the region-level boundary predictions. To encourage latent instance discrimination and to guarantee efficiency, we propose the unsupervised region-level semantic contrastive learning scheme for point clouds, using confident predictions of the neural network to discriminate the intermediate feature embeddings at multiple stages. Extensive experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning. All codes, models, and data are made publicly available at: https://drive.google.com/drive/folders/1M58V-PtR8DBEwD296zJkNg_m2qq-MTAP?usp=sharing.
Abstract:One-shot LiDAR localization refers to the ability to estimate the robot pose from one single point cloud, which yields significant advantages in initialization and relocalization processes. In the point cloud domain, the topic has been extensively studied as a global descriptor retrieval (i.e., loop closure detection) and pose refinement (i.e., point cloud registration) problem both in isolation or combined. However, few have explicitly considered the relationship between candidate retrieval and correspondence generation in pose estimation, leaving them brittle to substructure ambiguities. To this end, we propose a hierarchical one-shot localization algorithm called Outram that leverages substructures of 3D scene graphs for locally consistent correspondence searching and global substructure-wise outlier pruning. Such a hierarchical process couples the feature retrieval and the correspondence extraction to resolve the substructure ambiguities by conducting a local-to-global consistency refinement. We demonstrate the capability of Outram in a variety of scenarios in multiple large-scale outdoor datasets. Our implementation is open-sourced: https://github.com/Pamphlett/Outram.
Abstract:Path planning for multiple tethered robots is a challenging problem due to the complex interactions among the cables and the possibility of severe entanglements. Previous works on this problem either consider idealistic cable models or provide no guarantee for entanglement-free paths. In this work, we present a new approach to address this problem using the theory of braids. By establishing a topological equivalence between the physical cables and the space-time trajectories of the robots, and identifying particular braid patterns that emerge from the entangled trajectories, we obtain the key finding that all complex entanglements stem from a finite number of interaction patterns between 2 or 3 robots. Hence, non-entanglement can be guaranteed by avoiding these interaction patterns in the trajectories of the robots. Based on this finding, we present a graph search algorithm using the permutation grid to efficiently search for a feasible topology of paths and reject braid patterns that result in an entanglement. We demonstrate that the proposed algorithm can achieve 100% goal-reaching capability without entanglement for up to 10 drones with a slack cable model in a high-fidelity simulation platform. The practicality of the proposed approach is verified using three small tethered UAVs in indoor flight experiments.
Abstract:Robust point cloud parsing under all-weather conditions is crucial to level-5 autonomy in autonomous driving. However, how to learn a universal 3D semantic segmentation (3DSS) model is largely neglected as most existing benchmarks are dominated by point clouds captured under normal weather. We introduce SemanticSTF, an adverse-weather point cloud dataset that provides dense point-level annotations and allows to study 3DSS under various adverse weather conditions. We study all-weather 3DSS modeling under two setups: 1) domain adaptive 3DSS that adapts from normal-weather data to adverse-weather data; 2) domain generalizable 3DSS that learns all-weather 3DSS models from normal-weather data. Our studies reveal the challenge while existing 3DSS methods encounter adverse-weather data, showing the great value of SemanticSTF in steering the future endeavor along this very meaningful research direction. In addition, we design a domain randomization technique that alternatively randomizes the geometry styles of point clouds and aggregates their embeddings, ultimately leading to a generalizable model that can improve 3DSS under various adverse weather effectively. The SemanticSTF and related codes are available at \url{https://github.com/xiaoaoran/SemanticSTF}.