To achieve robust motion estimation in visually degraded environments, thermal odometry has been an attraction in the robotics community. However, most thermal odometry methods are purely based on classical feature extractors, which is difficult to establish robust correspondences in successive frames due to sudden photometric changes and large thermal noise. To solve this problem, we propose ThermalPoint, a lightweight feature detection network specifically tailored for producing keypoints on thermal images, providing notable anti-noise improvements compared with other state-of-the-art methods. After that, we combine ThermalPoint with a novel radiometric feature tracking method, which directly makes use of full radiometric data and establishes reliable correspondences between sequential frames. Finally, taking advantage of an optimization-based visual-inertial framework, a deep feature-based thermal-inertial odometry (TP-TIO) framework is proposed and evaluated thoroughly in various visually degraded environments. Experiments show that our method outperforms state-of-the-art visual and laser odometry methods in smoke-filled environments and achieves competitive accuracy in normal environments.
Aerial vehicles are revolutionizing the way film-makers can capture shots of actors by composing novel aerial and dynamic viewpoints. However, despite great advancements in autonomous flight technology, generating expressive camera behaviors is still a challenge and requires non-technical users to edit a large number of unintuitive control parameters. In this work we develop a data-driven framework that enables editing of these complex camera positioning parameters in a semantic space (e.g. calm, enjoyable, establishing). First, we generate a database of video clips with a diverse range of shots in a photo-realistic simulator, and use hundreds of participants in a crowd-sourcing framework to obtain scores for a set of semantic descriptors for each clip. Next, we analyze correlations between descriptors and build a semantic control space based on cinematography guidelines and human perception studies. Finally, we learn a generative model that can map a set of desired semantic video descriptors into low-level camera trajectory parameters. We evaluate our system by demonstrating that our model successfully generates shots that are rated by participants as having the expected degrees of expression for each descriptor. We also show that our models generalize to different scenes in both simulation and real-world experiments. Supplementary video: https://youtu.be/6WX2yEUE9_k
The introduction of fully-actuated multirotors has opened the door to new possibilities and more efficient solutions to many real-world applications. However, their integration had been slower than expected, partly due to the need for new tools to take full advantage of these robots. As far as we know, all the groups currently working on the fully-actuated multirotors develop new full-pose (6-D) tools and methods to use their robots, which is inefficient, time-consuming, and requires many resources. We propose methods that extend the existing flight controllers to support the new fully-actuated robots and bridge the gap between the tools already available for underactuated robots and the new fully-actuated vehicles. We introduce attitude strategies that work with the underactuated planners, controllers, tools, and remote control interfaces, all while allowing taking advantage of the full actuation. Moreover, new methods are proposed that can properly handle the limited lateral thrust suffered by many fully-actuated UAV designs. The strategies are lightweight, simple, and allow rapid integration of the available tools with these new vehicles for the fast development of new real-world applications. The real experiments on our robots and simulations on several UAV architectures show how the strategies can be utilized. The source code of the PX4 firmware enhanced with the proposed methods and its simulator with our fully-actuated hexarotor model are provided with this paper. For more information, please visit https://theairlab.org/fully-actuated/.
Aerial cinematography is significantly expanding the capabilities of film-makers. Recent progress in autonomous unmanned aerial vehicles (UAVs) has further increased the potential impact of aerial cameras, with systems that can safely track actors in unstructured cluttered environments. Professional productions, however, require the use of multiple cameras simultaneously to record different viewpoints of the same scene, which are edited into the final footage either in real time or in post-production. Such extreme motion coordination is particularly hard for unscripted action scenes, which are a common use case of aerial cameras. In this work we develop a real-time multi-UAV coordination system that is capable of recording dynamic targets while maximizing shot diversity and avoiding collisions and mutual visibility between cameras. We validate our approach in multiple cluttered environments of a photo-realistic simulator, and deploy the system using two UAVs in real-world experiments. We show that our coordination scheme has low computational cost and takes only 1.17 ms on average to plan for a team of 3 UAVs over a 10 s time horizon. Supplementary video: https://youtu.be/m2R3anv2ADE
Line segment detection is essential for high-level tasks in computer vision and robotics. Currently, most stateof-the-art (SOTA) methods are dedicated to detecting straight line segments in undistorted pinhole images, thus distortions on fisheye or spherical images may largely degenerate their performance. Targeting at the unified line segment detection (ULSD) for both distorted and undistorted images, we propose to represent line segments with the Bezier curve model. Then the line segment detection is tackled by the Bezier curve regression with an end-to-end network, which is model-free and without any undistortion preprocessing. Experimental results on the pinhole, fisheye, and spherical image datasets validate the superiority of the proposed ULSD to the SOTA methods both in accuracy and efficiency (40.6fps for pinhole images). The source code is available at https://github.com/lh9171338/Unified-LineSegment-Detection.
We present the first learning-based visual odometry (VO) model, which generalizes to multiple datasets and real-world scenarios and outperforms geometry-based methods in challenging scenes. We achieve this by leveraging the SLAM dataset TartanAir, which provides a large amount of diverse synthetic data in challenging environments. Furthermore, to make our VO model generalize across datasets, we propose an up-to-scale loss function and incorporate the camera intrinsic parameters into the model. Experiments show that a single model, TartanVO, trained only on synthetic data, without any finetuning, can be generalized to real-world datasets such as KITTI and EuRoC, demonstrating significant advantages over the geometry-based methods on challenging trajectories. Our code is available at https://github.com/castacks/tartanvo.
In this paper, we propose a novel laser-inertial odometry and mapping method to achieve real-time, low-drift and robust pose estimation in large-scale highway environments. The proposed method is mainly composed of four sequential modules, namely scan pre-processing module, dynamic object detection module, laser-inertial odometry module and laser mapping module. Scan pre-processing module uses inertial measurements to compensate the motion distortion of each laser scan. Then, the dynamic object detection module is used to detect and remove dynamic objects from each laser scan by applying CNN segmentation network. After obtaining the undistorted point cloud without moving objects, the laser inertial odometry module uses an Error State Kalman Filter to fuse the data of laser and IMU and output the coarse pose estimation at high frequency. Finally, the laser mapping module performs a fine processing step and the "Frame-to-Model" scan matching strategy is used to create a static global map. We compare the performance of our method with two state-ofthe-art methods, LOAM and SuMa, using KITTI dataset and real highway scene dataset. Experiment results show that our method performs better than the state-of-the-art methods in real highway environments and achieves competitive accuracy on the KITTI dataset.
Graph neural networks are powerful models for many graph-structured tasks. In this paper, we aim to solve the problem of lifelong learning for graph neural networks. One of the main challenges is the effect of "catastrophic forgetting" for continuously learning a sequence of tasks, as the nodes can only be present to the model once. Moreover, the number of nodes changes dynamically in lifelong learning and this makes many graph models and sampling strategies inapplicable. To solve these problems, we construct a new graph topology, called the feature graph. It takes features as new nodes and turns nodes into independent graphs. This successfully converts the original problem of node classification to graph classification. In this way, the increasing nodes in lifelong learning can be regarded as increasing training samples, which makes lifelong learning easier. We demonstrate that the feature graph achieves much higher accuracy than the state-of-the-art methods in both data-incremental and class-incremental tasks. We expect that the feature graph will have broad potential applications for graph-structured tasks in lifelong learning.
In this paper, we aim to solve the problem of interesting scene prediction for mobile robots. This area is currently under explored but is crucial for many practical applications such as autonomous exploration and decision making. First, we expect a robot to detect novel and interesting scenes in unknown environments and lose interests over time after repeatedly observing similar objects. Second, we expect the robots to learn from unbalanced data in a short time, as the robots normally only know the uninteresting scenes before they are deployed. Inspired by those industrial demands, we first propose a novel translation-invariant visual memory for recalling and identifying interesting scenes, then design a three-stage architecture of long-term, short-term, and online learning for human-like experience, environmental knowledge, and online adaption, respectively. It is demonstrated that our approach is able to learn online and find interesting scenes for practical exploration tasks. It also achieves a much higher accuracy than the state-of-the-art algorithm on very challenging robotic interestingness prediction datasets.
Light-weight camera localization in existing maps is essential for vision-based navigation. Currently, visual and visual-inertial odometry (VO\&VIO) techniques are well-developed for state estimation but with inevitable accumulated drifts and pose jumps upon loop closure. To overcome these problems, we propose an efficient monocular camera localization method in prior LiDAR maps using directly estimated 2D-3D line correspondences. To handle the appearance differences and modality gaps between untextured point clouds and images, geometric 3D lines are extracted offline from LiDAR maps while robust 2D lines are extracted online from video sequences. With the pose prediction from VIO, we can efficiently obtain coarse 2D-3D line correspondences. After that, the camera poses and 2D-3D correspondences are iteratively optimized by minimizing the projection error of correspondences and rejecting outliers. The experiment results on the EurocMav dataset and our collected dataset demonstrate that the proposed method can efficiently estimate camera poses without accumulated drifts or pose jumps in urban environments. The code and our collected data are available at https://github.com/levenberg/2D-3D-pose-tracking.