Monitoring plants and fruits at high resolution play a key role in the future of agriculture. Accurate 3D information can pave the way to a diverse number of robotic applications in agriculture ranging from autonomous harvesting to precise yield estimation. Obtaining such 3D information is non-trivial as agricultural environments are often repetitive and cluttered, and one has to account for the partial observability of fruit and plants. In this paper, we address the problem of jointly estimating complete 3D shapes of fruit and their pose in a 3D multi-resolution map built by a mobile robot. To this end, we propose an online multi-resolution panoptic mapping system where regions of interest are represented with a higher resolution. We exploit data to learn a general fruit shape representation that we use at inference time together with an occlusion-aware differentiable rendering pipeline to complete partial fruit observations and estimate the 7 DoF pose of each fruit in the map. The experiments presented in this paper, evaluated both in the controlled environment and in a commercial greenhouse, show that our novel algorithm yields higher completion and pose estimation accuracy than existing methods, with an improvement of 41% in completion accuracy and 52% in pose estimation accuracy while keeping a low inference time of 0.6s in average.
Scene understanding is crucial for autonomous robots in dynamic environments for making future state predictions, avoiding collisions, and path planning. Camera and LiDAR perception made tremendous progress in recent years, but face limitations under adverse weather conditions. To leverage the full potential of multi-modal sensor suites, radar sensors are essential for safety critical tasks and are already installed in most new vehicles today. In this paper, we address the problem of semantic segmentation of moving objects in radar point clouds to enhance the perception of the environment with another sensor modality. Instead of aggregating multiple scans to densify the point clouds, we propose a novel approach based on the self-attention mechanism to accurately perform sparse, single-scan segmentation. Our approach, called Gaussian Radar Transformer, includes the newly introduced Gaussian transformer layer, which replaces the softmax normalization by a Gaussian function to decouple the contribution of individual points. To tackle the challenge of the transformer to capture long-range dependencies, we propose our attentive up- and downsampling modules to enlarge the receptive field and capture strong spatial relations. We compare our approach to other state-of-the-art methods on the RadarScenes data set and show superior segmentation quality in diverse environments, even without exploiting temporal information.
Nano-size unmanned aerial vehicles (UAVs) hold enormous potential to perform autonomous operations in complex environments, such as inspection, monitoring or data collection. Moreover, their small size allows safe operation close to humans and agile flight. An important part of autonomous flight is localization, which is a computationally intensive task especially on a nano-UAV that usually has strong constraints in sensing, processing and memory. This work presents a real-time localization approach with low element-count multizone range sensors for resource-constrained nano-UAVs. The proposed approach is based on a novel miniature 64-zone time-of-flight sensor from ST Microelectronics and a RISC-V-based parallel ultra low-power processor, to enable accurate and low latency Monte Carlo Localization on-board. Experimental evaluation using a nano-UAV open platform demonstrated that the proposed solution is capable of localizing on a 31.2m$\boldsymbol{^2}$ map with 0.15m accuracy and an above 95% success rate. The achieved accuracy is sufficient for localization in common indoor environments. We analyze tradeoffs in using full and half-precision floating point numbers as well as a quantized map and evaluate the accuracy and memory footprint across the design space. Experimental evaluation shows that parallelizing the execution for 8 RISC-V cores brings a 7x speedup and allows us to execute the algorithm on-board in real-time with a latency of 0.2-30ms (depending on the number of particles), while only increasing the overall drone power consumption by 3-7%. Finally, we provide an open-source implementation of our approach.
A reliable pose estimator robust to environmental disturbances is desirable for mobile robots. To this end, inertial measurement units (IMUs) play an important role because they can perceive the full motion state of the vehicle independently. However, it suffers from accumulative error due to inherent noise and bias instability, especially for low-cost sensors. In our previous studies on Wheel-INS \cite{niu2021, wu2021}, we proposed to limit the error drift of the pure inertial navigation system (INS) by mounting an IMU to the wheel of the robot to take advantage of rotation modulation. However, it still drifted over a long period of time due to the lack of external correction signals. In this letter, we propose to exploit the environmental perception ability of Wheel-INS to achieve simultaneous localization and mapping (SLAM) with only one IMU. To be specific, we use the road bank angles (mirrored by the robot roll angles estimated by Wheel-INS) as terrain features to enable the loop closure with a Rao-Blackwellized particle filter. The road bank angle is sampled and stored according to the robot position in the grid maps maintained by the particles. The weights of the particles are updated according to the difference between the currently estimated roll sequence and the terrain map. Field experiments suggest the feasibility of the idea to perform SLAM in Wheel-INS using the robot roll angle estimates. In addition, the positioning accuracy is improved significantly (more than 30\%) over Wheel-INS. Source code of our implementation is publicly available (https://github.com/i2Nav-WHU/Wheel-SLAM).
Plant phenotyping is a central task in agriculture, as it describes plants' growth stage, development, and other relevant quantities. Robots can help automate this process by accurately estimating plant traits such as the number of leaves, leaf area, and the plant size. In this paper, we address the problem of joint semantic, plant instance, and leaf instance segmentation of crop fields from RGB data. We propose a single convolutional neural network that addresses the three tasks simultaneously, exploiting their underlying hierarchical structure. We introduce task-specific skip connections, which our experimental evaluation proves to be more beneficial than the usual schemes. We also propose a novel automatic post-processing, which explicitly addresses the problem of spatially close instances, common in the agricultural domain because of overlapping leaves. Our architecture simultaneously tackles these problems jointly in the agricultural context. Previous works either focus on plant or leaf segmentation, or do not optimise for semantic segmentation. Results show that our system has superior performance to state-of-the-art approaches, while having a reduced number of parameters and is operating at camera frame rate.
Determining the state of a mobile robot is an essential building block of robot navigation systems. In this paper, we address the problem of estimating the robots pose in an indoor environment using 2D LiDAR data and investigate how modern environment models can improve gold standard Monte-Carlo localization (MCL) systems. We propose a neural occupancy field (NOF) to implicitly represent the scene using a neural network. With the pretrained network, we can synthesize 2D LiDAR scans for an arbitrary robot pose through volume rendering. Based on the implicit representation, we can obtain the similarity between a synthesized and actual scan as an observation model and integrate it into an MCL system to perform accurate localization. We evaluate our approach on five sequences of a self-recorded dataset and three publicly available datasets. We show that we can accurately and efficiently localize a robot using our approach surpassing the localization performance of state-of-the-art methods. The experiments suggest that the presented implicit representation is able to predict more accurate 2D LiDAR scans leading to an improved observation model for our particle filter-based localization. The code of our approach is released at: https://github.com/PRBonn/ir-mcl.
Perception is crucial for robots that act in real-world environments, as autonomous systems need to see and understand the world around them to act appropriately. Panoptic segmentation provides an interpretation of the scene by computing a pixel-wise semantic label together with instance IDs. In this paper, we address panoptic segmentation using RGB-D data of indoor scenes. We propose a novel encoder-decoder neural network that processes RGB and depth separately through two encoders. The features of the individual encoders are progressively merged at different resolutions, such that the RGB features are enhanced using complementary depth information. We propose a novel merging approach called ResidualExcite, which reweighs each entry of the feature map according to its importance. With our double-encoder architecture, we are robust to missing cues. In particular, the same model can train and infer on RGB-D, RGB-only, and depth-only input data, without the need to train specialized models. We evaluate our method on publicly available datasets and show that our approach achieves superior results compared to other common approaches for panoptic segmentation.
Accurate mapping of large-scale environments is an essential building block of most outdoor autonomous systems. Challenges of traditional mapping methods include the balance between memory consumption and mapping accuracy. This paper addresses the problems of achieving large-scale 3D reconstructions with implicit representations using 3D LiDAR measurements. We learn and store implicit features through an octree-based hierarchical structure, which is sparse and extensible. The features can be turned into signed distance values through a shallow neural network. We leverage binary cross entropy loss to optimize the local features with the 3D measurements as supervision. Based on our implicit representation, we design an incremental mapping system with regularization to tackle the issue of catastrophic forgetting in continual learning. Our experiments show that our 3D reconstructions are more accurate, complete, and memory-efficient than current state-of-the-art 3D mapping methods.
Lifelong localization in a given map is an essential capability for autonomous service robots. In this paper, we consider the task of long-term localization in a changing indoor environment given sparse CAD floor plans. The commonly used pre-built maps from the robot sensors may increase the cost and time of deployment. Furthermore, their detailed nature requires that they are updated when significant changes occur. We address the difficulty of localization when the correspondence between the map and the observations is low due to the sparsity of the CAD map and the changing environment. To overcome both challenges, we propose to exploit semantic cues that are commonly present in human-oriented spaces. These semantic cues can be detected using RGB cameras by utilizing object detection, and are matched against an easy-to-update, abstract semantic map. The semantic information is integrated into a Monte Carlo localization framework using a particle filter that operates on 2D LiDAR scans and camera data. We provide a long-term localization solution and a semantic map format, for environments that undergo changes to their interior structure and detailed geometric maps are not available. We evaluate our localization framework on multiple challenging indoor scenarios in an office environment, taken weeks apart. The experiments suggest that our approach is robust to structural changes and can run on an onboard computer. We released the open source implementation of our approach written in C++ together with a ROS wrapper.
Robust and accurate pose estimation of a robotic platform, so-called sensor-based odometry, is an essential part of many robotic applications. While many sensor odometry systems made progress by adding more complexity to the ego-motion estimation process, we move in the opposite direction. By removing a majority of parts and focusing on the core elements, we obtain a surprisingly effective system that is simple to realize and can operate under various environmental conditions using different LiDAR sensors. Our odometry estimation approach relies on point-to-point ICP combined with adaptive thresholding for correspondence matching, a robust kernel, a simple but widely applicable motion compensation approach, and a point cloud subsampling strategy. This yields a system with only a few parameters that in most cases do not even have to be tuned to a specific LiDAR sensor. Our system using the same parameters performs on par with state-of-the-art methods under various operating conditions using different platforms: automotive platforms, UAV-based operation, vehicles like segways, or handheld LiDARs. We do not require integrating IMU information and solely rely on 3D point cloud data obtained from a wide range of 3D LiDAR sensors, thus, enabling a broad spectrum of different applications and operating conditions. Our open-source system operates faster than the sensor frame rate in all presented datasets and is designed for real-world scenarios.