Scene flow represents the motion of points in the 3D space, which is the counterpart of the optical flow that represents the motion of pixels in the 2D image. However, it is difficult to obtain the ground truth of scene flow in the real scenes, and recent studies are based on synthetic data for training. Therefore, how to train a scene flow network with unsupervised methods based on real-world data shows crucial significance. A novel unsupervised learning method for scene flow is proposed in this paper, which utilizes the images of two consecutive frames taken by monocular camera without the ground truth of scene flow for training. Our method realizes the goal that training scene flow network with real-world data, which bridges the gap between training data and test data and broadens the scope of available data for training. Unsupervised learning of scene flow in this paper mainly consists of two parts: (i) depth estimation and camera pose estimation, and (ii) scene flow estimation based on four different loss functions. Depth estimation and camera pose estimation obtain the depth maps and camera pose between two consecutive frames, which provide further information for the next scene flow estimation. After that, we used depth consistency loss, dynamic-static consistency loss, Chamfer loss, and Laplacian regularization loss to carry out unsupervised training of the scene flow network. To our knowledge, this is the first paper that realizes the unsupervised learning of 3D scene flow from monocular camera. The experiment results on KITTI show that our method for unsupervised learning of scene flow meets great performance compared to traditional methods Iterative Closest Point (ICP) and Fast Global Registration (FGR). The source code is available at: https://github.com/IRMVLab/3DUnMonoFlow.
Multiple object tracking (MOT) is a significant task in achieving autonomous driving. Traditional works attempt to complete this task, either based on point clouds (PC) collected by LiDAR, or based on images captured from cameras. However, relying on one single sensor is not robust enough, because it might fail during the tracking process. On the other hand, feature fusion from multiple modalities contributes to the improvement of accuracy. As a result, new techniques based on different sensors integrating features from multiple modalities are being developed. Texture information from RGB cameras and 3D structure information from Lidar have respective advantages under different circumstances. However, it's not easy to achieve effective feature fusion because of completely distinct information modalities. Previous fusion methods usually fuse the top-level features after the backbones extract the features from different modalities. In this paper, we first introduce PointNet++ to obtain multi-scale deep representations of point cloud to make it adaptive to our proposed Interactive Feature Fusion between multi-scale features of images and point clouds. Specifically, through multi-scale interactive query and fusion between pixel-level and point-level features, our method, can obtain more distinguishing features to improve the performance of multiple object tracking. Besides, we explore the effectiveness of pre-training on each single modality and fine-tuning on the fusion-based model. The experimental results demonstrate that our method can achieve good performance on the KITTI benchmark and outperform other approaches without using multi-scale feature fusion. Moreover, the ablation studies indicates the effectiveness of multi-scale feature fusion and pre-training on single modality.
3D Multi-Object Tracking (MOT) is an important part of the unmanned vehicle perception module. Most methods optimize object detection and data association independently. These methods make the network structure complicated and limit the improvement of MOT accuracy. we proposed a 3D MOT framework based on simultaneous optimization of object detection and scene flow estimation. In the framework, a detection-guidance scene flow module is proposed to relieve the problem of incorrect inter-frame assocation. For more accurate scene flow label especially in the case of motion with rotation, a box-transformation-based scene flow ground truth calculation method is proposed. Experimental results on the KITTI MOT dataset show competitive results over the state-of-the-arts and the robustness under extreme motion with rotation.
Depth and ego-motion estimations are essential for the localization and navigation of autonomous robots and autonomous driving. Recent studies make it possible to learn the per-pixel depth and ego-motion from the unlabeled monocular video. A novel unsupervised training framework is proposed with 3D hierarchical refinement and augmentation using explicit 3D geometry. In this framework, the depth and pose estimations are hierarchically and mutually coupled to refine the estimated pose layer by layer. The intermediate view image is proposed and synthesized by warping the pixels in an image with the estimated depth and coarse pose. Then, the residual pose transformation can be estimated from the new view image and the image of the adjacent frame to refine the coarse pose. The iterative refinement is implemented in a differentiable manner in this paper, making the whole framework optimized uniformly. Meanwhile, a new image augmentation method is proposed for the pose estimation by synthesizing a new view image, which creatively augments the pose in 3D space but gets a new augmented 2D image. The experiments on KITTI demonstrate that our depth estimation achieves state-of-the-art performance and even surpasses recent approaches that utilize other auxiliary tasks. Our visual odometry outperforms all recent unsupervised monocular learning-based methods and achieves competitive performance to the geometry-based method, ORB-SLAM2 with back-end optimization.
An efficient 3D point cloud learning architecture, named PWCLO-Net, for LiDAR odometry is first proposed in this paper. In this architecture, the projection-aware representation of the 3D point cloud is proposed to organize the raw 3D point cloud into an ordered data form to achieve efficiency. The Pyramid, Warping, and Cost volume (PWC) structure for the LiDAR odometry task is built to estimate and refine the pose in a coarse-to-fine approach hierarchically and efficiently. A projection-aware attentive cost volume is built to directly associate two discrete point clouds and obtain embedding motion patterns. Then, a trainable embedding mask is proposed to weigh the local motion patterns to regress the overall pose and filter outlier points. The trainable pose warp-refinement module is iteratively used with embedding mask optimized hierarchically to make the pose estimation more robust for outliers. The entire architecture is holistically optimized end-to-end to achieve adaptive learning of cost volume and mask, and all operations involving point cloud sampling and grouping are accelerated by projection-aware 3D feature learning methods. The superior performance and effectiveness of our LiDAR odometry architecture are demonstrated on KITTI odometry dataset. Our method outperforms all recent learning-based methods and even the geometry-based approach, LOAM with mapping optimization, on most sequences of KITTI odometry dataset.
Automatic packing of objects is a critical component for efficient shipping in the Industry 4.0 era. Although robots have shown great success in pick-and-place operations with rigid products, the autonomous shaping and packing of elastic materials into compact boxes remains one of the most challenging problems in robotics; The automation of packing tasks is crucial at this moment given the accelerating shift towards e-commerce (which requires to manipulate multiple types of materials). In this paper, we propose a new action planning approach to automatically pack long linear elastic objects into common-size boxes with a bimanual robotic system. For that, we developed an efficient vision-based method to compute the objects' geometry and track its deformation in real-time and without special markers; The algorithm filters and orders the feedback point cloud that is captured by a depth sensor. A reference object model is introduced to plan the manipulation targets and to complete occluded parts of the object. Action primitives are used to construct high-level behaviors, which enable the execution of all packing steps. To validate the proposed theory, we conduct a detailed experimental study with multiple types and lengths of objects and packing boxes. The proposed methodology is original and its demonstrated manipulation capabilities have not (to the best of the authors knowledge) been previously reported in the literature.
Scene flow estimation is the task to predict the point-wise 3D displacement vector between two consecutive frames of point clouds, which has important application in fields such as service robots and autonomous driving. Although many previous works have explored greatly on scene flow estimation based on point clouds, we point out two problems that have not been noticed or well solved before: 1) Points of adjacent frames in repetitive patterns may be wrongly associated due to similar spatial structure in their neighbourhoods; 2) Scene flow between adjacent frames of point clouds with long-distance movement may be inaccurately estimated. To solve the first problem, we propose a novel context-aware set conv layer to exploit contextual structure information of Euclidean space and learn soft aggregation weights for local point features. Our design is inspired by human perception of contextual structure information during scene understanding. We incorporate the context-aware set conv layer in a context-aware point feature pyramid module of 3D point clouds for scene flow estimation. For the second problem, we propose an explicit residual flow learning structure in the residual flow refinement layer to cope with long-distance movement. The experiments and ablation study on FlyingThings3D and KITTI scene flow datasets demonstrate the effectiveness of each proposed component and that we solve problem of ambiguous inter-frame association and long-distance movement estimation. Quantitative results on both FlyingThings3D and KITTI scene flow datasets show that our method achieves state-of-the-art performance, surpassing all other previous works to the best of our knowledge by at least 25%.
Optical flow estimation is a fundamental problem of computer vision and has many applications in the fields of robot learning and autonomous driving. This paper reveals novel geometric laws of optical flow based on the insight and detailed definition of non-occlusion. Then, two novel loss functions are proposed for the unsupervised learning of optical flow based on the geometric laws of non-occlusion. Specifically, after the occlusion part of the images are masked, the flowing process of pixels is carefully considered and geometric constraints are conducted based on the geometric laws of optical flow. First, neighboring pixels in the first frame will not intersect during the pixel displacement to the second frame. Secondly, when the cluster containing adjacent four pixels in the first frame moves to the second frame, no other pixels will flow into the quadrilateral formed by them. According to the two geometrical constraints, the optical flow non-intersection loss and the optical flow non-blocking loss in the non-occlusion regions are proposed. Two loss functions punish the irregular and inexact optical flows in the non-occlusion regions. The experiments on datasets demonstrated that the proposed unsupervised losses of optical flow based on the geometric laws in non-occlusion regions make the estimated optical flow more refined in detail, and improve the performance of unsupervised learning of optical flow. In addition, the experiments training on synthetic data and evaluating on real data show that the generalization ability of optical flow network is improved by our proposed unsupervised approach.
Real-time 3D human pose estimation is crucial for human-computer interaction. It is cheap and practical to estimate 3D human pose only from monocular video. However, recent bone splicing based 3D human pose estimation method brings about the problem of cumulative error. In this paper, the concept of virtual bones is proposed to solve such a challenge. The virtual bones are imaginary bones between non-adjacent joints. They do not exist in reality, but they bring new loop constraints for the estimation of 3D human joints. The proposed network in this paper predicts real bones and virtual bones, simultaneously. The final length of real bones is constrained and learned by the loop constructed by the predicted real bones and virtual bones. Besides, the motion constraints of joints in consecutive frames are considered. The consistency between the 2D projected position displacement predicted by the network and the captured real 2D displacement by the camera is proposed as a new projection consistency loss for the learning of 3D human pose. The experiments on the Human3.6M dataset demonstrate the good performance of the proposed method. Ablation studies demonstrate the effectiveness of the proposed inter-frame projection consistency constraints and intra-frame loop constraints.
A new unsupervised learning method of depth and ego-motion using multiple masks from monocular video is proposed in this paper. The depth estimation network and the ego-motion estimation network are trained according to the constraints of depth and ego-motion without truth values. The main contribution of our method is to carefully consider the occlusion of the pixels generated when the adjacent frames are projected to each other, and the blank problem generated in the projection target imaging plane. Two fine masks are designed to solve most of the image pixel mismatch caused by the movement of the camera. In addition, some relatively rare circumstances are considered, and repeated masking is proposed. To some extent, the method is to use a geometric relationship to filter the mismatched pixels for training, making unsupervised learning more efficient and accurate. The experiments on KITTI dataset show our method achieves good performance in terms of depth and ego-motion. The generalization capability of our method is demonstrated by training on the low-quality uncalibrated bike video dataset and evaluating on KITTI dataset, and the results are still good.