Xibin Song

RGB-based Category-level Object Pose Estimation via Decoupled Metric Scale Recovery

Sep 19, 2023
Jiaxin Wei, Xibin Song, Weizhe Liu, Laurent Kneip, Hongdong Li, Pan Ji

While showing promising results, recent RGB-D camera-based category-level object pose estimation methods have restricted applications due to their heavy reliance on depth sensors. RGB-only methods provide an alternative yet suffer from the inherent scale ambiguity of monocular observations. In this paper, we propose a novel pipeline that decouples 6D pose and size estimation to mitigate the influence of imperfect scales on rigid transformations. Specifically, we leverage a pre-trained monocular estimator to extract local geometric information, mainly facilitating the search for inlier 2D-3D correspondences. Meanwhile, a separate branch is designed to directly recover the metric scale of the object based on category-level statistics. Finally, we advocate using the RANSAC-P$n$P algorithm to robustly solve for the 6D object pose. Extensive experiments have been conducted on both synthetic and real datasets, demonstrating the superior performance of our method over previous state-of-the-art RGB-based approaches, especially in terms of rotation accuracy.
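
A minimal sketch of the final step described above: scale the object coordinates by the separately recovered metric scale, then solve for the pose with RANSAC-PnP. This is not the authors' code; it assumes OpenCV's solvePnPRansac, hypothetical keypoint inputs, and a scalar scale from the scale branch.

```python
import numpy as np
import cv2

def pose_from_decoupled_scale(pts_2d, pts_3d_normalized, metric_scale, K):
    """Recover a 6D pose with RANSAC-PnP after scaling normalized object
    coordinates by a separately predicted metric scale.

    pts_2d:             (N, 2) pixel coordinates of object keypoints
    pts_3d_normalized:  (N, 3) corresponding points in a normalized object frame
    metric_scale:       scalar predicted by the scale branch (hypothetical input)
    K:                  (3, 3) camera intrinsics
    """
    # Restore metric object coordinates before solving for the rigid transform.
    pts_3d_metric = pts_3d_normalized * metric_scale

    # Robustly solve PnP; RANSAC rejects outlier 2D-3D correspondences.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d_metric.astype(np.float32),
        pts_2d.astype(np.float32),
        K.astype(np.float32),
        distCoeffs=None,
        reprojectionError=3.0,
        iterationsCount=100,
        flags=cv2.SOLVEPNP_EPNP,
    )
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix from the Rodrigues vector
    return ok, R, tvec, inliers
```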

Digging Into Uncertainty-based Pseudo-label for Robust Stereo Matching

Jul 31, 2023
Zhelun Shen, Xibin Song, Yuchao Dai, Dingfu Zhou, Zhibo Rao, Liangjun Zhang

Due to domain differences and unbalanced disparity distributions across multiple datasets, current stereo matching approaches are commonly limited to a specific dataset and generalize poorly to others. This domain-shift issue is usually addressed by substantial adaptation on costly target-domain ground-truth data, which cannot easily be obtained in practical settings. In this paper, we propose to dig into uncertainty estimation for robust stereo matching. Specifically, to balance the disparity distribution, we employ pixel-level uncertainty estimation to adaptively adjust the disparity search space of the next stage, driving the network to progressively prune out the space of unlikely correspondences. Then, to cope with limited ground-truth data, we propose uncertainty-based pseudo-labels to adapt the pre-trained model to the new domain, where pixel-level and area-level uncertainty estimation filter out high-uncertainty pixels of the predicted disparity maps and generate sparse yet reliable pseudo-labels to narrow the domain gap. Experimentally, our method shows strong cross-domain, adaptation, and joint generalization ability and obtained 1st place on the stereo task of the Robust Vision Challenge 2020. Additionally, our uncertainty-based pseudo-labels can be extended to train monocular depth estimation networks in an unsupervised way and even achieve performance comparable to supervised methods. The code will be available at https://github.com/gallenszl/UCFNet.

* Accepted by TPAMI 
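
A minimal sketch of the pseudo-label filtering idea, assuming per-pixel uncertainties in [0, 1]; the thresholds and the simplified area-level check below are assumptions, not the UCFNet implementation.

```python
import torch

def sparse_pseudo_labels(disparity, uncertainty, pixel_thresh=0.3, area_thresh=0.5):
    """Filter a predicted disparity map into a sparse pseudo-label.

    disparity:    (B, H, W) predicted disparities on the target domain
    uncertainty:  (B, H, W) per-pixel uncertainty in [0, 1]
    pixel_thresh: keep pixels whose uncertainty is below this value
    area_thresh:  discard whole samples whose mean uncertainty is too high
                  (a stand-in for the paper's area-level check)
    """
    # Pixel-level filtering: mask out high-uncertainty predictions.
    valid = uncertainty < pixel_thresh

    # Area-level filtering: drop images that are unreliable overall.
    reliable_image = uncertainty.flatten(1).mean(dim=1) < area_thresh
    valid &= reliable_image.view(-1, 1, 1)

    pseudo = torch.where(valid, disparity, torch.zeros_like(disparity))
    return pseudo, valid  # zeros / False mark pixels excluded from the loss
```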

A Representation Separation Perspective to Correspondences-free Unsupervised 3D Point Cloud Registration

Mar 24, 2022
Zhiyuan Zhang, Jiadai Sun, Yuchao Dai, Dingfu Zhou, Xibin Song, Mingyi He

3D point cloud registration in the remote sensing field has been greatly advanced by deep learning-based methods, where the rigid transformation is either directly regressed from the two point clouds (correspondences-free approaches) or computed from learned correspondences (correspondences-based approaches). Existing correspondences-free methods generally learn a holistic representation of the entire point cloud, which is fragile for partial and noisy point clouds. In this paper, we propose a correspondences-free unsupervised point cloud registration (UPCR) method from the representation separation perspective. First, we model the input point cloud as a combination of a pose-invariant representation and a pose-related representation. Second, the pose-related representation is used to learn the relative pose with respect to a "latent canonical shape" for the source and target point clouds respectively. Third, the rigid transformation is obtained from the two learned relative poses. Our method not only filters out the disturbance in the pose-invariant representation but is also robust to partial-to-partial point clouds and noise. Experiments on benchmark datasets demonstrate that our unsupervised method achieves performance comparable to, if not better than, state-of-the-art supervised registration methods.

* Accepted by IEEE Geoscience and Remote Sensing Letters 
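
A minimal numpy sketch of the third step, assuming each branch outputs a rotation and translation that map its point cloud into the latent canonical frame (the function name and conventions are assumptions):

```python
import numpy as np

def compose_transform(R_src, t_src, R_tgt, t_tgt):
    """Combine two relative poses (each w.r.t. a latent canonical shape)
    into the source-to-target rigid transformation.

    R_src, t_src: rotation (3, 3) and translation (3,) mapping the source
                  point cloud into the canonical frame
    R_tgt, t_tgt: the same for the target point cloud
    """
    # canonical = R_src @ src + t_src and canonical = R_tgt @ tgt + t_tgt,
    # so tgt = R_tgt^T (R_src @ src + t_src - t_tgt).
    R = R_tgt.T @ R_src
    t = R_tgt.T @ (t_src - t_tgt)
    return R, t
```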

End-to-end Learning the Partial Permutation Matrix for Robust 3D Point Cloud Registration

Oct 28, 2021
Zhiyuan Zhang, Jiadai Sun, Yuchao Dai, Dingfu Zhou, Xibin Song, Mingyi He

Even though considerable progress has been made in deep learning-based 3D point cloud processing, obtaining accurate correspondences for robust registration remains a major challenge because existing hard assignment methods cannot deal with outliers naturally. Alternatively, soft matching-based methods have been proposed to learn matching probabilities rather than hard assignments. However, in this paper, we prove that these methods have an inherent ambiguity that causes many deceptive correspondences. To address the above challenges, we propose to learn a partial permutation matching matrix, which does not assign corresponding points to outliers and implements hard assignment to prevent ambiguity. This proposal, however, poses two new problems: existing hard assignment algorithms can only solve for a full-rank permutation matrix rather than a partial permutation matrix, and the desired matrix is defined in a discrete space, which is non-differentiable. In response, we design a dedicated soft-to-hard (S2H) matching procedure within the registration pipeline consisting of two steps: solving the soft matching matrix (S-step) and projecting this soft matrix to the partial permutation matrix (H-step). Specifically, we augment the profit matrix before the hard assignment to solve an augmented permutation matrix, which is cropped to obtain the final partial permutation matrix. Moreover, to guarantee end-to-end learning, we supervise the learned partial permutation matrix but propagate the gradient to the soft matrix instead. Our S2H matching procedure can be easily integrated with existing registration frameworks, which has been verified in representative frameworks including DCP, RPMNet, and DGR. Extensive experiments have validated our method, which achieves new state-of-the-art performance for robust 3D point cloud registration. The code will be made public.
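
A minimal sketch of the H-step, using the Hungarian solver from SciPy on an augmented profit matrix; the augmentation value and cropping rule below are assumptions, not the paper's exact scheme.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def soft_to_hard(soft_matrix, slack=0.1):
    """Project a soft matching matrix to a partial permutation matrix.

    soft_matrix: (N, M) matching probabilities from the S-step
    slack:       profit assigned to the augmented outlier bins (an assumption)
    """
    n, m = soft_matrix.shape
    # Augment the profit matrix so every point can also match an "outlier" bin,
    # letting the full-rank Hungarian solver express non-matches.
    profit = np.full((n + m, n + m), slack)
    profit[:n, :m] = soft_matrix

    rows, cols = linear_sum_assignment(profit, maximize=True)

    # Crop the augmented solution back to the original size: assignments that
    # land in the padded area correspond to unmatched (outlier) points.
    partial_perm = np.zeros((n, m))
    for r, c in zip(rows, cols):
        if r < n and c < m:
            partial_perm[r, c] = 1.0
    return partial_perm
```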

Self-supervised Monocular Depth Estimation for All Day Images using Domain Separation

Aug 17, 2021
Lina Liu, Xibin Song, Mengmeng Wang, Yong Liu, Liangjun Zhang

Remarkable results have been achieved by DCNN-based self-supervised depth estimation approaches. However, most of these approaches can only handle either day-time or night-time images, and their performance degrades on all-day images due to the large domain shift and illumination variation between day and night images. To address these limitations, we propose a domain-separated network for self-supervised depth estimation of all-day images. Specifically, to relieve the negative influence of disturbing terms (illumination, etc.), we partition the information of day and night image pairs into two complementary sub-spaces: a private domain and an invariant domain, where the former contains the unique information (illumination, etc.) of day and night images and the latter contains the essential shared information (texture, etc.). To guarantee that the day and night images contain the same information, the domain-separated network takes day-time images and corresponding night-time images (generated by a GAN) as input, and the private and invariant feature extractors are trained with orthogonality and similarity losses, alleviating the domain gap and yielding better depth maps. Meanwhile, reconstruction and photometric losses are utilized to estimate complementary information and depth maps effectively. Experimental results demonstrate that our approach achieves state-of-the-art depth estimation results for all-day images on the challenging Oxford RobotCar dataset, proving the superiority of our proposed approach.

* Accepted by ICCV 2021 
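
A minimal sketch of an orthogonality loss between private and invariant features; this is a common squared cross-correlation formulation and may differ from the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(private_feat, invariant_feat):
    """Encourage private and invariant features to encode different information.

    private_feat, invariant_feat: (B, C) feature vectors from the two extractors.
    """
    p = F.normalize(private_feat, dim=1)
    s = F.normalize(invariant_feat, dim=1)
    # Cross-correlation between the two sub-spaces should be close to zero.
    corr = p.t() @ s                      # (C, C)
    return (corr ** 2).sum() / private_feat.size(0)
```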

MapFusion: A General Framework for 3D Object Detection with HDMaps

Mar 10, 2021
Jin Fang, Dingfu Zhou, Xibin Song, Liangjun Zhang

3D object detection is a key perception component in autonomous driving. Most recent approaches rely on LiDAR sensors alone or fuse them with cameras. However, maps (e.g., High Definition maps), a basic infrastructure for intelligent vehicles, have not been well exploited to boost object detection. In this paper, we propose a simple but effective framework, MapFusion, to integrate map information into modern 3D object detection pipelines. In particular, we design a FeatureAgg module for HD map feature extraction and fusion, and a MapSeg module as an auxiliary segmentation head for the detection backbone. Our proposed MapFusion is detector-independent and can be easily integrated into different detectors. Experimental results with three different baselines on a large public autonomous driving dataset demonstrate the superiority of the proposed framework. By fusing map information, we achieve improvements of 1.27 to 2.79 points in mean Average Precision (mAP) over three strong 3D object detection baselines.
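
A rough sketch of how map features might be fused into a detector's BEV features; the module name, channel sizes, and rasterized map input are assumptions, not the MapFusion architecture.

```python
import torch
import torch.nn as nn

class FeatureAggSketch(nn.Module):
    """Hypothetical map-feature fusion into BEV detector features."""

    def __init__(self, bev_channels=128, map_channels=16):
        super().__init__()
        self.map_encoder = nn.Sequential(
            nn.Conv2d(map_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.fuse = nn.Conv2d(bev_channels + 32, bev_channels, kernel_size=1)

    def forward(self, bev_feat, map_raster):
        # bev_feat:   (B, bev_channels, H, W) from the detector backbone
        # map_raster: (B, map_channels, H, W) rasterized HD-map layers
        map_feat = self.map_encoder(map_raster)
        return self.fuse(torch.cat([bev_feat, map_feat], dim=1))
```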

IAFA: Instance-aware Feature Aggregation for 3D Object Detection from a Single Image

Mar 05, 2021
Dingfu Zhou, Xibin Song, Yuchao Dai, Junbo Yin, Feixiang Lu, Jin Fang, Miao Liao, Liangjun Zhang

3D object detection from a single image is an important task in Autonomous Driving (AD), and various approaches have been proposed. However, the task is intrinsically ambiguous and challenging, as single-image depth estimation is already an ill-posed problem. In this paper, we propose an instance-aware approach to aggregate useful information for improving the accuracy of 3D object detection, with the following contributions. First, an instance-aware feature aggregation (IAFA) module is proposed to collect local and global features for 3D bounding box regression. Second, we empirically find that the spatial attention module can be well learned by taking coarse-level instance annotations as a supervision signal. The proposed module significantly boosts the performance of the baseline method on both 3D detection and 2D bird's-eye-view vehicle detection across all three categories. Third, our proposed method outperforms all single image-based approaches (even those trained with depth as an auxiliary input) and achieves state-of-the-art 3D detection performance on the KITTI benchmark.

* Accepted by ACCV2020 
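
A rough sketch of attention-weighted feature aggregation supervised by coarse instance masks; the layer layout and names are assumptions rather than the IAFA module itself.

```python
import torch
import torch.nn as nn

class InstanceAttentionSketch(nn.Module):
    """Aggregate features with a spatial attention map that can be
    supervised by coarse instance annotations."""

    def __init__(self, channels=256):
        super().__init__()
        self.attn_head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat, instance_mask=None):
        # feat: (B, C, H, W) image features; instance_mask: (B, 1, H, W) coarse labels
        attn = torch.sigmoid(self.attn_head(feat))           # (B, 1, H, W)
        aggregated = (feat * attn).flatten(2).sum(dim=2)      # (B, C) global pooling
        attn_loss = None
        if instance_mask is not None:
            # Coarse instance annotations act as the supervision signal.
            attn_loss = nn.functional.binary_cross_entropy(attn, instance_mask.float())
        return aggregated, attn_loss
```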

FCFR-Net: Feature Fusion based Coarse-to-Fine Residual Learning for Monocular Depth Completion

Dec 15, 2020
Lina Liu, Xibin Song, Xiaoyang Lyu, Junwei Diao, Mengmeng Wang, Yong Liu, Liangjun Zhang

Depth completion aims to recover a dense depth map from a sparse depth map with the corresponding color image as input. Recent approaches mainly formulate depth completion as a one-stage end-to-end learning task that outputs dense depth maps directly. However, the feature extraction and supervision in one-stage frameworks are insufficient, limiting the performance of these approaches. To address this problem, we propose a novel end-to-end residual learning framework that formulates depth completion as a two-stage learning task, i.e., a sparse-to-coarse stage and a coarse-to-fine stage. First, a coarse dense depth map is obtained by a simple CNN framework. Then, a refined depth map is obtained using a residual learning strategy in the coarse-to-fine stage, with the coarse depth map and color image as input. Specifically, in the coarse-to-fine stage, a channel shuffle extraction operation is utilized to extract more representative features from the color image and coarse depth map, and an energy-based fusion operation is exploited to effectively fuse the features obtained by the channel shuffle operation, leading to more accurate and refined depth maps. We achieve state-of-the-art RMSE on the KITTI benchmark. Extensive experiments on other datasets further demonstrate the superiority of our approach over current state-of-the-art depth completion approaches.

* 9 pages, 5 figures. Accepted to 35th AAAI Conference on Artificial Intelligence (AAAI 2021) 
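
A minimal sketch of a standard channel shuffle operation (as popularized by ShuffleNet), which the channel shuffle extraction above presumably builds on; the paper's exact extraction block may differ.

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channels across groups.

    x: (B, C, H, W) features, e.g., concatenated from the color image
       and the coarse depth map
    """
    b, c, h, w = x.shape
    assert c % groups == 0
    # Reshape into groups, swap the group and channel axes, then flatten back,
    # so that channels originating from different inputs are interleaved.
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)
```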

AES: Autonomous Excavator System for Real-World and Hazardous Environments

Nov 10, 2020
Jinxin Zhao, Pinxin Long, Liyang Wang, Lingfeng Qian, Feixiang Lu, Xibin Song, Dinesh Manocha, Liangjun Zhang

Excavators are widely used for material-handling applications in unstructured environments, including mining and construction. The global excavator market was valued at 44.12 billion USD in 2018 and is predicted to grow to 63.14 billion USD by 2026. Operating excavators in real-world environments can be challenging due to extreme conditions such as rock sliding, ground collapse, or excessive dust. Multiple fatalities and injuries occur each year during excavations. An autonomous excavator that can substitute for human operators in these hazardous environments would substantially lower the number of injuries and improve overall productivity.

PerMO: Perceiving More at Once from a Single Image for Autonomous Driving

Jul 16, 2020
Feixiang Lu, Zongdai Liu, Xibin Song, Dingfu Zhou, Wei Li, Hui Miao, Miao Liao, Liangjun Zhang, Bin Zhou, Ruigang Yang, Dinesh Manocha

We present a novel approach to detect, segment, and reconstruct complete textured 3D models of vehicles from a single image for autonomous driving. Our approach combines the strengths of deep learning with the elegance of traditional part-based deformable model representations to produce high-quality 3D models in the presence of severe occlusions. We present a new part-based deformable vehicle model that is used for instance segmentation, and we automatically generate a dataset that contains dense correspondences between 2D images and 3D models. We also present a novel end-to-end deep neural network to predict dense 2D/3D mapping and highlight its benefits. Based on the dense mapping, we are able to compute precise 6-DoF poses and 3D reconstruction results at almost interactive rates on a commodity GPU. We have integrated these algorithms with an autonomous driving system. In practice, our method outperforms the state-of-the-art methods on all major vehicle parsing tasks: 2D instance segmentation by 4.4 points (mAP), 6-DoF pose estimation by 9.11 points, and 3D detection by 1.37 points. Moreover, we have released all of the source code, dataset, and trained models on GitHub.
