Jin Xie

SGFeat: Salient Geometric Feature for Point Cloud Registration

Sep 12, 2023
Qianliang Wu, Yaqing Ding, Lei Luo, Chuanwei Zhou, Jin Xie, Jian Yang

Point Cloud Registration (PCR) is a critical and challenging task in computer vision. One of the primary difficulties in PCR is identifying salient and meaningful points that exhibit consistent semantic and geometric properties across different scans. Previous methods have encountered ambiguous matching due to the similarity among patch blocks throughout the entire point cloud and the lack of efficient global geometric consistency. To address these issues, we propose a new framework that includes several novel techniques. First, we introduce a semantic-aware geometric encoder that combines object-level and patch-level semantic information. This encoder significantly improves registration recall by reducing ambiguity in patch-level superpoint matching. In addition, we incorporate prior knowledge in the form of an intrinsic shape signature to identify salient points, which enables us to extract the most salient superpoints and meaningful dense points in the scene. Second, we introduce a transformer that encodes High-Order (HO) geometric features. These features are crucial for identifying salient points within initial overlap regions while accounting for global high-order geometric consistency. To further optimize this high-order transformer, we introduce an anchor node selection strategy. By encoding inter-frame triangle or polyhedron consistency features based on these anchor nodes, we can effectively learn high-order geometric features of salient superpoints. These high-order features are then propagated to dense points and utilized by a Sinkhorn matching module to identify key correspondences for successful registration. In experiments on the well-known 3DMatch/3DLoMatch and KITTI benchmarks, our approach shows promising results, highlighting the effectiveness of the proposed method.
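
As a rough illustration of the final step described above (propagating features to dense points and feeding them to a Sinkhorn matching module), here is a minimal sketch of Sinkhorn-based superpoint matching on precomputed descriptors; the descriptor size, temperature, and top-k values are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def sinkhorn(log_scores: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Alternately normalize rows and columns of a log-score matrix."""
    for _ in range(n_iters):
        log_scores = log_scores - torch.logsumexp(log_scores, dim=1, keepdim=True)
        log_scores = log_scores - torch.logsumexp(log_scores, dim=0, keepdim=True)
    return log_scores.exp()

def match_superpoints(desc_src, desc_tgt, temperature=0.1, top_k=64):
    # Cosine similarity between source and target superpoint descriptors.
    src = F.normalize(desc_src, dim=1)
    tgt = F.normalize(desc_tgt, dim=1)
    scores = src @ tgt.t() / temperature
    assignment = sinkhorn(scores)                  # soft, approximately doubly-stochastic
    conf, flat_idx = assignment.flatten().topk(top_k)
    rows = torch.div(flat_idx, assignment.shape[1], rounding_mode="floor")
    cols = flat_idx % assignment.shape[1]
    return torch.stack([rows, cols], dim=1), conf  # (top_k, 2) correspondences + confidences

# Toy usage with hypothetical 256-d descriptors for 128 and 140 superpoints.
corr, conf = match_superpoints(torch.randn(128, 256), torch.randn(140, 256))
```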

Implicit Obstacle Map-driven Indoor Navigation Model for Robust Obstacle Avoidance

Aug 24, 2023
Wei Xie, Haobo Jiang, Shuo Gu, Jin Xie

Robust obstacle avoidance is one of the critical steps for successful goal-driven indoor navigation. Because obstacles may be missing from the visual image or missed by the detector, visual image-based obstacle avoidance techniques still suffer from unsatisfactory robustness. To mitigate this, we propose a novel implicit obstacle map-driven indoor navigation framework for robust obstacle avoidance, in which an implicit obstacle map is learned from historical trial-and-error experience rather than from the visual image. To further improve navigation efficiency, a non-local target memory aggregation module leverages a non-local network to model the intrinsic relationship between the target semantics and the target orientation clues during navigation, so as to mine the object clues most correlated with the target for the navigation decision. Extensive experimental results on the AI2-Thor and RoboTHOR benchmarks verify the excellent obstacle avoidance and navigation efficiency of our proposed method. The core source code is available at https://github.com/xwaiyy123/object-navigation.
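
The following toy sketch conveys the trial-and-error idea with an explicit grid of collision-driven obstacle beliefs; the paper instead learns an implicit obstacle map, so the class below (grid size, update rate, threshold) is purely a hypothetical simplification of the mechanism.

```python
import numpy as np

class ObstacleBelief:
    """Toy explicit stand-in for a trial-and-error obstacle map (the paper's map is implicit)."""
    def __init__(self, size=100, prior=0.1, lr=0.3):
        self.p = np.full((size, size), prior)   # belief that each cell is blocked
        self.lr = lr

    def update(self, cell, collided: bool):
        x, y = cell
        target = 1.0 if collided else 0.0
        # Exponential moving average toward the latest collision outcome.
        self.p[x, y] += self.lr * (target - self.p[x, y])

    def is_blocked(self, cell, threshold=0.5) -> bool:
        return self.p[cell] > threshold

belief = ObstacleBelief()
belief.update((50, 51), collided=True)      # a failed forward step marks the cell as risky
print(belief.is_blocked((50, 51)))
```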

* 9 pages, 7 figures, 43 references. This paper has been accepted for ACM MM 2023 

DFormer: Diffusion-guided Transformer for Universal Image Segmentation

Jun 08, 2023
Hefeng Wang, Jiale Cao, Rao Muhammad Anwer, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang

This paper introduces an approach, named DFormer, for universal image segmentation. The proposed DFormer views the universal image segmentation task as a denoising process using a diffusion model. DFormer first adds various levels of Gaussian noise to ground-truth masks, and then learns a model to predict the denoised masks from the corrupted masks. Specifically, we take deep pixel-level features along with the noisy masks as inputs to generate mask features and attention masks, employing a diffusion-based decoder to perform mask prediction gradually. At inference, our DFormer directly predicts the masks and corresponding categories from a set of randomly generated masks. Extensive experiments reveal the merits of our proposed contributions on different image segmentation tasks: panoptic segmentation, instance segmentation, and semantic segmentation. Our DFormer outperforms the recent diffusion-based panoptic segmentation method Pix2Seq-D with a gain of 3.6% on the MS COCO val2017 set. Further, DFormer achieves promising semantic segmentation performance, outperforming the recent diffusion-based method by 2.2% on the ADE20K val set. Our source code and models will be publicly available at https://github.com/cp3wan/DFormer
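
A minimal sketch of the mask-noising training step summarized above: ground-truth masks are corrupted with Gaussian noise at a sampled timestep, and a model is trained to recover them. The tiny convolutional denoiser and the noise schedule here are placeholders, not DFormer's diffusion-guided transformer decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(masks, t):
    """Corrupt binary masks (scaled to [-1, 1]) with Gaussian noise at timestep t."""
    x0 = masks * 2.0 - 1.0
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)

# Placeholder denoiser; the paper uses a diffusion-guided transformer decoder instead.
denoiser = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 1, 3, padding=1))

masks = (torch.rand(4, 1, 64, 64) > 0.5).float()     # toy ground-truth masks
t = torch.randint(0, T, (4,))                        # random noise level per sample
loss = F.mse_loss(denoiser(q_sample(masks, t)), masks * 2.0 - 1.0)
loss.backward()
```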

* Project website: https://github.com/cp3wan/DFormer 

Self-Supervised 3D Scene Flow Estimation Guided by Superpoints

May 04, 2023
Yaqi Shen, Le Hui, Jin Xie, Jian Yang

3D scene flow estimation aims to estimate point-wise motion between two consecutive frames of point clouds. Superpoints, i.e., points with similar geometric features, are usually employed to capture similar motions of local regions in 3D scenes for scene flow estimation. However, in existing methods superpoints are generated by offline clustering, which cannot characterize well the local regions with similar motions in complex 3D scenes, leading to inaccurate scene flow estimation. To this end, we propose an iterative end-to-end superpoint-based scene flow estimation framework, where the superpoints are dynamically updated to guide the point-level flow prediction. Specifically, our framework consists of a flow-guided superpoint generation module and a superpoint-guided flow refinement module. In the superpoint generation module, we utilize the bidirectional flow information from the previous iteration to match points with superpoint centers and construct soft point-to-superpoint associations, from which superpoints are generated for the pairwise point clouds. With the generated superpoints, we first reconstruct the flow for each point by adaptively aggregating the superpoint-level flow, and then encode the consistency between the reconstructed flows of the pairwise point clouds. Finally, we feed the consistency encoding, along with the reconstructed flow, into a GRU to refine the point-level flow. Extensive experiments on several datasets show that our method achieves promising performance.
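
The soft point-to-superpoint association and flow reconstruction step described above can be sketched as follows; the point counts and softmax temperature are illustrative assumptions.

```python
import torch

def reconstruct_point_flow(points, sp_centers, sp_flow, temperature=0.05):
    """points: (N, 3); sp_centers: (M, 3); sp_flow: (M, 3) -> per-point flow (N, 3)."""
    dist = torch.cdist(points, sp_centers)              # (N, M) Euclidean distances
    assoc = torch.softmax(-dist / temperature, dim=1)   # soft point-to-superpoint weights
    return assoc @ sp_flow                              # adaptively aggregated superpoint flow

points = torch.randn(2048, 3)
sp_centers = torch.randn(64, 3)
sp_flow = torch.randn(64, 3)
point_flow = reconstruct_point_flow(points, sp_centers, sp_flow)   # (2048, 3)
```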

* CVPR 2023 

Transformer-based stereo-aware 3D object detection from binocular images

Apr 24, 2023
Hanqing Sun, Yanwei Pang, Jiale Cao, Jin Xie, Xuelong Li

Vision Transformers have shown promising progress in various object detection tasks, including monocular 2D/3D detection and surround-view 3D detection. However, when used in essential and classic stereo 3D object detection, directly adopting those surround-view Transformers leads to slow convergence and significant precision drops. We argue that one of the causes of this defect is that the surround-view Transformers do not consider the stereo-specific image correspondence information. In a surround-view system, the overlapping areas are small, and thus correspondence is not a primary issue. In this paper, we explore the model design of vision Transformers in stereo 3D object detection, focusing particularly on extracting and encoding the task-specific image correspondence information. To achieve this goal, we present TS3D, a Transformer-based Stereo-aware 3D object detector. In the TS3D, a Disparity-Aware Positional Encoding (DAPE) model is proposed to embed the image correspondence information into stereo features. The correspondence is encoded as normalized disparity and is used in conjunction with sinusoidal 2D positional encoding to provide the location information of the 3D scene. To extract enriched multi-scale stereo features, we propose a Stereo Reserving Feature Pyramid Network (SRFPN). The SRFPN is designed to reserve the correspondence information while fusing intra-scale and aggregating cross-scale stereo features. Our proposed TS3D achieves a 41.29% Moderate Car detection average precision on the KITTI test set and takes 88 ms to detect objects from each binocular image pair. It is competitive with advanced counterparts in terms of both precision and inference speed.
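
To make the DAPE idea concrete, here is a hedged sketch of a disparity-aware positional encoding: sinusoidal embeddings of normalized image position combined with normalized disparity. The channel layout, frequency count, and maximum disparity are assumptions, not the paper's exact design.

```python
import torch

def sinusoid(x, num_freqs=32):
    """x: values in [0, 1] -> (..., 2 * num_freqs) sinusoidal embedding."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype)
    angles = x.unsqueeze(-1) * freqs * torch.pi
    return torch.cat([angles.sin(), angles.cos()], dim=-1)

def disparity_aware_pe(disparity, max_disp=192.0):
    """disparity: (H, W) map -> (H, W, C) positional encoding with disparity folded in."""
    h, w = disparity.shape
    v, u = torch.meshgrid(torch.linspace(0, 1, h), torch.linspace(0, 1, w), indexing="ij")
    d = (disparity / max_disp).clamp(0, 1)              # normalized disparity
    return torch.cat([sinusoid(u), sinusoid(v), sinusoid(d)], dim=-1)

pe = disparity_aware_pe(torch.rand(48, 156) * 192)       # (48, 156, 192) encoding
```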

Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning

Apr 13, 2023
Kaiyou Song, Jin Xie, Shan Zhang, Zimeng Luo

Self-supervised learning (SSL) has made remarkable progress in visual representation learning. Some studies combine SSL with knowledge distillation (SSL-KD) to boost the representation learning performance of small models. In this study, we propose a Multi-mode Online Knowledge Distillation method (MOKD) to boost self-supervised visual representation learning. Different from existing SSL-KD methods that transfer knowledge from a static pre-trained teacher to a student, in MOKD two different models learn collaboratively in a self-supervised manner. Specifically, MOKD consists of two distillation modes: self-distillation and cross-distillation. Self-distillation performs self-supervised learning for each model independently, while cross-distillation realizes knowledge interaction between the two models. In cross-distillation, a cross-attention feature search strategy is proposed to enhance the semantic feature alignment between the models. As a result, the two models can absorb knowledge from each other to boost their representation learning performance. Extensive experimental results on different backbones and datasets demonstrate that two heterogeneous models can benefit from MOKD and outperform their independently trained baselines. In addition, MOKD also outperforms existing SSL-KD methods for both the student and teacher models.
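
A minimal sketch of the two-model cross-distillation idea, assuming a simple cosine alignment between each model's features and the other's detached features; the paper's cross-attention feature search and the full self-distillation branches are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two toy encoders standing in for the two collaboratively trained models.
model_a = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
model_b = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))

def cross_distill(feat_a, feat_b):
    # Each model aligns with the other's representation; gradients flow to one side at a time.
    loss_ab = 1 - F.cosine_similarity(feat_a, feat_b.detach(), dim=1).mean()
    loss_ba = 1 - F.cosine_similarity(feat_b, feat_a.detach(), dim=1).mean()
    return loss_ab + loss_ba

images = torch.randn(8, 3, 32, 32)
loss = cross_distill(model_a(images), model_b(images))   # added to each model's own SSL loss
loss.backward()
```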

* Accepted by CVPR 2023 

Hard Patches Mining for Masked Image Modeling

Apr 12, 2023
Haochen Wang, Kaiyou Song, Junsong Fan, Yuxi Wang, Jin Xie, Zhaoxiang Zhang

Masked image modeling (MIM) has attracted much research attention due to its promising potential for learning scalable visual representations. In typical approaches, models usually focus on predicting the specific contents of masked patches, and their performance is highly dependent on pre-defined masking strategies. Intuitively, this procedure can be viewed as training a student (the model) to solve given problems (predicting masked patches). However, we argue that the model should not only solve given problems, but also stand in the shoes of a teacher and produce more challenging problems itself. To this end, we propose Hard Patches Mining (HPM), a brand-new framework for MIM pre-training. We observe that the reconstruction loss can naturally serve as a metric of the difficulty of the pre-training task. Therefore, we introduce an auxiliary loss predictor, which first predicts patch-wise losses and then decides where to mask next. It adopts a relative relationship learning strategy to prevent overfitting to exact reconstruction loss values. Experiments under various settings demonstrate the effectiveness of HPM in constructing masked images. Furthermore, we empirically find that solely introducing the loss prediction objective leads to powerful representations, verifying the efficacy of being aware of where it is hard to reconstruct.
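
The two mechanisms described above (masking where the predicted loss is highest, and training the loss predictor with a relative rather than absolute objective) can be sketched as follows; the patch count, mask ratio, and the particular pairwise ranking loss are illustrative choices.

```python
import torch
import torch.nn.functional as F

def hard_patch_mask(pred_losses, mask_ratio=0.75):
    """pred_losses: (B, N) predicted per-patch difficulty -> bool mask of the hardest patches."""
    num_mask = int(pred_losses.shape[1] * mask_ratio)
    idx = pred_losses.argsort(dim=1, descending=True)[:, :num_mask]
    mask = torch.zeros_like(pred_losses)
    mask.scatter_(1, idx, 1.0)
    return mask.bool()

def ranking_loss(pred_losses, true_losses):
    """Preserve the pairwise ordering of true losses instead of regressing their exact values."""
    dp = pred_losses.unsqueeze(2) - pred_losses.unsqueeze(1)   # (B, N, N) predicted gaps
    dt = true_losses.unsqueeze(2) - true_losses.unsqueeze(1)   # (B, N, N) true gaps
    return F.relu(-dp * torch.sign(dt)).mean()                 # penalize inverted pairs

pred = torch.rand(2, 196, requires_grad=True)    # 196 patches for a 14x14 ViT grid
mask = hard_patch_mask(pred.detach())            # where to mask next
loss = ranking_loss(pred, torch.rand(2, 196))    # supervise with observed reconstruction losses
loss.backward()
```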

* Accepted to CVPR 2023 

Robust Outlier Rejection for 3D Registration with Variational Bayes

Apr 04, 2023
Haobo Jiang, Zheng Dang, Zhen Wei, Jin Xie, Jian Yang, Mathieu Salzmann

Learning-based outlier (mismatched correspondence) rejection for robust 3D registration generally formulates outlier removal as an inlier/outlier classification problem. The key to making this successful is learning discriminative inlier/outlier feature representations. In this paper, we develop a novel variational non-local network-based outlier rejection framework for robust alignment. By reformulating non-local feature learning with variational Bayesian inference, Bayesian-driven long-range dependencies can be modeled to aggregate discriminative geometric context information for inlier/outlier distinction. Specifically, to achieve such Bayesian-driven contextual dependencies, each query/key/value component in our non-local network predicts a prior feature distribution and a posterior one. Embedded with the inlier/outlier label, the posterior feature distribution is label-dependent and discriminative. Thus, pushing the prior to be close to the discriminative posterior during training enables the features sampled from this prior at test time to model high-quality long-range dependencies. Notably, to achieve effective posterior feature guidance, a specific probabilistic graphical model is designed over our non-local model, which lets us derive a variational lower bound as our optimization objective for model training. Finally, we propose a voting-based inlier searching strategy to cluster high-quality hypothetical inliers for transformation estimation. Extensive experiments on the 3DMatch, 3DLoMatch, and KITTI datasets verify the effectiveness of our method.
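
A highly simplified sketch of the variational feature idea: a layer predicts a prior Gaussian and a label-conditioned posterior Gaussian over features, is trained with a KL term pulling the prior toward the posterior, and samples from the prior at test time. The layer sizes and the placement inside a non-local block are assumptions.

```python
import torch
import torch.nn as nn

class VariationalFeature(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.prior_head = nn.Linear(dim, dim * 2)        # from correspondence features only
        self.post_head = nn.Linear(dim + 1, dim * 2)     # also sees the inlier/outlier label

    def forward(self, feats, labels=None):
        p_mu, p_logvar = self.prior_head(feats).chunk(2, dim=-1)
        if labels is None:                               # test time: sample from the prior
            return p_mu + torch.randn_like(p_mu) * (0.5 * p_logvar).exp(), None
        q_in = torch.cat([feats, labels.unsqueeze(-1)], dim=-1)
        q_mu, q_logvar = self.post_head(q_in).chunk(2, dim=-1)
        z = q_mu + torch.randn_like(q_mu) * (0.5 * q_logvar).exp()   # reparameterized sample
        # KL(q || p) for diagonal Gaussians, pulling the prior toward the posterior.
        kl = 0.5 * (p_logvar - q_logvar + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp() - 1)
        return z, kl.mean()

layer = VariationalFeature()
z, kl = layer(torch.randn(512, 64), labels=torch.randint(0, 2, (512,)).float())
```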

* Accepted by CVPR2023 

LEAPS: End-to-End One-Step Person Search With Learnable Proposals

Mar 21, 2023
Zhiqiang Dong, Jiale Cao, Rao Muhammad Anwer, Jin Xie, Fahad Khan, Yanwei Pang

We propose an end-to-end one-step person search approach with learnable proposals, named LEAPS. Given a set of sparse and learnable proposals, LEAPS employs a dynamic person search head to directly perform person detection and corresponding re-id feature generation without non-maximum suppression post-processing. The dynamic person search head comprises a detection head and a novel flexible re-id head. Our flexible re-id head first employs a dynamic region-of-interest (RoI) operation to extract discriminative RoI features of the proposals. Then, it generates re-id features using a plain and a hierarchical interaction re-id module. To better guide discriminative re-id feature learning, we introduce a diverse re-id sample matching strategy, instead of the bipartite matching used in the detection head. Comprehensive experiments reveal the benefit of the proposed LEAPS, which achieves favorable performance on two public person search benchmarks: CUHK-SYSU and PRW. When using the same ResNet50 backbone, LEAPS obtains a mAP score of 55.0%, outperforming the best results reported in the literature by 1.7%, while achieving around a two-fold speedup on the challenging PRW dataset. Our source code and models will be released.
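
The learnable-proposals idea can be sketched as a set of box parameters optimized together with the network, from which RoI features are pooled for the detection and re-id heads; the torchvision roi_align call below stands in for the paper's dynamic RoI operation, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class LearnableProposals(nn.Module):
    def __init__(self, num_proposals=100, image_size=800):
        super().__init__()
        # Normalized (x1, y1, x2, y2) boxes, initialized to cover the whole image.
        self.boxes = nn.Parameter(torch.tensor([[0.0, 0.0, 1.0, 1.0]]).repeat(num_proposals, 1))
        self.image_size = image_size

    def forward(self, feature_map):
        boxes = self.boxes.clamp(0, 1) * self.image_size
        batch_idx = torch.zeros(len(boxes), 1, device=boxes.device)
        rois = torch.cat([batch_idx, boxes], dim=1)          # (N, 5) as roi_align expects
        return roi_align(feature_map, rois, output_size=7, spatial_scale=1 / 32)

props = LearnableProposals()
roi_feats = props(torch.randn(1, 256, 25, 25))               # (100, 256, 7, 7) proposal features
```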

* 11 pages, 9 figures 

Large-scale Point Cloud Registration Based on Graph Matching Optimization

Feb 16, 2023
Qianliang Wu, Yaqi Shen, Guofeng Mei, Yaqing Ding, Lei Luo, Jin Xie, Jian Yang

Point cloud registration is a fundamental and challenging problem in 3D computer vision. It has been shown that the isometric transformation is an essential property in rigid point cloud registration, but existing methods only utilize it in the outlier rejection stage. In this paper, we emphasize that the isometric transformation is also important in the feature learning stage for improving registration quality. We propose a Graph Matching Optimization based Network (GMONet for short), which uses graph matching to explicitly impose isometry-preserving constraints in the point feature learning stage to improve the point representation. Specifically, we exploit a partial graph matching constraint to enhance the overlap region detection ability of superpoints (i.e., down-sampled key points) and full graph matching to refine the registration accuracy in the fine-level overlap region. Meanwhile, we leverage mini-batch sampling to improve the efficiency of the full graph matching optimization. Given the highly discriminative point features at the evaluation stage, we use RANSAC to estimate the transformation between the scanned pairs. The proposed method has been evaluated on the 3DMatch/3DLoMatch benchmarks and the KITTI benchmark. The experimental results show that our method achieves competitive performance compared with existing state-of-the-art baselines.
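
To illustrate the isometry-preserving constraint emphasized above: under a rigid transformation, pairwise distances among matched points agree across the two frames, so their mismatch can serve as a training signal. The sketch below shows this consistency term only; it is not GMONet's actual graph matching optimization.

```python
import torch

def isometry_consistency_loss(src_pts, tgt_pts, soft_assignment):
    """src_pts: (N, 3); tgt_pts: (M, 3); soft_assignment: (N, M) with rows summing to 1."""
    matched_tgt = soft_assignment @ tgt_pts             # soft correspondences in the target frame
    d_src = torch.cdist(src_pts, src_pts)               # intra-frame pairwise distances
    d_tgt = torch.cdist(matched_tgt, matched_tgt)
    return (d_src - d_tgt).abs().mean()                 # zero for a perfect rigid match

src = torch.randn(256, 3)
tgt = src + torch.tensor([0.5, 0.0, 0.0])               # toy rigid (translated) copy
assign = torch.eye(256)                                  # identity matching for the toy case
print(isometry_consistency_loss(src, tgt, assign))       # approximately 0
```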
