Large-scale place recognition is a fundamental but challenging task that plays an increasingly important role in autonomous driving and robotics. Existing methods have achieved acceptable performance, but most of them concentrate on designing elaborate global descriptor learning networks, while the importance of feature generalization and descriptor post-enhancement has long been neglected. In this work, we propose a novel method named GIDP to learn a Good Initialization and Inducing Descriptor Post-enhancing for Large-scale Place Recognition. In particular, GIDP comprises an unsupervised momentum-contrast point cloud pretraining module and a reranking-based descriptor post-enhancing module. The former learns a good initialization for the point cloud encoding network before the place recognition model is trained, while the latter post-enhances the predicted global descriptor through reranking at inference time. Extensive experiments on both indoor and outdoor datasets demonstrate that our method achieves state-of-the-art performance using simple and general point cloud encoding backbones.
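As a concrete illustration of the pretraining idea, the sketch below shows one MoCo-style momentum-contrast step for a point cloud encoder pair; the encoder interfaces, queue layout, and hyperparameters are illustrative assumptions, not GIDP's actual implementation.

```python
import torch
import torch.nn.functional as F

def momentum_contrast_step(q_enc, k_enc, queue, pc_q, pc_k, m=0.999, tau=0.07):
    """One contrastive step on two augmented views (pc_q, pc_k) of a scan."""
    q = F.normalize(q_enc(pc_q), dim=1)                 # query descriptors (B, D)
    with torch.no_grad():
        # Momentum update: the key encoder slowly trails the query encoder.
        for pk, pq in zip(k_enc.parameters(), q_enc.parameters()):
            pk.data.mul_(m).add_(pq.data, alpha=1 - m)
        k = F.normalize(k_enc(pc_k), dim=1)             # key descriptors (B, D)
    l_pos = (q * k).sum(dim=1, keepdim=True)            # positive logits (B, 1)
    l_neg = q @ queue.t()                               # negatives from the queue (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(len(logits), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels), k           # enqueue k after the step
```

After each step, the keys k would be enqueued and the oldest entries dequeued, keeping a large pool of negatives at constant memory, which is the usual motivation for the momentum-contrast design.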
In this paper, we study the new topic of object effects recommendation on micro-video platforms, a challenging but important task for many practical applications such as advertisement insertion. To avoid the background bias introduced by learning video content directly from image frames, we propose to exploit the meaningful body language hidden in 3D human pose for recommendation. To this end, a novel human-pose-driven object effects recommendation network termed PoseRec is introduced. PoseRec leverages the advantages of 3D human pose detection and learns information from multi-frame 3D human poses for video-item registration, resulting in high-quality object effects recommendation. Moreover, to address the inherent ambiguity and sparsity issues in object effects recommendation, we further propose a novel item-aware implicit prototype learning module and a novel pose-aware transductive hard-negative mining module to better learn pose-item relationships. In addition, to benchmark methods for this new research topic, we build a new dataset for object effects recommendation named Pose-OBE. Extensive experiments on Pose-OBE demonstrate that our method achieves superior performance over strong baselines.
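As a rough sketch of how multiple prototypes per item could absorb the ambiguity between body language and effects, the hypothetical module below scores a pose embedding against learnable item prototypes; the names, shapes, and max-pooling choice are assumptions for illustration, not PoseRec's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitPrototypeScorer(nn.Module):
    def __init__(self, num_items, num_protos, dim):
        super().__init__()
        # Each item owns several learnable prototypes, so one effect can
        # match several distinct body-language patterns (the ambiguity).
        self.protos = nn.Parameter(torch.randn(num_items, num_protos, dim))

    def forward(self, pose_emb):
        """pose_emb: (B, D) embedding of a multi-frame 3D pose sequence."""
        protos = F.normalize(self.protos, dim=-1)        # (I, P, D)
        pose = F.normalize(pose_emb, dim=-1)             # (B, D)
        sim = torch.einsum('bd,ipd->bip', pose, protos)  # (B, I, P)
        return sim.max(dim=-1).values                    # best prototype per item (B, I)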
MR-STAT is an emerging quantitative magnetic resonance imaging technique that aims to obtain multi-parametric tissue parameter maps from single short scans. It describes the relationship between the spatial-domain tissue parameters and the time-domain measured signal using a comprehensive, volumetric forward model. MR-STAT reconstruction solves a large-scale nonlinear problem and is therefore computationally very challenging. In previous work, MR-STAT reconstruction from Cartesian readout data was accelerated by approximating the Hessian matrix with sparse, banded blocks, and could be performed on high-performance CPU clusters within tens of minutes. In the current work, we propose an accelerated Cartesian MR-STAT algorithm incorporating two strategies: first, a neural network is trained as a fast surrogate to learn the magnetization signal not only in the full time domain but also in a compressed low-rank domain; second, based on the surrogate model, the Cartesian MR-STAT problem is reformulated and split into smaller sub-problems by the alternating direction method of multipliers (ADMM). The proposed method substantially reduces both runtime and memory requirements. Simulated and in-vivo balanced MR-STAT experiments show reconstruction results similar to those of the previous sparse Hessian method, with reconstruction times at least 40 times shorter. Incorporating sensitivity encoding and regularization terms is straightforward and yields better image quality at a negligible increase in reconstruction time. The proposed algorithm reconstructs both balanced and gradient-spoiled in-vivo data within 3 minutes on a desktop PC, and could thereby facilitate the translation of MR-STAT to clinical settings.
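The MR-STAT sub-problem splitting is specific to its surrogate model, but the ADMM mechanism it relies on can be illustrated with a classic textbook instance, the lasso problem; the sketch below shows this generic splitting pattern, not the MR-STAT reconstruction itself.

```python
# ADMM for: minimize 0.5*||A x - b||^2 + lam*||z||_1  subject to  x = z.
import numpy as np

def admm_lasso(A, b, lam, rho=1.0, iters=100):
    n = A.shape[1]
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)    # u: scaled dual variable
    AtA, Atb = A.T @ A, A.T @ b
    L = np.linalg.cholesky(AtA + rho * np.eye(n))      # factor once, reuse every iteration
    for _ in range(iters):
        # x-update: a small quadratic sub-problem with the cached factorization.
        x = np.linalg.solve(L.T, np.linalg.solve(L, Atb + rho * (z - u)))
        # z-update: fully separable soft-thresholding (the "split" part).
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        u = u + x - z                                  # dual ascent on x = z
    return x
```

The appeal for large problems is the same as stated in the abstract: each iteration only solves small, structured sub-problems instead of one monolithic nonlinear system.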
Point clouds scanned by real-world sensors are always incomplete, irregular, and noisy, making point cloud completion increasingly important. Although many point cloud completion methods have been proposed, most of them require a large number of paired complete-incomplete point clouds for training, which is labor-intensive. In contrast, this paper proposes a novel Reconstruction-Aware Prior Distillation semi-supervised point cloud completion method named RaPD, which adopts a two-stage training scheme to reduce the dependence on large-scale paired datasets. In training stage 1, a deep semantic prior is learned from both unpaired complete and unpaired incomplete point clouds via a reconstruction-aware pretraining process. In training stage 2, we introduce a semi-supervised prior distillation process, in which an encoder-decoder-based completion network is trained by distilling the prior into the network using only a small number of paired training samples. A self-supervised completion module is further introduced to exploit the large number of unpaired incomplete point clouds, further improving the network's performance. Extensive experiments on several widely used datasets demonstrate that RaPD, the first semi-supervised point cloud completion method, achieves superior performance to previous methods in both homologous and heterologous scenarios.
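The following is a minimal sketch of what a stage-2 objective along these lines could look like, assuming a frozen stage-1 prior encoder and a feature-space distillation term; the exact losses and module interfaces in RaPD are not specified here, so these are illustrative choices.

```python
import torch
import torch.nn.functional as F

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a: (B, N, 3), b: (B, M, 3)."""
    d = torch.cdist(a, b)                          # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def stage2_loss(student_enc, decoder, prior_enc, partial, gt, w_distill=1.0):
    feat = student_enc(partial)
    with torch.no_grad():
        prior_feat = prior_enc(partial)            # frozen stage-1 semantic prior
    l_distill = F.mse_loss(feat, prior_feat)       # distill the prior into the network
    l_complete = chamfer(decoder(feat), gt)        # supervised term on paired samples
    return l_complete + w_distill * l_distill
```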
Recently, RGBD-based category-level 6D object pose estimation has achieved promising performance improvements; however, the requirement for depth information prohibits broader applications. To alleviate this problem, this paper proposes a novel approach named Object-Level Depth reconstruction Network (OLD-Net), which takes only RGB images as input for category-level 6D object pose estimation. We propose to directly predict object-level depth from a monocular RGB image by deforming a category-level shape prior into object-level depth and the canonical NOCS representation. Two novel modules, Normalized Global Position Hints (NGPH) and Shape-aware Decoupled Depth Reconstruction (SDDR), are introduced to learn high-fidelity object-level depth and delicate shape representations. Finally, the 6D object pose is solved by aligning the predicted canonical representation with the back-projected object-level depth. Extensive experiments on the challenging CAMERA25 and REAL275 datasets show that our model, though simple, achieves state-of-the-art performance.
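The final alignment step has a closed-form solution. Below is the standard Umeyama similarity alignment between predicted canonical (NOCS) coordinates and back-projected depth points, sketched with NumPy; it is the textbook procedure commonly used for this step, not OLD-Net's exact code.

```python
import numpy as np

def umeyama(src, dst):
    """Similarity transform s, R, t with dst ≈ s * R @ src + t; src, dst: (N, 3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                  # cross-covariance of the two sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                            # enforce a proper rotation (det = +1)
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Here src would be the predicted NOCS coordinates and dst the object-level depth back-projected into camera space with the intrinsics; s, R, t then give the scale and 6D pose.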
Recently, category-level 6D object pose estimation has achieved significant improvements driven by progress in reconstructing canonical 3D representations. However, the reconstruction quality of existing methods is still far from excellent. In this paper, we propose a novel Adversarial Canonical Representation Reconstruction Network named ACR-Pose, which consists of a Reconstructor and a Discriminator. The Reconstructor is primarily composed of two novel sub-modules: a Pose-Irrelevant Module (PIM) and a Relational Reconstruction Module (RRM). PIM learns canonical-related features that make the Reconstructor insensitive to rotation and translation, while RRM explores essential relational information between different input modalities to generate high-quality features. The Discriminator then guides the Reconstructor to generate realistic canonical representations, and the two are optimized through adversarial training. Experimental results on the prevalent NOCS-CAMERA and NOCS-REAL datasets demonstrate that our method achieves state-of-the-art performance.
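A minimal sketch of such an adversarial training step is given below, using a plain GAN loss plus an L1 reconstruction term as stand-ins; ACR-Pose's actual objectives and network interfaces may differ.

```python
import torch
import torch.nn.functional as F

def adversarial_step(recon, disc, opt_r, opt_d, inputs, real_nocs):
    fake = recon(inputs)                       # reconstructed canonical representation

    # Discriminator update: separate ground-truth canonical maps from reconstructions.
    d_real, d_fake = disc(real_nocs), disc(fake.detach())
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Reconstructor update: fool the discriminator while fitting the target shape.
    g_fake = disc(fake)
    g_loss = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake)) + \
             F.l1_loss(fake, real_nocs)
    opt_r.zero_grad(); g_loss.backward(); opt_r.step()
    return d_loss.item(), g_loss.item()
```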
Autonomous driving and Simultaneous Localization and Mapping (SLAM) are becoming increasingly important in the real world, and point cloud-based large-scale place recognition is a cornerstone of both. Previous place recognition methods have achieved acceptable performance by treating the task as a point cloud retrieval problem. However, all of them suffer from a common defect: they cannot handle rotated point clouds, which are common in practice, e.g., when viewpoints or motorcycle types are changed. To tackle this issue, we propose an Attentive Rotation Invariant Convolution (ARIConv) in this paper. ARIConv adopts three kinds of Rotation Invariant Features (RIFs): Spherical Signals (SS), Individual-Local Rotation Invariant Features (ILRIF), and Group-Local Rotation Invariant Features (GLRIF) to learn rotation-invariant convolutional kernels, which are robust for learning rotation-invariant point cloud features. Moreover, to highlight pivotal RIFs, we inject an attentive module into ARIConv that assigns different RIFs different importance when learning kernels. Finally, building on ARIConv, we construct a DenseNet-like network architecture to learn rotation-insensitive global descriptors for retrieval. We experimentally demonstrate that our model achieves state-of-the-art performance on the large-scale place recognition task when the point cloud scans are rotated, and achieves results comparable to most existing methods on the original non-rotated datasets.
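To see why features like these survive rotation, note that rigid rotations preserve distances and dot products. The toy function below computes two such invariants for a local neighborhood; it illustrates the general idea of handcrafted RIFs only, not the paper's exact SS/ILRIF/GLRIF definitions.

```python
import numpy as np

def local_rifs(center, neighbors):
    """Distance/angle features of a neighborhood; unchanged under any global rotation."""
    rel = neighbors - center                              # (K, 3) relative offsets
    dist = np.linalg.norm(rel, axis=1)                    # point-to-center distances
    centroid = rel.mean(axis=0)                           # local reference direction
    cos_ang = rel @ centroid / (dist * np.linalg.norm(centroid) + 1e-8)
    return np.stack([dist, cos_ang], axis=1)              # (K, 2) per-neighbor RIFs
```

Applying any rotation matrix R to both center and neighbors leaves dist and cos_ang unchanged, which is exactly the property a rotation-invariant kernel can be built on.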
Point cloud-based large-scale place recognition is fundamental to many applications, such as Simultaneous Localization and Mapping (SLAM). Although many models have been proposed and have achieved good performance by learning short-range local features, long-range contextual properties have often been neglected, and model size has become a bottleneck for wide deployment. To overcome these challenges, we propose a super-lightweight network model termed SVT-Net for large-scale place recognition. Specifically, on top of the highly efficient 3D Sparse Convolution (SP-Conv), an Atom-based Sparse Voxel Transformer (ASVT) and a Cluster-based Sparse Voxel Transformer (CSVT) are proposed to learn both short-range local features and long-range contextual features in this model. Combining ASVT and CSVT, SVT-Net achieves state-of-the-art performance on benchmark datasets in terms of both accuracy and speed with a super-light model size (0.9M). Meanwhile, two simplified versions of SVT-Net are introduced, which also achieve state-of-the-art performance and further reduce the model size to 0.8M and 0.4M, respectively.
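As a toy illustration of atom-level attention over occupied voxels, the module below applies standard multi-head self-attention to a densely batched set of sparse-voxel features; real SVT-Net operates on 3D sparse tensors via SP-Conv, so this is only a conceptual sketch.

```python
import torch
import torch.nn as nn

class AtomVoxelAttention(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, voxel_feats):
        """voxel_feats: (B, V, D) features of the V occupied (non-empty) voxels."""
        h, _ = self.attn(voxel_feats, voxel_feats, voxel_feats)
        return self.norm(voxel_feats + h)   # residual long-range context
```

Because attention runs only over occupied voxels rather than the full dense grid, the long-range context comes at a cost that scales with scene occupancy, which is consistent with the light model sizes reported above.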
Object pose detection and tracking has recently attracted increasing attention due to its wide applications in many areas, such as autonomous driving, robotics, and augmented reality. Among the methods for object pose detection and tracking, deep learning is the most promising, having shown better performance than the alternatives. However, there is a lack of surveys covering the latest developments in deep learning-based methods. Therefore, this paper presents a comprehensive review of recent progress in object pose detection and tracking along the deep learning technical route. To allow a thorough treatment, the scope of this paper is limited to methods taking monocular RGB/RGBD data as input, covering three major tasks: instance-level monocular object pose detection, category-level monocular object pose detection, and monocular object pose tracking. Metrics, datasets, and methods for both detection and tracking are presented in detail. Comparative results of current state-of-the-art methods on several publicly available datasets are also provided, together with insightful observations and inspiring future research directions.
Point cloud-based large-scale place recognition is fundamental to many applications, such as Simultaneous Localization and Mapping (SLAM). Though previous methods have achieved good performance by learning short-range local features, long-range contextual properties have long been neglected, and model size has become a bottleneck to further popularization. In this paper, we propose SVT-Net, a super-lightweight network, for large-scale place recognition. In our work, building on top of the highly efficient 3D Sparse Convolution (SP-Conv), an Atom-based Sparse Voxel Transformer (ASVT) and a Cluster-based Sparse Voxel Transformer (CSVT) are proposed to learn both short-range local features and long-range contextual features. Consisting of ASVT and CSVT, our SVT-Net achieves state-of-the-art performance in terms of both accuracy and speed with a super-light model size (0.9M). Two simplified versions of SVT-Net, named ASVT-Net and CSVT-Net, are also introduced; they likewise achieve state-of-the-art performance while further reducing the model size to 0.8M and 0.4M, respectively.