Ji Hou

NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection

Jul 27, 2023
Chenfeng Xu, Bichen Wu, Ji Hou, Sam Tsai, Ruilong Li, Jialiang Wang, Wei Zhan, Zijian He, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka

We present NeRF-Det, a novel method for indoor 3D detection with posed RGB images as input. Unlike existing indoor 3D detection methods that struggle to model scene geometry, our method makes novel use of NeRF in an end-to-end manner to explicitly estimate 3D geometry, thereby improving 3D detection performance. Specifically, to avoid the significant extra latency associated with per-scene optimization of NeRF, we introduce sufficient geometry priors to enhance the generalizability of the NeRF-MLP. Furthermore, we subtly connect the detection and NeRF branches through a shared MLP, enabling an efficient adaptation of NeRF to detection and yielding geometry-aware volumetric representations for 3D detection. Our method outperforms the state of the art by 3.9 mAP and 3.1 mAP on the ScanNet and ARKitScenes benchmarks, respectively. We provide extensive analysis to shed light on how NeRF-Det works. As a result of our joint-training design, NeRF-Det generalizes well to unseen scenes for object detection, view synthesis, and depth estimation without requiring per-scene optimization. Code is available at https://github.com/facebookresearch/NeRF-Det.
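
As a rough illustration of the geometry-aware volume idea, here is a minimal PyTorch sketch, assuming voxel features have already been aggregated from the posed views; the module name, dimensions, and opacity head are hypothetical simplifications, not the released implementation.

import torch
import torch.nn as nn

class GeometryAwareVolume(nn.Module):
    # Hypothetical sketch: a small MLP shared between the NeRF branch and the
    # detection branch predicts a per-voxel opacity, which down-weights
    # features in free space before they reach the 3D detection head.
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.shared_mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.density_head = nn.Linear(hidden, 1)

    def forward(self, voxel_feats):
        # voxel_feats: (N_voxels, feat_dim), e.g. averaged image features of
        # the views that observe each voxel (projection/sampling omitted).
        h = self.shared_mlp(voxel_feats)
        opacity = torch.sigmoid(self.density_head(h))   # (N_voxels, 1)
        return voxel_feats * opacity, opacity            # geometry-aware features

# Example: a 32^3 grid of 64-dim voxel features.
vol = GeometryAwareVolume()
feats, opacity = vol(torch.randn(32 ** 3, 64))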

* Accepted by ICCV 2023 

Rotation-Invariant Transformer for Point Cloud Matching

Mar 25, 2023
Hao Yu, Zheng Qin, Ji Hou, Mahdi Saleh, Dongsheng Li, Benjamin Busam, Slobodan Ilic

The intrinsic rotation invariance lies at the core of matching point clouds with handcrafted descriptors. However, it is largely set aside by recent deep matchers, which obtain rotation invariance extrinsically via data augmentation. Since a finite number of augmented rotations can never span the continuous SO(3) space, these methods are often unstable when facing rotations that are rarely seen in training. To address this, we introduce RoITr, a Rotation-Invariant Transformer, to cope with pose variations in the point cloud matching task. We contribute on both the local and global levels. At the local level, we introduce an attention mechanism embedded with Point Pair Feature (PPF)-based coordinates to describe pose-invariant geometry, upon which a novel attention-based encoder-decoder architecture is constructed. We further propose a global transformer with rotation-invariant cross-frame spatial awareness learned by self-attention, which significantly improves feature distinctiveness and makes the model robust to low overlap. Experiments on both rigid and non-rigid public benchmarks show that RoITr outperforms all state-of-the-art models by a considerable margin in low-overlap scenarios. In particular, when the rotations are enlarged on the challenging 3DLoMatch benchmark, RoITr surpasses existing methods by at least 13 and 5 percentage points in Inlier Ratio and Registration Recall, respectively.
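
For readers unfamiliar with Point Pair Features, the sketch below shows why PPF-based coordinates are rotation-invariant by construction: they consist only of a distance and relative angles, so rotating the points and normals together leaves them unchanged. The function names are illustrative, not RoITr's actual implementation.

import numpy as np

def angle(a, b, eps=1e-8):
    # Unsigned angle between two 3D vectors.
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def point_pair_feature(p1, n1, p2, n2):
    # Classic PPF: (distance, angle(n1, d), angle(n2, d), angle(n1, n2)).
    d = p2 - p1
    return np.array([np.linalg.norm(d), angle(n1, d), angle(n2, d), angle(n1, n2)])

# Sanity check: a global rotation does not change the feature.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
rng = np.random.default_rng(0)
p1, p2 = rng.random(3), rng.random(3)
n1 = rng.random(3); n1 /= np.linalg.norm(n1)
n2 = rng.random(3); n2 /= np.linalg.norm(n2)
assert np.allclose(point_pair_feature(p1, n1, p2, n2),
                   point_pair_feature(R @ p1, R @ n1, R @ p2, R @ n2))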

* Accepted to CVPR 2023 (camera-ready version) 

Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors

Feb 28, 2023
Ji Hou, Xiaoliang Dai, Zijian He, Angela Dai, Matthias Nießner

Current popular backbones in computer vision, such as Vision Transformers (ViT) and ResNets, are trained to perceive the world from 2D images. To embed 3D structural priors into these 2D backbones more effectively, we propose Mask3D, which leverages existing large-scale RGB-D data in self-supervised pre-training to embed such 3D priors into learned 2D feature representations. In contrast to traditional 3D contrastive learning paradigms that require 3D reconstructions or multi-view correspondences, our approach is simple: we formulate a pretext reconstruction task by masking RGB and depth patches in individual RGB-D frames. We demonstrate that Mask3D is particularly effective at embedding 3D priors into the powerful 2D ViT backbone, enabling improved representation learning for various scene understanding tasks, such as semantic segmentation, instance segmentation, and object detection. Experiments show that Mask3D notably outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks, with an improvement of +6.5% mIoU against the state-of-the-art Pri3D on ScanNet image semantic segmentation.
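
A minimal sketch of such a masked RGB-D pretext task is given below, assuming a single 224x224 RGB-D frame and 16x16 patches; the tiny encoder, the mean-pooled prediction, and all names are simplifications for illustration, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

def patchify(x, p=16):
    # (C, H, W) -> (num_patches, C * p * p), non-overlapping patches.
    C, H, W = x.shape
    x = x.unfold(1, p, p).unfold(2, p, p)                 # (C, H//p, W//p, p, p)
    return x.permute(1, 2, 0, 3, 4).reshape(-1, C * p * p)

class MaskedRGBDPretext(nn.Module):
    # Hypothetical sketch: encode only the visible RGB patches, then predict
    # the depth values of the masked patches (a real model would use a proper
    # decoder with mask tokens; a mean-pooled code keeps the sketch short).
    def __init__(self, p=16, dim=256):
        super().__init__()
        self.p = p
        self.rgb_embed = nn.Linear(3 * p * p, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.depth_head = nn.Linear(dim, p * p)

    def forward(self, rgb, depth, mask_ratio=0.5):
        rgb_p = patchify(rgb, self.p)                      # (N, 3*p*p)
        depth_p = patchify(depth, self.p)                  # (N, p*p)
        n = rgb_p.shape[0]
        idx = torch.randperm(n)
        n_keep = int(n * (1 - mask_ratio))
        keep, masked = idx[:n_keep], idx[n_keep:]
        tokens = self.rgb_embed(rgb_p[keep]).unsqueeze(0)  # (1, N_keep, dim)
        code = self.encoder(tokens).squeeze(0).mean(0, keepdim=True)
        pred = self.depth_head(code).expand(len(masked), -1)
        return F.mse_loss(pred, depth_p[masked])           # reconstruct masked depth

# Example: one 224x224 RGB-D frame.
loss = MaskedRGBDPretext()(torch.randn(3, 224, 224), torch.randn(1, 224, 224))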

* Accepted to CVPR 2023 

PCR-CG: Point Cloud Registration via Deep Color and Geometry

Feb 28, 2023
Yu Zhang, Junle Yu, Xiaolin Huang, Wenhui Zhou, Ji Hou

In this paper, we introduce PCR-CG, a novel 3D point cloud registration module that explicitly embeds color signals into the geometry representation. Unlike previous methods that use only the geometry representation, our module is specifically designed to effectively correlate color with geometry for the point cloud registration task. Our key contribution is a 2D-3D cross-modality learning algorithm that embeds deep features learned from color signals into the geometry representation. With our 2D-3D projection module, pixel features in a square region centered at each correspondence perceived from the images are effectively correlated with the point clouds. In this way, overlapping regions can be inferred not only from the point clouds but also from the texture appearance. Adding color is non-trivial: we compare against a variety of baselines for adding color to 3D, such as exhaustively adding per-pixel features or RGB values in an implicit manner. We take Predator [25] as the baseline method and incorporate our proposed module into it. To validate the effectiveness of the 2D features, we ablate different 2D pre-trained networks and show a positive correlation between the quality of the pre-trained weights and the task performance. Our experimental results indicate a significant improvement of 6.5% registration recall over the baseline method on the 3DLoMatch benchmark. We additionally apply our approach to other state-of-the-art methods and observe consistent improvements, such as 2.4% registration recall over GeoTransformer and 3.5% over CoFiNet. Our study reveals a significant advantage of correlating explicit deep color features with the point cloud in the registration task.
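
The core 2D-to-3D step can be sketched as follows: project each 3D point into the image with known intrinsics and pose, then sample the deep image features at that location so they can be concatenated with the point's geometric features. This is an assumed, simplified interface (function name, shapes, and the single-pixel bilinear sampling are illustrative; the paper gathers features from a square region around each correspondence).

import torch
import torch.nn.functional as F

def lift_image_features_to_points(points, feat_map, K, T_cam_from_world):
    # points: (N, 3) world coordinates
    # feat_map: (C, H, W) deep features of the corresponding image
    # K: (3, 3) intrinsics, T_cam_from_world: (4, 4) extrinsics
    N = points.shape[0]
    ones = torch.ones(N, 1, dtype=points.dtype)
    cam = (T_cam_from_world @ torch.cat([points, ones], dim=1).T).T[:, :3]
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)            # pixel coordinates
    C, H, W = feat_map.shape
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,        # x in [-1, 1]
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(1, N, 1, 2)
    sampled = F.grid_sample(feat_map.unsqueeze(0), grid, align_corners=True)
    return sampled.view(C, N).T                            # (N, C) per-point 2D features

# Example: 1000 points in front of the camera, a 64-channel feature map, identity pose.
pts = torch.rand(1000, 3) + torch.tensor([0.0, 0.0, 1.0])
K = torch.tensor([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
feats2d = lift_image_features_to_points(pts, torch.randn(64, 240, 320), K, torch.eye(4))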

* Accepted to ECCV 2022; code at https://github.com/Gardlin/PCR-CG 

RIGA: Rotation-Invariant and Globally-Aware Descriptors for Point Cloud Registration

Sep 27, 2022
Hao Yu, Ji Hou, Zheng Qin, Mahdi Saleh, Ivan Shugurov, Kai Wang, Benjamin Busam, Slobodan Ilic

Successful point cloud registration relies on accurate correspondences established upon powerful descriptors. However, existing neural descriptors either leverage a rotation-variant backbone whose performance declines under large rotations, or encode local geometry that is less distinctive. To address this issue, we introduce RIGA to learn descriptors that are Rotation-Invariant by design and Globally-Aware. From the Point Pair Features (PPFs) of sparse local regions, rotation-invariant local geometry is encoded into geometric descriptors. Global awareness of 3D structures and geometric context is subsequently incorporated, both in a rotation-invariant fashion. More specifically, 3D structures of the whole frame are first represented by our global PPF signatures, from which structural descriptors are learned to help geometric descriptors sense the 3D world beyond local regions. Geometric context from the whole scene is then globally aggregated into descriptors. Finally, the description of sparse regions is interpolated to dense point descriptors, from which correspondences are extracted for registration. To validate our approach, we conduct extensive experiments on both object- and scene-level data. With large rotations, RIGA surpasses the state-of-the-art methods by a margin of 8° in terms of the Relative Rotation Error on ModelNet40 and improves the Feature Matching Recall by at least 5 percentage points on 3DLoMatch.
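
To make the global, rotation-invariant signature idea concrete, here is a simplified NumPy sketch that describes each point by histograms of rotation-invariant pair statistics (distances and normal angles) against every other point in the frame. The bin counts, ranges, and histogram form are assumptions for illustration, not RIGA's actual global PPF signature.

import numpy as np

def global_ppf_signature(points, normals, bins=8, max_dist=2.0):
    # points: (N, 3), normals: (N, 3) unit normals.
    # Each point gets a histogram of distances and normal angles to all other
    # points; both quantities are unchanged by a global rotation of the frame.
    n = points.shape[0]
    diff = points[None, :, :] - points[:, None, :]               # (N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)                         # (N, N)
    cos_nn = np.clip(normals @ normals.T, -1.0, 1.0)             # (N, N)
    sig = np.zeros((n, 2 * bins))
    for i in range(n):
        others = np.arange(n) != i
        h_d, _ = np.histogram(dist[i, others], bins=bins, range=(0, max_dist), density=True)
        h_a, _ = np.histogram(np.arccos(cos_nn[i, others]), bins=bins, range=(0, np.pi), density=True)
        sig[i] = np.concatenate([h_d, h_a])
    return sig                                                   # (N, 2 * bins)

# Example: 100 points with random unit normals.
rng = np.random.default_rng(0)
pts = rng.random((100, 3))
nrm = rng.standard_normal((100, 3))
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)
signature = global_ppf_signature(pts, nrm)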


Panoptic 3D Scene Reconstruction From a Single RGB Image

Nov 03, 2021
Manuel Dahnert, Ji Hou, Matthias Nießner, Angela Dai

Understanding 3D scenes from a single image is fundamental to a wide variety of tasks, such as robotics, motion planning, and augmented reality. Existing works in 3D perception from a single RGB image tend to focus on geometric reconstruction only, or on geometric reconstruction with semantic segmentation or instance segmentation. Inspired by 2D panoptic segmentation, we propose to unify the tasks of geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation into the task of panoptic 3D scene reconstruction: from a single RGB image, predict the complete geometric reconstruction of the scene in the camera frustum of the image, along with semantic and instance segmentations. We thus propose a new approach for holistic 3D scene understanding from a single RGB image that learns to lift and propagate 2D features from the input image to a 3D volumetric scene representation. We demonstrate that this holistic view of joint scene reconstruction, semantic segmentation, and instance segmentation is beneficial over treating the tasks independently, and it outperforms alternative approaches.
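
To illustrate what a panoptic 3D labeling looks like once semantics and instances have been predicted on the output volume, here is a small sketch that fuses a per-voxel semantic map and instance map into a single panoptic id per voxel; the encoding scheme (class_id * 1000 + instance_id) and the argument names are assumptions, not the paper's exact format.

import numpy as np

def fuse_panoptic(semantic, instance, thing_classes):
    # semantic: (D, H, W) integer class ids per voxel
    # instance: (D, H, W) integer instance ids per voxel (0 = no instance)
    # thing_classes: iterable of class ids treated as countable objects
    panoptic = semantic.astype(np.int64) * 1000            # stuff voxels: class only
    is_thing = np.isin(semantic, list(thing_classes)) & (instance > 0)
    panoptic[is_thing] += instance[is_thing]                # thing voxels: class + instance
    return panoptic

# Example: a 2x2x2 volume with wall (class 1, stuff) and one chair instance (class 3).
sem = np.array([1, 1, 3, 3, 1, 3, 1, 1]).reshape(2, 2, 2)
ins = np.array([0, 0, 1, 1, 0, 1, 0, 0]).reshape(2, 2, 2)
print(fuse_panoptic(sem, ins, thing_classes={3}))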

* Video: https://youtu.be/YVxRNHmd5SA 

Pri3D: Can 3D Priors Help 2D Representation Learning?

Apr 22, 2021
Ji Hou, Saining Xie, Benjamin Graham, Angela Dai, Matthias Nießner

Recent advances in 3D perception have shown impressive progress in understanding geometric structures of 3D shapes and even scenes. Inspired by these advances in geometric understanding, we aim to imbue image-based perception with representations learned under geometric constraints. We introduce an approach to learn view-invariant, geometry-aware representations for network pre-training, based on multi-view RGB-D data, that can then be effectively transferred to downstream 2D tasks. We propose to employ contrastive learning under both multi-view image constraints and image-geometry constraints to encode 3D priors into learned 2D representations. This results not only in improvement over 2D-only representation learning on the image-based tasks of semantic segmentation, instance segmentation, and object detection on real-world indoor datasets, but also provides significant improvement in the low-data regime. We show a significant improvement of 6.0% on semantic segmentation with full data, as well as 11.9% with 20% of the data, against baselines on ScanNet.
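
The multi-view constraint can be sketched as a pixel-level InfoNCE loss over corresponding pixels, where the correspondences come from the known RGB-D geometry: matched pixels across two views are positives and all other pairings in the batch are negatives. The names and temperature value are illustrative; this is a generic PointInfoNCE-style loss, not the exact training objective.

import torch
import torch.nn.functional as F

def pixel_infonce(feat_a, feat_b, tau=0.07):
    # feat_a[i] and feat_b[i] are features of the same 3D surface point seen
    # in two different views (correspondences from the RGB-D geometry);
    # every other pairing in the batch acts as a negative.
    a = F.normalize(feat_a, dim=1)
    b = F.normalize(feat_b, dim=1)
    logits = a @ b.T / tau                                   # (N, N) cosine similarities
    targets = torch.arange(a.shape[0], device=a.device)
    return F.cross_entropy(logits, targets)

# Example: 512 matched pixels with 128-dim features from two views.
loss = pixel_infonce(torch.randn(512, 128), torch.randn(512, 128))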


Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts

Dec 16, 2020
Ji Hou, Benjamin Graham, Matthias Nießner, Saining Xie

The rapid progress in 3D scene understanding has come with a growing demand for data; however, collecting and annotating 3D scenes (e.g. point clouds) is notoriously hard. For example, the number of scenes (e.g. indoor rooms) that can be accessed and scanned may be limited; even given sufficient data, acquiring 3D labels (e.g. instance masks) requires intensive human labor. In this paper, we explore data-efficient learning for 3D point clouds. As a first step in this direction, we propose Contrastive Scene Contexts, a 3D pre-training method that makes use of both point-level correspondences and spatial contexts in a scene. Our method achieves state-of-the-art results on a suite of benchmarks where training data or labels are scarce. Our study reveals that exhaustive labeling of 3D point clouds might be unnecessary; remarkably, on ScanNet, even using 0.1% of the point labels, we still achieve 89% (instance segmentation) and 96% (semantic segmentation) of the baseline performance obtained with full annotations.
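
A hedged sketch of combining point correspondences with spatial contexts is shown below: each matched point is assigned to a spatial partition (an azimuth sector crossed with a distance shell relative to the scene centroid, both assumed choices for illustration), and a contrastive loss is computed inside each partition before averaging. This is not the paper's exact partitioning scheme.

import math
import torch
import torch.nn.functional as F

def scene_context_partitions(xyz, n_angle=4, r_split=1.5):
    # Assign each matched point a partition id from its position relative to
    # the scene centroid: an azimuth sector crossed with an inner/outer shell.
    rel = xyz - xyz.mean(dim=0, keepdim=True)
    az = torch.atan2(rel[:, 1], rel[:, 0])                           # [-pi, pi]
    a_bin = ((az + math.pi) / (2 * math.pi) * n_angle).long().clamp(max=n_angle - 1)
    d_bin = (rel.norm(dim=1) > r_split).long()                       # 0 = inner, 1 = outer
    return a_bin * 2 + d_bin                                         # (N,) partition ids

def partitioned_infonce(feat_a, feat_b, part_id, tau=0.07):
    # Contrastive loss computed separately inside each spatial partition.
    losses = []
    for p in part_id.unique():
        m = part_id == p
        if int(m.sum()) < 2:
            continue
        a = F.normalize(feat_a[m], dim=1)
        b = F.normalize(feat_b[m], dim=1)
        logits = a @ b.T / tau
        losses.append(F.cross_entropy(logits, torch.arange(int(m.sum()), device=logits.device)))
    return torch.stack(losses).mean()

# Example: 256 correspondences with 32-dim features.
xyz = torch.rand(256, 3) * 4
loss = partitioned_infonce(torch.randn(256, 32), torch.randn(256, 32),
                           scene_context_partitions(xyz))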

* project page: https://sekunde.github.io/project_efficient/ 

RfD-Net: Point Scene Understanding by Semantic Instance Reconstruction

Nov 30, 2020
Yinyu Nie, Ji Hou, Xiaoguang Han, Matthias Nießner

Semantic scene understanding from point clouds is particularly challenging, as the points reflect only a sparse sampling of the underlying 3D geometry. Previous works often convert point clouds into regular grids (e.g. voxels or bird's-eye-view images) and resort to grid-based convolutions for scene understanding. In this work, we introduce RfD-Net, which jointly detects and reconstructs dense object surfaces directly from raw point clouds. Instead of representing scenes with regular grids, our method leverages the sparsity of point cloud data and focuses on predicting shapes that are recognized with high objectness. With this design, we decouple instance reconstruction into global object localization and local shape prediction. This not only eases the difficulty of learning 2D manifold surfaces from sparse 3D space; the point clouds in each object proposal also convey shape details that support implicit function learning for reconstructing high-resolution surfaces. Our experiments indicate that instance detection and reconstruction have complementary effects, with the shape prediction head consistently improving object detection with modern 3D proposal network backbones. The qualitative and quantitative evaluations further demonstrate that our approach consistently outperforms the state of the art, improving mesh IoU by over 11 points in object reconstruction.
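
A minimal sketch of the local shape prediction stage is given below, assuming an occupancy-style implicit function: an MLP maps a detected object's latent code plus a 3D query point in the proposal's canonical frame to an occupancy probability, so a mesh of any resolution can be extracted afterwards (e.g. with marching cubes). The class name, dimensions, and conditioning scheme are hypothetical.

import torch
import torch.nn as nn

class ImplicitShapeHead(nn.Module):
    # Hypothetical sketch: occupancy MLP conditioned on a per-proposal latent code.
    def __init__(self, latent_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, latent, query_xyz):
        # latent: (B, latent_dim) one code per detected object proposal
        # query_xyz: (B, Q, 3) query points inside each proposal box
        B, Q, _ = query_xyz.shape
        z = latent.unsqueeze(1).expand(B, Q, latent.shape[-1])
        occ = self.net(torch.cat([z, query_xyz], dim=-1))
        return torch.sigmoid(occ).squeeze(-1)              # (B, Q) occupancy in [0, 1]

# Example: 2 proposals, 1024 query points each.
head = ImplicitShapeHead()
occupancy = head(torch.randn(2, 128), torch.rand(2, 1024, 3))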
