Angela Dai

Language-Grounded Indoor 3D Semantic Segmentation in the Wild

Apr 16, 2022

David Rozenberszki, Or Litany, Angela Dai

Abstract: Recent advances in 3D semantic segmentation with deep neural networks have shown remarkable success, with rapid performance increase on available datasets. However, current 3D semantic segmentation benchmarks contain only a small number of categories -- fewer than 30 for ScanNet and SemanticKITTI, for instance -- which is not enough to reflect the diversity of real environments (e.g., semantic image understanding covers hundreds to thousands of classes). Thus, we propose to study a larger vocabulary for 3D semantic segmentation with a new extended benchmark on ScanNet data with 200 class categories, an order of magnitude more than previously studied. This large number of class categories also induces a large natural class imbalance, both of which are challenging for existing 3D semantic segmentation methods. To learn more robust 3D features in this context, we propose a language-driven pre-training method to encourage learned 3D features that might have limited training examples to lie close to their pre-trained text embeddings. Extensive experiments show that our approach consistently outperforms state-of-the-art 3D pre-training for 3D semantic segmentation on our proposed benchmark (+9% relative mIoU), including limited-data scenarios with +25% relative mIoU using only 5% annotations.
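The abstract reports gains as relative mIoU, which is easy to confuse with an absolute difference in percentage points. The following is a minimal sketch of how mean intersection-over-union and a relative gain are commonly computed; the function names, the toy labels, and the baseline/improved scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement, e.g. 0.09 means +9% relative to the baseline."""
    return (improved - baseline) / baseline


# Toy per-point labels (hypothetical), three classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 0, 1, 2, 2, 2]
score = mean_iou(pred, target, num_classes=3)

# Hypothetical scores: going from 0.200 to 0.218 mIoU is a +9% relative gain,
# although the absolute difference is only 1.8 points.
gain = relative_gain(0.20, 0.218)
```

Averaging per-class IoU weights every class equally, so rare classes count as much as frequent ones, which matters under the class imbalance the abstract describes.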

* 23 pages, 8 figures, project page: https://rozdavid.github.io/scannet200

Texturify: Generating Textures on 3D Shape Surfaces

Apr 05, 2022

Yawar Siddiqui, Justus Thies, Fangchang Ma, Qi Shan, Matthias Nießner, Angela Dai

Abstract: Texture cues on 3D objects are key to compelling visual representations, with the possibility to create high visual fidelity with inherent spatial consistency across different views. Since the availability of textured 3D shapes remains very limited, learning a 3D-supervised data-driven method that predicts a texture based on the 3D input is very challenging. We thus propose Texturify, a GAN-based method that leverages a 3D shape dataset of an object class and learns to reproduce the distribution of appearances observed in real images by generating high-quality textures. In particular, our method does not require any 3D color supervision or correspondence between shape geometry and images to learn the texturing of 3D objects. Texturify operates directly on the surface of the 3D objects by introducing face convolutional operators on a hierarchical 4-RoSy parametrization to generate plausible object-specific textures. Employing differentiable rendering and adversarial losses that critique individual views and consistency across views, we effectively learn the high-quality surface texturing distribution from real-world images. Experiments on car and chair shape collections show that our approach outperforms state of the art by an average of 22% in FID score.

* Project Page: https://nihalsid.github.io/texturify

Weakly-Supervised End-to-End CAD Retrieval to Scan Objects

Mar 24, 2022

Tim Beyer, Angela Dai

Abstract: CAD model retrieval to real-world scene observations has shown strong promise as a basis for 3D perception of objects and a clean, lightweight mesh-based scene representation; however, current approaches to retrieve CAD models to a query scan rely on expensive manual annotations of 1:1 associations of CAD-scan objects, which typically contain strong lower-level geometric differences. We thus propose a new weakly-supervised approach to retrieve semantically and structurally similar CAD models to a query 3D scanned scene without requiring any CAD-scan associations, and only object detection information as oriented bounding boxes. Our approach leverages a fully-differentiable top-$k$ retrieval layer, enabling end-to-end training guided by geometric and perceptual similarity of the top retrieved CAD models to the scan queries. We demonstrate that our weakly-supervised approach can outperform fully-supervised retrieval methods on challenging real-world ScanNet scans, and maintain robustness for unseen class categories, achieving significantly improved performance over fully-supervised state of the art in zero-shot CAD retrieval.

* Accompanying video at https://youtu.be/3bCUMxpscdQ

Neural Part Priors: Learning to Optimize Part-Based Object Completion in RGB-D Scans

Mar 17, 2022

Alexey Bokhovkin, Angela Dai

Abstract: 3D object recognition has seen significant advances in recent years, showing impressive performance on real-world 3D scan benchmarks, but lacking in object part reasoning, which is fundamental to higher-level scene understanding such as inter-object similarities or object functionality. Thus, we propose to leverage large-scale synthetic datasets of 3D shapes annotated with part information to learn Neural Part Priors (NPPs), optimizable spaces characterizing geometric part priors. Crucially, we can optimize over the learned part priors in order to fit to real-world scanned 3D scenes at test time, enabling robust part decomposition of the real objects in these scenes that also estimates the complete geometry of the object while fitting accurately to the observed real geometry. Moreover, this enables global optimization over geometrically similar detected objects in a scene, which often share strong geometric commonalities, enabling scene-consistent part decompositions. Experiments on the ScanNet dataset demonstrate that NPPs significantly outperform state of the art in part decomposition and object completion in real-world scenes.

SPAMs: Structured Implicit Parametric Models

Jan 20, 2022

Pablo Palafox, Nikolaos Sarafianos, Tony Tung, Angela Dai

Abstract: Parametric 3D models have played a fundamental role in modeling deformable objects, such as human bodies, faces, and hands; however, the construction of such parametric models requires significant manual intervention and domain expertise. Recently, neural implicit 3D representations have shown great expressibility in capturing 3D shape geometry. We observe that deformable object motion is often semantically structured, and thus propose to learn Structured-implicit PArametric Models (SPAMs) as a deformable object representation that structurally decomposes non-rigid object motion into part-based disentangled representations of shape and pose, with each being represented by deep implicit functions. This enables a structured characterization of object movement, with part decomposition characterizing a lower-dimensional space in which we can establish coarse motion correspondence. In particular, we can leverage the part decompositions at test time to fit to new depth sequences of unobserved shapes, by establishing part correspondences between the input observation and our learned part spaces; this guides a robust joint optimization between the shape and pose of all parts, even under dramatic motion sequences. Experiments demonstrate that our part-aware shape and pose understanding leads to state-of-the-art performance in reconstruction and tracking of depth sequences of complex deforming object motion. We plan to release models to the public at https://pablopalafox.github.io/spams.

* Project page: https://pablopalafox.github.io/spams/ - Video: https://youtu.be/ChdjHNGgrzI

4DContrast: Contrastive Learning with Dynamic Correspondences for 3D Scene Understanding

Dec 06, 2021

Yujin Chen, Matthias Nießner, Angela Dai

Abstract: We present a new approach to instill 4D dynamic object priors into learned 3D representations by unsupervised pre-training. We observe that dynamic movement of an object through an environment provides important cues about its objectness, and thus propose to imbue learned 3D representations with such dynamic understanding, which can then be effectively transferred to improve performance in downstream 3D semantic scene understanding tasks. We propose a new data augmentation scheme leveraging synthetic 3D shapes moving in static 3D environments, and employ contrastive learning under 3D-4D constraints that encode 4D invariances into the learned 3D representations. Experiments demonstrate that our unsupervised representation learning results in improvement in downstream 3D semantic segmentation, object detection, and instance segmentation tasks, and moreover, notably improves performance in data-scarce scenarios.
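The abstract does not spell out the contrastive objective, so the following is only a generic illustration of contrastive learning over corresponding features, using an InfoNCE-style loss in NumPy. The function name `info_nce`, the temperature value, and the random toy features are assumptions for illustration; the paper's 3D-4D constraints are not reproduced here.

```python
import numpy as np


def info_nce(anchors: np.ndarray, positives: np.ndarray, temperature: float = 0.07) -> float:
    """InfoNCE-style loss: row i of `anchors` should match row i of `positives`.

    All other rows of `positives` act as negatives for anchor i.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # cross-entropy on the diagonal


# Toy features (hypothetical): 8 corresponding pairs of 16-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
matched = info_nce(feats, feats + 0.01 * rng.normal(size=(8, 16)))
mismatched = info_nce(feats, np.roll(feats, 1, axis=0))
```

Correctly paired features yield a lower loss than deliberately misaligned pairs, which is the signal such a pre-training objective uses to shape the feature space.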

* Video: https://youtu.be/qhGhWZmJq3U

ROCA: Robust CAD Model Retrieval and Alignment from a Single Image

Dec 03, 2021

Can Gümeli, Angela Dai, Matthias Nießner

Abstract: We present ROCA, a novel end-to-end approach that retrieves and aligns 3D CAD models from a shape database to a single input image. This enables 3D perception of an observed scene from a 2D RGB observation, characterized as a lightweight, compact, clean CAD representation. Core to our approach is our differentiable alignment optimization based on dense 2D-3D object correspondences and Procrustes alignment. ROCA can thus provide a robust CAD alignment while simultaneously informing CAD retrieval by leveraging the 2D-3D correspondences to learn geometrically similar CAD models. Experiments on challenging, real-world imagery from ScanNet show that ROCA significantly improves on state of the art, from 9.5% to 17.6% in retrieval-aware CAD alignment accuracy.
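For readers unfamiliar with Procrustes alignment, here is a minimal sketch of the classical rigid version (often called the Kabsch algorithm) on already-matched 3D point pairs. It is not ROCA's differentiable, correspondence-weighted formulation; the function name `procrustes_rigid` and the synthetic test points are assumptions for illustration.

```python
import numpy as np


def procrustes_rigid(src: np.ndarray, dst: np.ndarray):
    """Least-squares rotation R and translation t such that dst ≈ src @ R.T + t.

    `src` and `dst` are (N, 3) arrays of corresponding points.
    """
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_mean).T @ (dst - dst_mean)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t


# Synthetic check (hypothetical data): rotate about z by 0.3 rad and translate.
rng = np.random.default_rng(0)
src = rng.normal(size=(10, 3))
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -1.0, 2.0])
dst = src @ R_true.T + t_true
R_est, t_est = procrustes_rigid(src, dst)
```

Because the solution is closed-form and built from an SVD, it can sit inside a network and be differentiated through, which is one reason Procrustes-style solvers are attractive for end-to-end alignment.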

Pose2Room: Understanding 3D Scenes from Human Activities

Dec 01, 2021

Yinyu Nie, Angela Dai, Xiaoguang Han, Matthias Nießner

Abstract: With wearable IMU sensors, one can estimate human poses from wearable devices without requiring visual input \cite{von2017sparse}. In this work, we pose the question: Can we reason about object structure in real-world environments solely from human trajectory information? Crucially, we observe that human motion and interactions tend to give strong information about the objects in a scene -- for instance, a person sitting indicates the likely presence of a chair or sofa. To this end, we propose P2R-Net to learn a probabilistic 3D model of the objects in a scene characterized by their class categories and oriented 3D bounding boxes, based on an input observed human trajectory in the environment. P2R-Net models the probability distribution of object class as well as a deep Gaussian mixture model for object boxes, enabling sampling of multiple, diverse, likely modes of object configurations from an observed human trajectory. In our experiments we demonstrate that P2R-Net can effectively learn multi-modal distributions of likely objects for human motions, and produce a variety of plausible object structures of the environment, even without any visual information.
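To illustrate why a mixture model yields multiple plausible modes, the sketch below samples box parameters from a small diagonal-covariance Gaussian mixture. In P2R-Net the mixture parameters would be predicted by a network from the trajectory; here they are fixed, made-up numbers, and the 6-value box layout (center x, y, z and size x, y, z) and the function name `sample_boxes` are assumptions for illustration.

```python
import numpy as np


def sample_boxes(weights, means, stds, num_samples, rng):
    """Draw box parameters from a diagonal-covariance Gaussian mixture.

    Returns the samples and the index of the mixture component each came from.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                 # normalize mixing weights
    comp = rng.choice(len(weights), size=num_samples, p=weights)
    noise = rng.standard_normal((num_samples, means.shape[1]))
    return means[comp] + stds[comp] * noise, comp


# Hypothetical two-mode mixture: a chair-sized box and a sofa-sized box
# at the same location, each given as (cx, cy, cz, sx, sy, sz) in meters.
means = np.array([
    [1.0, 2.0, 0.45, 0.5, 0.5, 0.9],
    [1.0, 2.0, 0.40, 2.0, 0.9, 0.8],
])
stds = np.full_like(means, 0.05)
rng = np.random.default_rng(0)
boxes, comp = sample_boxes([0.6, 0.4], means, stds, num_samples=5, rng=rng)
```

Each draw first picks a mode and then perturbs it, so repeated sampling produces distinct but individually coherent box hypotheses rather than a blurry average of the two.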

* Project page: https://yinyunie.github.io/pose2room-page/ - Video: https://www.youtube.com/watch?v=MFfKTcvbM5o

Panoptic 3D Scene Reconstruction From a Single RGB Image

Nov 03, 2021

Manuel Dahnert, Ji Hou, Matthias Nießner, Angela Dai

Abstract: Understanding 3D scenes from a single image is fundamental to a wide variety of tasks, such as robotics, motion planning, or augmented reality. Existing works in 3D perception from a single RGB image tend to focus on geometric reconstruction only, or geometric reconstruction with semantic segmentation or instance segmentation. Inspired by 2D panoptic segmentation, we propose to unify the tasks of geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation into the task of panoptic 3D scene reconstruction - from a single RGB image, predicting the complete geometric reconstruction of the scene in the camera frustum of the image, along with semantic and instance segmentations. We thus propose a new approach for holistic 3D scene understanding from a single RGB image which learns to lift and propagate 2D features from an input image to a 3D volumetric scene representation. We demonstrate that this holistic view of joint scene reconstruction, semantic, and instance segmentation is beneficial over treating the tasks independently, thus outperforming alternative approaches.

* Video: https://youtu.be/YVxRNHmd5SA

Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval from a Single Image

Aug 20, 2021

Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai

Abstract: 3D perception of object shapes from RGB image input is fundamental towards semantic scene understanding, grounding image-based perception in our spatially 3-dimensional real-world environments. To achieve a mapping between image views of objects and 3D shapes, we leverage CAD model priors from existing large-scale databases, and propose a novel approach towards constructing a joint embedding space between 2D images and 3D CAD models in a patch-wise fashion -- establishing correspondences between patches of an image view of an object and patches of CAD geometry. This enables part similarity reasoning for retrieving similar CADs to a new image view without exact matches in the database. Our patch embedding provides more robust CAD retrieval for shape estimation in our end-to-end estimation of CAD model shape and pose for detected objects in a single input image. Experiments on in-the-wild, complex imagery from ScanNet show that our approach is more robust than state of the art in real-world scenarios without any exact CAD matches.

* To appear at ICCV 2021 (IEEE/CVF International Conference on Computer Vision)
