Julian Straub

Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection

Oct 02, 2023
Yiming Xie, Huaizu Jiang, Georgia Gkioxari, Julian Straub

We present PARQ - a multi-view 3D object detector with transformer and pixel-aligned recurrent queries. Unlike previous works that use learnable features or only encode 3D point positions as queries in the decoder, PARQ leverages appearance-enhanced queries initialized from reference points in 3D space and updates their 3D location with recurrent cross-attention operations. Incorporating pixel-aligned features and cross-attention enables the model to encode the necessary 3D-to-2D correspondences and capture global contextual information of the input images. PARQ outperforms prior best methods on the ScanNet and ARKitScenes datasets, learns and detects faster, is more robust to distribution shifts in reference points, can leverage additional input views without retraining, and can adapt inference compute by changing the number of recurrent iterations.
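
Below is a minimal PyTorch sketch of the recurrent, pixel-aligned query update described above: each iteration projects the current 3D reference points into every view, samples pixel-aligned features at the projections, attends to the image tokens, and regresses a 3D offset. The tensor shapes, projection model, and module sizes are illustrative assumptions, not the released PARQ implementation (see the project page for that).

```python
# Illustrative sketch (assumed shapes and modules, not the released PARQ code)
# of a recurrent, pixel-aligned query decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentQueryDecoder(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.pos_enc = nn.Linear(3, dim)              # encode 3D point positions
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.offset_head = nn.Linear(dim, 3)          # predicts a 3D location update

    def sample_pixel_aligned(self, feat_maps, points, K, T):
        # feat_maps: (V, dim, H, W) per-view features; points: (Q, 3) world points;
        # K: (V, 3, 3) intrinsics; T: (V, 3, 4) world-to-camera extrinsics.
        pts_h = F.pad(points, (0, 1), value=1.0)                   # (Q, 4) homogeneous
        cam = torch.einsum('vij,qj->vqi', T, pts_h)                # (V, Q, 3)
        pix = torch.einsum('vij,vqj->vqi', K, cam)
        uv = pix[..., :2] / pix[..., 2:3].clamp(min=1e-5)          # pixel coordinates
        H, W = feat_maps.shape[-2:]
        grid = torch.stack([uv[..., 0] / (W - 1) * 2 - 1,          # normalize to [-1, 1]
                            uv[..., 1] / (H - 1) * 2 - 1], dim=-1)
        sampled = F.grid_sample(feat_maps, grid.unsqueeze(2), align_corners=True)
        return sampled.squeeze(-1).permute(0, 2, 1).mean(0)        # (Q, dim), view-averaged

    def forward(self, feat_maps, tokens, points, K, T, iters=3):
        # tokens: (1, N, dim) flattened multi-view image tokens for cross-attention.
        for _ in range(iters):
            q = self.pos_enc(points) + self.sample_pixel_aligned(feat_maps, points, K, T)
            q, _ = self.cross_attn(q.unsqueeze(0), tokens, tokens)
            points = points + self.offset_head(q.squeeze(0))       # recurrent 3D update
        return points
```

In the full detector, classification and box heads would read out the final queries; only the recurrent point-refinement loop is shown here.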

* ICCV 2023. Project page: https://ymingxie.github.io/parq 

Project Aria: A New Tool for Egocentric Multi-Modal AI Research

Sep 12, 2023
Kiran Somasundaram, Jing Dong, Huixuan Tang, Julian Straub, Mingfei Yan, Michael Goesele, Jakob Julian Engel, Renzo De Nardi, Richard Newcombe

Egocentric, multi-modal data as available on future augmented reality (AR) devices provides unique challenges and opportunities for machine perception. These future devices will need to be all-day wearable in a socially acceptable form factor to support always-available, context-aware and personalized AI applications. Our team at Meta Reality Labs Research built the Aria device, an egocentric, multi-modal data recording and streaming device, with the goal of fostering and accelerating research in this area. In this paper, we describe the Aria device hardware, including its sensor configuration, and the corresponding software tools that enable recording and processing of such data.

OrienterNet: Visual Localization in 2D Public Maps with Neural Matching

Apr 04, 2023
Paul-Edouard Sarlin, Daniel DeTone, Tsun-Yi Yang, Armen Avetisyan, Julian Straub, Tomasz Malisiewicz, Samuel Rota Bulo, Richard Newcombe, Peter Kontschieder, Vasileios Balntas

Humans can orient themselves in their 3D environments using simple 2D maps. In contrast, algorithms for visual localization mostly rely on complex 3D point clouds that are expensive to build, store, and maintain over time. We bridge this gap by introducing OrienterNet, the first deep neural network that can localize an image with sub-meter accuracy using the same 2D semantic maps that humans use. OrienterNet estimates the location and orientation of a query image by matching a neural Bird's-Eye View with open and globally available maps from OpenStreetMap, enabling anyone to localize anywhere such maps are available. OrienterNet is supervised only by camera poses but learns to perform semantic matching with a wide range of map elements in an end-to-end manner. To enable this, we introduce a large crowd-sourced dataset of images captured across 12 cities from the diverse viewpoints of cars, bikes, and pedestrians. OrienterNet generalizes to new datasets and pushes the state of the art in both robotics and AR scenarios. The code and trained model will be released publicly.
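
As a rough illustration of the matching step described above (not the authors' probabilistic formulation), the sketch below correlates a neural BEV template against encoded map features over a discrete set of rotations and all translations. The shapes, the rotation sampling, and the argmax readout are assumptions.

```python
# Toy sketch: exhaustive rotation + translation matching of a BEV template
# against a map feature tile. Assumes the template fits inside the tile.
import math
import torch
import torch.nn.functional as F

def rotate_template(bev, angle_deg):
    # bev: (1, C, h, w). Rotate about the template center with an affine grid.
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    rot = torch.tensor([[c, -s, 0.0], [s, c, 0.0]]).unsqueeze(0)
    grid = F.affine_grid(rot, list(bev.shape), align_corners=False)
    return F.grid_sample(bev, grid, align_corners=False)

def localize(bev, map_feat, num_rotations=64):
    # bev: (1, C, h, w) neural bird's-eye-view features of the query image.
    # map_feat: (1, C, H, W) features of the encoded 2D map tile.
    scores = []
    for k in range(num_rotations):
        template = rotate_template(bev, 360.0 * k / num_rotations)
        # cross-correlation over all translations == conv2d with the template as kernel
        scores.append(F.conv2d(map_feat, template))
    scores = torch.cat(scores, dim=1)                 # (1, R, H-h+1, W-w+1)
    R, Hs, Ws = scores.shape[1:]
    best = scores.flatten().argmax().item()
    r, rem = divmod(best, Hs * Ws)
    y, x = divmod(rem, Ws)
    return x, y, 360.0 * r / num_rotations            # map position (u, v) and heading
```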

* CVPR 2023 

Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild

Jul 21, 2022
Garrick Brazil, Julian Straub, Nikhila Ravi, Justin Johnson, Georgia Gkioxari

Recognizing scenes and objects in 3D from a single image is a longstanding goal of computer vision with applications in robotics and AR/VR. For 2D recognition, large datasets and scalable solutions have led to unprecedented advances. In 3D, existing benchmarks are small in size, and approaches specialize in a few object categories and specific domains, e.g. urban driving scenes. Motivated by the success of 2D recognition, we revisit the task of 3D object detection by introducing a large benchmark, called Omni3D. Omni3D re-purposes and combines existing datasets, resulting in 234k images annotated with more than 3 million instances and 97 categories. 3D detection at such scale is challenging due to variations in camera intrinsics and the rich diversity of scene and object types. We propose a model, called Cube R-CNN, designed to generalize across camera and scene types with a unified approach. We show that Cube R-CNN outperforms prior works on the larger Omni3D and existing benchmarks. Finally, we show that Omni3D is a powerful dataset for 3D object recognition: it improves single-dataset performance and can accelerate learning on new, smaller datasets via pre-training.
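
Since the abstract only names the Cube R-CNN model, the toy function below merely illustrates the underlying geometry of lifting a per-detection prediction (projected center, depth, dimensions, yaw) into a camera-space cuboid given that image's intrinsics; the parameterization is a simplification for exposition, not the actual Cube R-CNN head.

```python
# Toy illustration: turn a predicted (center pixel, depth, dimensions, yaw)
# into the 8 corners of a 3D box in the camera frame. Simplified assumptions.
import torch

def lift_to_3d_box(u, v, depth, dims, yaw, K):
    # (u, v): projected box center in pixels; depth: metric z along the optical
    # axis; dims: (w, h, l) in meters; yaw: rotation about the camera y-axis;
    # K: (3, 3) camera intrinsics.
    u, v, depth, yaw = (torch.as_tensor(t, dtype=torch.float32) for t in (u, v, depth, yaw))
    w, h, l = torch.as_tensor(dims, dtype=torch.float32)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    center = torch.stack([(u - cx) * depth / fx,       # back-project the 2D center
                          (v - cy) * depth / fy,
                          depth])
    signs = torch.tensor([[ 1,  1,  1,  1, -1, -1, -1, -1],
                          [ 1,  1, -1, -1,  1,  1, -1, -1],
                          [ 1, -1,  1, -1,  1, -1,  1, -1]], dtype=torch.float32)
    corners = signs * torch.stack([w, h, l]).unsqueeze(1) / 2      # (3, 8)
    c, s = torch.cos(yaw), torch.sin(yaw)
    zero, one = torch.zeros(()), torch.ones(())
    R = torch.stack([torch.stack([c, zero, s]),                    # yaw rotation (R_y)
                     torch.stack([zero, one, zero]),
                     torch.stack([-s, zero, c])])
    return R @ corners + center.unsqueeze(1)                       # (3, 8) camera-frame corners

K = torch.tensor([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
print(lift_to_3d_box(350, 250, 2.5, (0.6, 0.8, 0.4), 0.3, K).shape)  # torch.Size([3, 8])
```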

* Project website: https://garrickbrazil.com/omni3d 

Nerfels: Renderable Neural Codes for Improved Camera Pose Estimation

Jun 04, 2022
Gil Avraham, Julian Straub, Tianwei Shen, Tsun-Yi Yang, Hugo Germain, Chris Sweeney, Vasileios Balntas, David Novotny, Daniel DeTone, Richard Newcombe

This paper presents a framework that combines traditional keypoint-based camera pose optimization with an invertible neural rendering mechanism. Our proposed 3D scene representation, Nerfels, is locally dense yet globally sparse. As opposed to existing invertible neural rendering systems, which overfit a model to the entire scene, we adopt a feature-driven approach for representing scene-agnostic, local 3D patches with renderable codes. By modelling a scene only where local features are detected, our framework effectively generalizes to unseen local regions in the scene via an optimizable code conditioning mechanism in the neural renderer, all while maintaining the low memory footprint of a sparse 3D map representation. Our model can be incorporated into existing state-of-the-art hand-crafted and learned local feature pose estimators, yielding improved performance when evaluated on ScanNet for wide camera baseline scenarios.
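
A minimal sketch of the locally-dense, globally-sparse idea: one optimizable latent code per detected keypoint and a shared code-conditioned MLP that only contributes within a small patch around that keypoint. Module sizes, the patch radius, and the output parameterization are illustrative assumptions rather than the paper's model.

```python
# Sketch of code-conditioned local patches (assumed sizes, not the paper's model).
import torch
import torch.nn as nn

class PatchRenderer(nn.Module):
    def __init__(self, code_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(code_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))                     # e.g. (r, g, b, density)

    def forward(self, code, pts_local):               # pts_local: (N, 3) in patch frame
        inp = torch.cat([code.expand(len(pts_local), -1), pts_local], dim=-1)
        return self.mlp(inp)

renderer = PatchRenderer()                            # shared across all patches
keypoints = torch.rand(20, 3)                         # detected 3D keypoints (dummy)
codes = nn.ParameterList(nn.Parameter(torch.zeros(64)) for _ in keypoints)

def query(world_pts, radius=0.1):
    # Each patch only contributes within a small radius of its keypoint, so the
    # map stays globally sparse while individual patches are locally dense.
    out = torch.zeros(len(world_pts), 4)
    for kp, code in zip(keypoints, codes):
        local = world_pts - kp                        # patch-centric coordinates
        inside = (local.norm(dim=-1, keepdim=True) < radius).float()
        out = out + inside * renderer(code, local)
    return out
```

In a pose refinement loop, the per-patch codes (and the camera pose) would be optimized against a rendering loss while the shared renderer weights stay fixed.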

* Published at CVPRW with supplementary material 

ODAM: Object Detection, Association, and Mapping using Posed RGB Video

Aug 23, 2021
Kejie Li, Daniel DeTone, Steven Chen, Minh Vo, Ian Reid, Hamid Rezatofighi, Chris Sweeney, Julian Straub, Richard Newcombe

Localizing objects and estimating their extent in 3D is an important step towards high-level 3D scene understanding, which has many applications in Augmented Reality and Robotics. We present ODAM, a system for 3D Object Detection, Association, and Mapping using posed RGB videos. The proposed system relies on a deep learning front-end to detect 3D objects from a given RGB frame and associate them to a global object-based map using a graph neural network (GNN). Based on these frame-to-model associations, our back-end optimizes object bounding volumes, represented as super-quadrics, under multi-view geometry constraints and the object scale prior. We validate the proposed system on ScanNet where we show a significant improvement over existing RGB-only methods.
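
The sketch below illustrates the frame-to-model association step in a simplified form: per-frame detection embeddings are matched to existing map objects from a pairwise similarity matrix, with a Hungarian assignment standing in for the paper's learned GNN matcher. The threshold and the embedding inputs are assumptions.

```python
# Simplified frame-to-map association (Hungarian matching stands in for the GNN).
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(frame_embs, map_embs, min_sim=0.5):
    # frame_embs: (F, D) embeddings of this frame's detections
    # map_embs:   (M, D) embeddings of objects already in the map
    sim = frame_embs @ map_embs.T                     # (F, M) similarity scores
    rows, cols = linear_sum_assignment(-sim)          # maximize total similarity
    matches, new_objects = [], []
    for f, m in zip(rows, cols):
        if sim[f, m] >= min_sim:
            matches.append((f, m))                    # update an existing map object
        else:
            new_objects.append(int(f))                # spawn a new map object
    unmatched = set(range(len(frame_embs))) - set(int(r) for r in rows)
    new_objects.extend(sorted(unmatched))
    return matches, new_objects
```

In the full system, the matched detections would then feed the back-end optimization of per-object bounding volumes under multi-view constraints.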

* Accepted in ICCV 2021 as oral 

FroDO: From Detections to 3D Objects

May 11, 2020
Kejie Li, Martin Rünz, Meng Tang, Lingni Ma, Chen Kong, Tanner Schmidt, Ian Reid, Lourdes Agapito, Julian Straub, Steven Lovegrove, Richard Newcombe

Object-oriented maps are important for scene understanding since they jointly capture geometry and semantics and allow individual instantiation of, and meaningful reasoning about, objects. We introduce FroDO, a method for accurate 3D reconstruction of object instances from RGB video that infers object location, pose and shape in a coarse-to-fine manner. Key to FroDO is to embed object shapes in a novel learnt space that allows seamless switching between sparse point cloud and dense DeepSDF decoding. Given an input sequence of localized RGB frames, FroDO first aggregates 2D detections to instantiate a category-aware 3D bounding box per object. A shape code is regressed using an encoder network before shape and pose are further optimized under the learnt shape priors using sparse and dense shape representations. The optimization uses multi-view geometric, photometric and silhouette losses. We evaluate on real-world datasets, including Pix3D, Redwood-OS, and ScanNet, for single-view, multi-view, and multi-object reconstruction.
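
A simplified sketch of the refinement stage: a regressed shape code and an object pose are jointly optimized so that observed 3D points lie on the zero level set of a DeepSDF-style decoder, with a code regularizer as a stand-in for the learnt shape prior. The decoder, the translation-only pose, and the loss weights are assumptions; the paper additionally uses photometric and silhouette terms.

```python
# Sketch of joint shape-code and pose refinement against a DeepSDF-style decoder.
import torch
import torch.nn as nn

class SDFDecoder(nn.Module):                          # DeepSDF-style stand-in
    def __init__(self, code_dim=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, code, pts):                     # pts: (N, 3) in object frame
        return self.net(torch.cat([code.expand(len(pts), -1), pts], dim=-1))

decoder = SDFDecoder()                                # stands in for a pretrained decoder
for p in decoder.parameters():                        # keep the shape prior fixed
    p.requires_grad_(False)

code = nn.Parameter(torch.zeros(64))                  # initialized by the encoder network
t = nn.Parameter(torch.zeros(3))                      # object translation (pose subset)
obs_pts = torch.rand(512, 3) - 0.5                    # stand-in for triangulated points
opt = torch.optim.Adam([code, t], lr=1e-2)
for _ in range(100):
    sdf = decoder(code, obs_pts - t)                  # observed points in object frame
    loss = sdf.abs().mean() + 1e-3 * code.pow(2).mean()   # surface term + code prior
    opt.zero_grad(); loss.backward(); opt.step()
```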

* To be published in CVPR 2020. The first two authors contributed equally 

Deep Local Shapes: Learning Local SDF Priors for Detailed 3D Reconstruction

Apr 11, 2020
Rohan Chabra, Jan Eric Lenssen, Eddy Ilg, Tanner Schmidt, Julian Straub, Steven Lovegrove, Richard Newcombe

Efficiently reconstructing complex and intricate surfaces at scale is a long-standing goal in machine perception. To address this problem we introduce Deep Local Shapes (DeepLS), a deep shape representation that enables encoding and reconstruction of high-quality 3D shapes without prohibitive memory requirements. DeepLS replaces the dense volumetric signed distance function (SDF) representation used in traditional surface reconstruction systems with a set of locally learned continuous SDFs defined by a neural network, inspired by recent work such as DeepSDF. Unlike DeepSDF, which represents an object-level SDF with a neural network and a single latent code, we store a grid of independent latent codes, each responsible for storing information about surfaces in a small local neighborhood. This decomposition of scenes into local shapes simplifies the prior distribution that the network must learn, and also enables efficient inference. We demonstrate the effectiveness and generalization power of DeepLS by showing object shape encoding and reconstructions of full scenes, where DeepLS delivers high compression, accuracy, and local shape completion.
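
A minimal sketch of the decomposition described above: a regular grid of latent codes, one per local cell, decoded by a single shared SDF network that receives the cell's code together with cell-relative coordinates. Grid size, code size, and the decoder architecture are illustrative assumptions, not the trained DeepLS model.

```python
# Sketch of a grid of local latent codes with a shared SDF decoder.
import torch
import torch.nn as nn

class LocalSDF(nn.Module):
    def __init__(self, grid=(16, 16, 16), cell=0.25, code_dim=32, hidden=128):
        super().__init__()
        self.cell = cell
        self.grid = grid
        self.codes = nn.Parameter(torch.zeros(*grid, code_dim))    # one code per cell
        self.decoder = nn.Sequential(                              # shared across all cells
            nn.Linear(code_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, pts):
        # pts: (N, 3) points inside the grid volume [0, grid * cell)
        idx = (pts / self.cell).long()                             # cell index per point
        idx = torch.stack([idx[:, d].clamp(0, self.grid[d] - 1) for d in range(3)], dim=1)
        local = pts / self.cell - idx.float() - 0.5                # coordinates within the cell
        codes = self.codes[idx[:, 0], idx[:, 1], idx[:, 2]]        # (N, code_dim)
        return self.decoder(torch.cat([codes, local], dim=-1))     # (N, 1) signed distance

sdf = LocalSDF()
print(sdf(torch.rand(8, 3) * 16 * 0.25).shape)                     # torch.Size([8, 1])
```

Because every cell stores only a small code and shares the decoder, the prior each cell must capture is simpler than a whole-object SDF, which is the point of the local decomposition.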
