Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shubham Tulsiani

Monocular Dynamic View Synthesis: A Reality Check

Oct 24, 2022

Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, Angjoo Kanazawa

Figure 1 for Monocular Dynamic View Synthesis: A Reality Check

Figure 2 for Monocular Dynamic View Synthesis: A Reality Check

Figure 3 for Monocular Dynamic View Synthesis: A Reality Check

Figure 4 for Monocular Dynamic View Synthesis: A Reality Check

Abstract:We study the recent progress on dynamic view synthesis (DVS) from monocular video. Though existing approaches have demonstrated impressive results, we show a discrepancy between the practical capture process and the existing experimental protocols, which effectively leaks in multi-view signals during training. We define effective multi-view factors (EMFs) to quantify the amount of multi-view signal present in the input capture sequence based on the relative camera-scene motion. We introduce two new metrics: co-visibility masked image metrics and correspondence accuracy, which overcome the issue in existing protocols. We also propose a new iPhone dataset that includes more diverse real-life deformation sequences. Using our proposed experimental protocol, we show that the state-of-the-art approaches observe a 1-2 dB drop in masked PSNR in the absence of multi-view cues and 4-5 dB drop when modeling complex motion. Code and data can be found at https://hangg7.com/dycheck.

* NeurIPS 2022. Project page: https://hangg7.com/dycheck. Code: https://github.com/KAIR-BAIR/dycheck

Via

Access Paper or Ask Questions

RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild

Aug 11, 2022

Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani

Figure 1 for RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild

Figure 2 for RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild

Figure 3 for RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild

Figure 4 for RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild

Abstract:We describe a data-driven method for inferring the camera viewpoints given multiple images of an arbitrary object. This task is a core component of classic geometric pipelines such as SfM and SLAM, and also serves as a vital pre-processing requirement for contemporary neural approaches (e.g. NeRF) to object reconstruction and view synthesis. In contrast to existing correspondence-driven methods that do not perform well given sparse views, we propose a top-down prediction based approach for estimating camera viewpoints. Our key technical insight is the use of an energy-based formulation for representing distributions over relative camera rotations, thus allowing us to explicitly represent multiple camera modes arising from object symmetries or views. Leveraging these relative predictions, we jointly estimate a consistent set of camera rotations from multiple images. We show that our approach outperforms state-of-the-art SfM and SLAM methods given sparse images on both seen and unseen categories. Further, our probabilistic approach significantly outperforms directly regressing relative poses, suggesting that modeling multimodality is important for coherent joint reconstruction. We demonstrate that our system can be a stepping stone toward in-the-wild reconstruction from multi-view datasets. The project page with code and videos can be found at https://jasonyzhang.com/relpose.

* In ECCV 2022

Via

Access Paper or Ask Questions

What's in your hands? 3D Reconstruction of Generic Objects in Hands

Apr 14, 2022

Yufei Ye, Abhinav Gupta, Shubham Tulsiani

Figure 1 for What's in your hands? 3D Reconstruction of Generic Objects in Hands

Figure 2 for What's in your hands? 3D Reconstruction of Generic Objects in Hands

Figure 3 for What's in your hands? 3D Reconstruction of Generic Objects in Hands

Figure 4 for What's in your hands? 3D Reconstruction of Generic Objects in Hands

Abstract:Our work aims to reconstruct hand-held objects given a single RGB image. In contrast to prior works that typically assume known 3D templates and reduce the problem to 3D pose estimation, our work reconstructs generic hand-held object without knowing their 3D templates. Our key insight is that hand articulation is highly predictive of the object shape, and we propose an approach that conditionally reconstructs the object based on the articulation and the visual input. Given an image depicting a hand-held object, we first use off-the-shelf systems to estimate the underlying hand pose and then infer the object shape in a normalized hand-centric coordinate frame. We parameterized the object by signed distance which are inferred by an implicit network which leverages the information from both visual feature and articulation-aware coordinates to process a query point. We perform experiments across three datasets and show that our method consistently outperforms baselines and is able to reconstruct a diverse set of objects. We analyze the benefits and robustness of explicit articulation conditioning and also show that this allows the hand pose estimation to further improve in test-time optimization.

* accepted to CVPR 22

Via

Access Paper or Ask Questions

Pre-train, Self-train, Distill: A simple recipe for Supersizing 3D Reconstruction

Apr 07, 2022

Kalyan Vasudev Alwala, Abhinav Gupta, Shubham Tulsiani

Figure 1 for Pre-train, Self-train, Distill: A simple recipe for Supersizing 3D Reconstruction

Figure 2 for Pre-train, Self-train, Distill: A simple recipe for Supersizing 3D Reconstruction

Figure 3 for Pre-train, Self-train, Distill: A simple recipe for Supersizing 3D Reconstruction

Figure 4 for Pre-train, Self-train, Distill: A simple recipe for Supersizing 3D Reconstruction

Abstract:Our work learns a unified model for single-view 3D reconstruction of objects from hundreds of semantic categories. As a scalable alternative to direct 3D supervision, our work relies on segmented image collections for learning 3D of generic categories. Unlike prior works that use similar supervision but learn independent category-specific models from scratch, our approach of learning a unified model simplifies the training process while also allowing the model to benefit from the common structure across categories. Using image collections from standard recognition datasets, we show that our approach allows learning 3D inference for over 150 object categories. We evaluate using two datasets and qualitatively and quantitatively show that our unified reconstruction approach improves over prior category-specific reconstruction baselines. Our final 3D reconstruction model is also capable of zero-shot inference on images from unseen object categories and we empirically show that increasing the number of training categories improves the reconstruction quality.

* To appear in CVPR 22. Project page: https://shubhtuls.github.io/ss3d/

Via

Access Paper or Ask Questions

AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation

Mar 17, 2022

Paritosh Mittal, Yen-Chi Cheng, Maneesh Singh, Shubham Tulsiani

Figure 1 for AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation

Figure 2 for AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation

Figure 3 for AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation

Figure 4 for AutoSDF: Shape Priors for 3D Completion, Reconstruction and Generation

Abstract:Powerful priors allow us to perform inference with insufficient information. In this paper, we propose an autoregressive prior for 3D shapes to solve multimodal 3D tasks such as shape completion, reconstruction, and generation. We model the distribution over 3D shapes as a non-sequential autoregressive distribution over a discretized, low-dimensional, symbolic grid-like latent representation of 3D shapes. This enables us to represent distributions over 3D shapes conditioned on information from an arbitrary set of spatially anchored query locations and thus perform shape completion in such arbitrary settings (e.g., generating a complete chair given only a view of the back leg). We also show that the learned autoregressive prior can be leveraged for conditional tasks such as single-view reconstruction and language-based generation. This is achieved by learning task-specific naive conditionals which can be approximated by light-weight models trained on minimal paired data. We validate the effectiveness of the proposed method using both quantitative and qualitative evaluation and show that the proposed method outperforms the specialized state-of-the-art methods trained for individual tasks. The project page with code and video visualizations can be found at https://yccyenchicheng.github.io/AutoSDF/.

* In CVPR 2022. The first two authors contributed equally to this work. Project: https://yccyenchicheng.github.io/AutoSDF/

Via

Access Paper or Ask Questions

A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation

Nov 09, 2021

Bernardo Aceituno, Alberto Rodriguez, Shubham Tulsiani, Abhinav Gupta, Mustafa Mukadam

Figure 1 for A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation

Figure 2 for A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation

Figure 3 for A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation

Figure 4 for A Differentiable Recipe for Learning Visual Non-Prehensile Planar Manipulation

Abstract:Specifying tasks with videos is a powerful technique towards acquiring novel and general robot skills. However, reasoning over mechanics and dexterous interactions can make it challenging to scale learning contact-rich manipulation. In this work, we focus on the problem of visual non-prehensile planar manipulation: given a video of an object in planar motion, find contact-aware robot actions that reproduce the same object motion. We propose a novel architecture, Differentiable Learning for Manipulation (\ours), that combines video decoding neural models with priors from contact mechanics by leveraging differentiable optimization and finite difference based simulation. Through extensive simulated experiments, we investigate the interplay between traditional model-based techniques and modern deep learning approaches. We find that our modular and fully differentiable architecture performs better than learning-only methods on unseen objects and motions. \url{https://github.com/baceituno/dlm}.

* Presented at CORL 2021

Via

Access Paper or Ask Questions

No RL, No Simulation: Learning to Navigate without Navigating

Oct 22, 2021

Meera Hahn, Devendra Chaplot, Shubham Tulsiani, Mustafa Mukadam, James M. Rehg, Abhinav Gupta

Figure 1 for No RL, No Simulation: Learning to Navigate without Navigating

Figure 2 for No RL, No Simulation: Learning to Navigate without Navigating

Figure 3 for No RL, No Simulation: Learning to Navigate without Navigating

Figure 4 for No RL, No Simulation: Learning to Navigate without Navigating

Abstract:Most prior methods for learning navigation policies require access to simulation environments, as they need online policy interaction and rely on ground-truth maps for rewards. However, building simulators is expensive (requires manual effort for each and every scene) and creates challenges in transferring learned policies to robotic platforms in the real-world, due to the sim-to-real domain gap. In this paper, we pose a simple question: Do we really need active interaction, ground-truth maps or even reinforcement-learning (RL) in order to solve the image-goal navigation task? We propose a self-supervised approach to learn to navigate from only passive videos of roaming. Our approach, No RL, No Simulator (NRNS), is simple and scalable, yet highly effective. NRNS outperforms RL-based formulations by a significant margin. We present NRNS as a strong baseline for any future image-based navigation tasks that use RL or Simulation.

Via

Access Paper or Ask Questions

NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild

Oct 18, 2021

Jason Y. Zhang, Gengshan Yang, Shubham Tulsiani, Deva Ramanan

Figure 1 for NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild

Figure 2 for NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild

Figure 3 for NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild

Figure 4 for NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild

Abstract:Recent history has seen a tremendous growth of work exploring implicit representations of geometry and radiance, popularized through Neural Radiance Fields (NeRF). Such works are fundamentally based on a (implicit) volumetric representation of occupancy, allowing them to model diverse scene structure including translucent objects and atmospheric obscurants. But because the vast majority of real-world scenes are composed of well-defined surfaces, we introduce a surface analog of such implicit models called Neural Reflectance Surfaces (NeRS). NeRS learns a neural shape representation of a closed surface that is diffeomorphic to a sphere, guaranteeing water-tight reconstructions. Even more importantly, surface parameterizations allow NeRS to learn (neural) bidirectional surface reflectance functions (BRDFs) that factorize view-dependent appearance into environmental illumination, diffuse color (albedo), and specular "shininess." Finally, rather than illustrating our results on synthetic scenes or controlled in-the-lab capture, we assemble a novel dataset of multi-view images from online marketplaces for selling goods. Such "in-the-wild" multi-view image sets pose a number of challenges, including a small number of views with unknown/rough camera estimates. We demonstrate that surface-based neural reconstructions enable learning from such data, outperforming volumetric neural rendering-based reconstructions. We hope that NeRS serves as a first step toward building scalable, high-quality libraries of real-world shape, materials, and illumination. The project page with code and video visualizations can be found at https://jasonyzhang.com/ners.

* In NeurIPS 2021. v2-3: Fixed minor typos

Via

Access Paper or Ask Questions

PixelTransformer: Sample Conditioned Signal Generation

Mar 29, 2021

Shubham Tulsiani, Abhinav Gupta

Figure 1 for PixelTransformer: Sample Conditioned Signal Generation

Figure 2 for PixelTransformer: Sample Conditioned Signal Generation

Figure 3 for PixelTransformer: Sample Conditioned Signal Generation

Figure 4 for PixelTransformer: Sample Conditioned Signal Generation

Abstract:We propose a generative model that can infer a distribution for the underlying spatial signal conditioned on sparse samples e.g. plausible images given a few observed pixels. In contrast to sequential autoregressive generative models, our model allows conditioning on arbitrary samples and can answer distributional queries for any location. We empirically validate our approach across three image datasets and show that we learn to generate diverse and meaningful samples, with the distribution variance reducing given more observed pixels. We also show that our approach is applicable beyond images and can allow generating other types of spatial outputs e.g. polynomials, 3D shapes, and videos.

* Project page: https://shubhtuls.github.io/PixelTransformer/

Via

Access Paper or Ask Questions

Shelf-Supervised Mesh Prediction in the Wild

Feb 11, 2021

Yufei Ye, Shubham Tulsiani, Abhinav Gupta

Figure 1 for Shelf-Supervised Mesh Prediction in the Wild

Figure 2 for Shelf-Supervised Mesh Prediction in the Wild

Figure 3 for Shelf-Supervised Mesh Prediction in the Wild

Figure 4 for Shelf-Supervised Mesh Prediction in the Wild

Abstract:We aim to infer 3D shape and pose of object from a single image and propose a learning-based approach that can train from unstructured image collections, supervised by only segmentation outputs from off-the-shelf recognition systems (i.e. 'shelf-supervised'). We first infer a volumetric representation in a canonical frame, along with the camera pose. We enforce the representation geometrically consistent with both appearance and masks, and also that the synthesized novel views are indistinguishable from image collections. The coarse volumetric prediction is then converted to a mesh-based representation, which is further refined in the predicted camera frame. These two steps allow both shape-pose factorization from image collections and per-instance reconstruction in finer details. We examine the method on both synthetic and real-world datasets and demonstrate its scalability on 50 categories in the wild, an order of magnitude more classes than existing works.

Via

Access Paper or Ask Questions