Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Vincent Sitzmann

MIT EECS

FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent

Apr 23, 2024

Cameron Smith, David Charatan, Ayush Tewari, Vincent Sitzmann

Figure 1 for FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent

Figure 2 for FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent

Figure 3 for FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent

Figure 4 for FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent

Abstract:This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Our method performs per-video gradient-descent minimization of a simple least-squares objective that compares the optical flow induced by depth, intrinsics, and poses against correspondences obtained via off-the-shelf optical flow and point tracking. Alongside the use of point tracks to encourage long-term geometric consistency, we introduce differentiable re-parameterizations of depth, intrinsics, and pose that are amenable to first-order optimization. We empirically show that camera parameters and dense depth recovered by our method enable photo-realistic novel view synthesis on 360-degree trajectories using Gaussian Splatting. Our method not only far outperforms prior gradient-descent based bundle adjustment methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM method, on the downstream task of 360-degree novel view synthesis (even though our method is purely gradient-descent based, fully differentiable, and presents a complete departure from conventional SfM).

* Project website: https://cameronosmith.github.io/flowmap/

Via

Access Paper or Ask Questions

DittoGym: Learning to Control Soft Shape-Shifting Robots

Jan 29, 2024

Suning Huang, Boyuan Chen, Huazhe Xu, Vincent Sitzmann

Figure 1 for DittoGym: Learning to Control Soft Shape-Shifting Robots

Figure 2 for DittoGym: Learning to Control Soft Shape-Shifting Robots

Figure 3 for DittoGym: Learning to Control Soft Shape-Shifting Robots

Figure 4 for DittoGym: Learning to Control Soft Shape-Shifting Robots

Abstract:Robot co-design, where the morphology of a robot is optimized jointly with a learned policy to solve a specific task, is an emerging area of research. It holds particular promise for soft robots, which are amenable to novel manufacturing techniques that can realize learned morphologies and actuators. Inspired by nature and recent novel robot designs, we propose to go a step further and explore the novel reconfigurable robots, defined as robots that can change their morphology within their lifetime. We formalize control of reconfigurable soft robots as a high-dimensional reinforcement learning (RL) problem. We unify morphology change, locomotion, and environment interaction in the same action space, and introduce an appropriate, coarse-to-fine curriculum that enables us to discover policies that accomplish fine-grained control of the resulting robots. We also introduce DittoGym, a comprehensive RL benchmark for reconfigurable soft robots that require fine-grained morphology changes to accomplish the tasks. Finally, we evaluate our proposed coarse-to-fine algorithm on DittoGym and demonstrate robots that learn to change their morphology several times within a sequence, uniquely enabled by our RL algorithm. More results are available at https://dittogym.github.io.

Via

Access Paper or Ask Questions

pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

Dec 21, 2023

David Charatan, Sizhe Li, Andrea Tagliasacchi, Vincent Sitzmann

Figure 1 for pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

Figure 2 for pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

Figure 3 for pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

Figure 4 for pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

Abstract:We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D radiance fields parameterized by 3D Gaussian primitives from pairs of images. Our model features real-time and memory-efficient rendering for scalable training as well as fast 3D reconstruction at inference time. To overcome local minima inherent to sparse and locally supported representations, we predict a dense probability distribution over 3D and sample Gaussian means from that probability distribution. We make this sampling operation differentiable via a reparameterization trick, allowing us to back-propagate gradients through the Gaussian splatting representation. We benchmark our method on wide-baseline novel view synthesis on the real-world RealEstate10k and ACID datasets, where we outperform state-of-the-art light field transformers and accelerate rendering by 2.5 orders of magnitude while reconstructing an interpretable and editable 3D radiance field.

* Project page: https://dcharatan.github.io/pixelsplat

Via

Access Paper or Ask Questions

Intrinsic Image Diffusion for Single-view Material Estimation

Dec 19, 2023

Peter Kocsis, Vincent Sitzmann, Matthias Nießner

Abstract:We present Intrinsic Image Diffusion, a generative model for appearance decomposition of indoor scenes. Given a single input view, we sample multiple possible material explanations represented as albedo, roughness, and metallic maps. Appearance decomposition poses a considerable challenge in computer vision due to the inherent ambiguity between lighting and material properties and the lack of real datasets. To address this issue, we advocate for a probabilistic formulation, where instead of attempting to directly predict the true material properties, we employ a conditional generative model to sample from the solution space. Furthermore, we show that utilizing the strong learned prior of recent diffusion models trained on large-scale real-world images can be adapted to material estimation and highly improves the generalization to real images. Our method produces significantly sharper, more consistent, and more detailed materials, outperforming state-of-the-art methods by $1.5dB$ on PSNR and by $45\%$ better FID score on albedo prediction. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets.

* Project page: https://peter-kocsis.github.io/IntrinsicImageDiffusion/ Video: https://youtu.be/lz0meJlj5cA

Via

Access Paper or Ask Questions

Variational Barycentric Coordinates

Oct 05, 2023

Ana Dodik, Oded Stein, Vincent Sitzmann, Justin Solomon

Figure 1 for Variational Barycentric Coordinates

Figure 2 for Variational Barycentric Coordinates

Figure 3 for Variational Barycentric Coordinates

Figure 4 for Variational Barycentric Coordinates

Abstract:We propose a variational technique to optimize for generalized barycentric coordinates that offers additional control compared to existing models. Prior work represents barycentric coordinates using meshes or closed-form formulae, in practice limiting the choice of objective function. In contrast, we directly parameterize the continuous function that maps any coordinate in a polytope's interior to its barycentric coordinates using a neural field. This formulation is enabled by our theoretical characterization of barycentric coordinates, which allows us to construct neural fields that parameterize the entire function class of valid coordinates. We demonstrate the flexibility of our model using a variety of objective functions, including multiple smoothness and deformation-aware energies; as a side contribution, we also present mathematically-justified means of measuring and minimizing objectives like total variation on discontinuous neural fields. We offer a practical acceleration strategy, present a thorough validation of our algorithm, and demonstrate several applications.

* https://anadodik.github.io/

Via

Access Paper or Ask Questions

Approaching human 3D shape perception with neurally mappable models

Sep 07, 2023

Thomas P. O'Connell, Tyler Bonnen, Yoni Friedman, Ayush Tewari, Josh B. Tenenbaum, Vincent Sitzmann, Nancy Kanwisher

Abstract:Humans effortlessly infer the 3D shape of objects. What computations underlie this ability? Although various computational models have been proposed, none of them capture the human ability to match object shape across viewpoints. Here, we ask whether and how this gap might be closed. We begin with a relatively novel class of computational models, 3D neural fields, which encapsulate the basic principles of classic analysis-by-synthesis in a deep neural network (DNN). First, we find that a 3D Light Field Network (3D-LFN) supports 3D matching judgments well aligned to humans for within-category comparisons, adversarially-defined comparisons that accentuate the 3D failure cases of standard DNN models, and adversarially-defined comparisons for algorithmically generated shapes with no category structure. We then investigate the source of the 3D-LFN's ability to achieve human-aligned performance through a series of computational experiments. Exposure to multiple viewpoints of objects during training and a multi-view learning objective are the primary factors behind model-human alignment; even conventional DNN architectures come much closer to human behavior when trained with multi-view objectives. Finally, we find that while the models trained with multi-view learning objectives are able to partially generalize to new object categories, they fall short of human alignment. This work provides a foundation for understanding human shape inferences within neurally mappable computational architectures.

Via

Access Paper or Ask Questions

Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision

Jun 20, 2023

Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Joshua B. Tenenbaum, Frédo Durand, William T. Freeman, Vincent Sitzmann

Figure 1 for Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision

Figure 2 for Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision

Figure 3 for Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision

Figure 4 for Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision

Abstract:Denoising diffusion models are a powerful type of generative models used to capture complex distributions of real-world signals. However, their applicability is limited to scenarios where training samples are readily available, which is not always the case in real-world applications. For example, in inverse graphics, the goal is to generate samples from a distribution of 3D scenes that align with a given image, but ground-truth 3D scenes are unavailable and only 2D images are accessible. To address this limitation, we propose a novel class of denoising diffusion probabilistic models that learn to sample from distributions of signals that are never directly observed. Instead, these signals are measured indirectly through a known differentiable forward model, which produces partial observations of the unknown signal. Our approach involves integrating the forward model directly into the denoising process. This integration effectively connects the generative modeling of observations with the generative modeling of the underlying signals, allowing for end-to-end training of a conditional generative model over signals. During inference, our approach enables sampling from the distribution of underlying signals that are consistent with a given partial observation. We demonstrate the effectiveness of our method on three challenging computer vision tasks. For instance, in the context of inverse graphics, our model enables direct sampling from the distribution of 3D scenes that align with a single 2D input image.

* Project page: https://diffusion-with-forward-models.github.io/

Via

Access Paper or Ask Questions

FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow

May 31, 2023

Cameron Smith, Yilun Du, Ayush Tewari, Vincent Sitzmann

Abstract:Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning. The key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion, which is prohibitively expensive to run at scale. We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass. We estimate poses by first lifting frame-to-frame optical flow to 3D scene flow via differentiable rendering, preserving locality and shift-equivariance of the image processing backbone. SE(3) camera pose estimation is then performed via a weighted least-squares fit to the scene flow field. This formulation enables us to jointly supervise pose estimation and a generalizable neural scene representation via re-rendering the input video, and thus, train end-to-end and fully self-supervised on real-world video datasets. We demonstrate that our method performs robustly on diverse, real-world video, notably on sequences traditionally challenging to optimization-based pose estimation techniques.

* Project website: http://cameronosmith.github.io/flowcam

Via

Access Paper or Ask Questions

Learning to Render Novel Views from Wide-Baseline Stereo Pairs

Apr 17, 2023

Yilun Du, Cameron Smith, Ayush Tewari, Vincent Sitzmann

Figure 1 for Learning to Render Novel Views from Wide-Baseline Stereo Pairs

Figure 2 for Learning to Render Novel Views from Wide-Baseline Stereo Pairs

Figure 3 for Learning to Render Novel Views from Wide-Baseline Stereo Pairs

Figure 4 for Learning to Render Novel Views from Wide-Baseline Stereo Pairs

Abstract:We introduce a method for novel view synthesis given only a single wide-baseline stereo image pair. In this challenging regime, 3D scene points are regularly observed only once, requiring prior-based reconstruction of scene geometry and appearance. We find that existing approaches to novel view synthesis from sparse observations fail due to recovering incorrect 3D geometry and due to the high cost of differentiable rendering that precludes their scaling to large-scale training. We take a step towards resolving these shortcomings by formulating a multi-view transformer encoder, proposing an efficient, image-space epipolar line sampling scheme to assemble image features for a target ray, and a lightweight cross-attention-based renderer. Our contributions enable training of our method on a large-scale real-world dataset of indoor and outdoor scenes. We demonstrate that our method learns powerful multi-view geometry priors while reducing the rendering time. We conduct extensive comparisons on held-out test scenes across two real-world datasets, significantly outperforming prior work on novel view synthesis from sparse image observations and achieving multi-view-consistent novel view synthesis.

* CVPR 2023, Project Webpage: https://yilundu.github.io/wide_baseline/, Last Two Authors Equal Advising

Via

Access Paper or Ask Questions

DeLiRa: Self-Supervised Depth, Light, and Radiance Fields

Apr 06, 2023

Vitor Guizilini, Igor Vasiljevic, Jiading Fang, Rares Ambrus, Sergey Zakharov, Vincent Sitzmann, Adrien Gaidon

Figure 1 for DeLiRa: Self-Supervised Depth, Light, and Radiance Fields

Figure 2 for DeLiRa: Self-Supervised Depth, Light, and Radiance Fields

Figure 3 for DeLiRa: Self-Supervised Depth, Light, and Radiance Fields

Figure 4 for DeLiRa: Self-Supervised Depth, Light, and Radiance Fields

Abstract:Differentiable volumetric rendering is a powerful paradigm for 3D reconstruction and novel view synthesis. However, standard volume rendering approaches struggle with degenerate geometries in the case of limited viewpoint diversity, a common scenario in robotics applications. In this work, we propose to use the multi-view photometric objective from the self-supervised depth estimation literature as a geometric regularizer for volumetric rendering, significantly improving novel view synthesis without requiring additional information. Building upon this insight, we explore the explicit modeling of scene geometry using a generalist Transformer, jointly learning a radiance field as well as depth and light fields with a set of shared latent codes. We demonstrate that sharing geometric information across tasks is mutually beneficial, leading to improvements over single-task learning without an increase in network complexity. Our DeLiRa architecture achieves state-of-the-art results on the ScanNet benchmark, enabling high quality volumetric rendering as well as real-time novel view and depth synthesis in the limited viewpoint diversity setting.

* Project page: https://sites.google.com/view/tri-delira

Via

Access Paper or Ask Questions