William T. Freeman

SLIDE: Single Image 3D Photography with Soft Layering and Depth-aware Inpainting

Sep 02, 2021

Varun Jampani, Huiwen Chang, Kyle Sargent, Abhishek Kar, Richard Tucker, Michael Krainin, Dominik Kaeser, William T. Freeman, David Salesin, Brian Curless (+1 more)

Abstract:Single image 3D photography enables viewers to view a still image from novel viewpoints. Recent approaches combine monocular depth networks with inpainting networks to achieve compelling results. A drawback of these techniques is the use of hard depth layering, making them unable to model intricate appearance details such as thin hair-like structures. We present SLIDE, a modular and unified system for single image 3D photography that uses a simple yet effective soft layering strategy to better preserve appearance details in novel views. In addition, we propose a novel depth-aware training strategy for our inpainting module, better suited for the 3D photography task. The resulting SLIDE approach is modular, enabling the use of other components such as segmentation and matting for improved layering. At the same time, SLIDE uses an efficient layered depth formulation that only requires a single forward pass through the component networks to produce high quality 3D photos. Extensive experimental analysis on three view-synthesis datasets, in combination with user studies on in-the-wild image collections, demonstrates superior performance of our technique in comparison to existing strong baselines while being conceptually much simpler. Project page: https://varunjampani.github.io/slide

* ICCV 2021 (Oral); Project page: https://varunjampani.github.io/slide ; Video: https://www.youtube.com/watch?v=RQio7q-ueY8
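
To make the contrast between hard and soft layering in the SLIDE abstract concrete, here is a minimal Python sketch. It is not the SLIDE implementation: the smooth (sigmoid-shaped) weighting, the `threshold` and `softness` parameters, and all function names are assumptions chosen for illustration. A hard layer assigns each pixel entirely to the foreground or the background, while a soft layer lets a pixel near a depth boundary contribute partially to both.

```python
import math


def hard_alpha(depth, threshold):
    """Hard layering: a pixel is entirely foreground (1.0) or background (0.0)."""
    return 1.0 if depth < threshold else 0.0


def soft_alpha(depth, threshold, softness):
    """Soft layering (assumed sigmoid form): weight falls smoothly with depth.

    Written with tanh, which equals 1 / (1 + exp(x)) but cannot overflow.
    """
    x = (depth - threshold) / softness
    return 0.5 * (1.0 - math.tanh(0.5 * x))


def composite(foreground, background, alpha):
    """Blend one foreground and one background intensity with weight alpha."""
    return alpha * foreground + (1.0 - alpha) * background


# A pixel just behind the depth boundary, e.g. a thin strand of hair.
depth, threshold = 2.05, 2.0
hard = composite(1.0, 0.0, hard_alpha(depth, threshold))
soft = composite(1.0, 0.0, soft_alpha(depth, threshold, softness=0.1))
```

With hard layering the strand is dropped from the foreground entirely, whereas the soft weight keeps a partial contribution, which is the kind of detail the abstract says hard layering cannot model.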

What You Can Learn by Staring at a Blank Wall

Aug 30, 2021

Prafull Sharma, Miika Aittala, Yoav Y. Schechner, Antonio Torralba, Gregory W. Wornell, William T. Freeman, Fredo Durand

Abstract:We present a passive non-line-of-sight method that infers the number of people or activity of a person from the observation of a blank wall in an unknown room. Our technique analyzes complex imperceptible changes in indirect illumination in a video of the wall to reveal a signal that is correlated with motion in the hidden part of a scene. We use this signal to classify between zero, one, or two moving people, or the activity of a person in the hidden scene. We train two convolutional neural networks using data collected from 20 different scenes, and achieve an accuracy of approximately 94% for both tasks in unseen test environments and real-time online settings. Unlike other passive non-line-of-sight methods, the technique does not rely on known occluders or controllable light sources, and generalizes to unknown rooms with no re-calibration. We analyze the generalization and robustness of our method with both real and synthetic data, and study the effect of the scene parameters on the signal quality.
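
The abstract above does not spell out how the imperceptible illumination changes are extracted. One generic way to expose small temporal variations in a video, shown here purely as an assumption and not as the authors' pipeline, is to subtract each pixel's temporal mean and amplify what remains. The function name, the `gain` parameter, and the flattened-frame layout are hypothetical.

```python
def temporal_residual(frames, gain=1.0):
    """Subtract the per-pixel temporal mean from every frame and scale the result.

    frames: list of frames, each a flat list of pixel intensities of equal length.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    num_pixels = len(frames[0])
    mean = [sum(frame[i] for frame in frames) / num_frames
            for i in range(num_pixels)]
    return [[gain * (frame[i] - mean[i]) for i in range(num_pixels)]
            for frame in frames]


# Three tiny two-pixel "frames" of a wall: only the first pixel flickers.
frames = [[10.0, 20.0], [10.5, 20.0], [9.5, 20.0]]
residual = temporal_residual(frames, gain=10.0)
```

Here the static pixel yields an all-zero residual, while the faint flicker in the other pixel becomes a clearly visible signal after amplification.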

Consistent Depth of Moving Objects in Video

Aug 02, 2021

Zhoutong Zhang, Forrester Cole, Richard Tucker, William T. Freeman, Tali Dekel

Abstract:We present a method to estimate depth of a dynamic scene, containing arbitrary moving objects, from an ordinary video captured with a moving camera. We seek a geometrically and temporally consistent solution to this underconstrained problem: the depth predictions of corresponding points across frames should induce plausible, smooth motion in 3D. We formulate this objective in a new test-time training framework where a depth-prediction CNN is trained in tandem with an auxiliary scene-flow prediction MLP over the entire input video. By recursively unrolling the scene-flow prediction MLP over varying time steps, we compute both short-range scene flow to impose local smooth motion priors directly in 3D, and long-range scene flow to impose multi-view consistency constraints with wide baselines. We demonstrate accurate and temporally coherent results on a variety of challenging videos containing diverse moving objects (pets, people, cars), as well as camera motion. Our depth maps give rise to a number of depth-and-motion aware video editing effects such as object and lighting insertion.

* ACM Trans. Graph., Vol. 40, No. 4, Article 148, August 2021
* Published at SIGGRAPH 2021
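
The recursive unrolling of the scene-flow predictor mentioned in the abstract above can be sketched as follows. This is only an illustration: a hand-written function stands in for the scene-flow MLP, and the names `unroll_scene_flow` and `constant_flow` are hypothetical. The idea shown is that a one-step displacement predictor, applied repeatedly, yields both short-range flow (one step) and long-range flow (many steps).

```python
def unroll_scene_flow(flow_fn, point, t, steps):
    """Advance a 3D point by applying a one-step flow predictor `steps` times.

    flow_fn(point, t) returns the displacement of `point` from time t to t + 1.
    """
    x, y, z = point
    for k in range(steps):
        dx, dy, dz = flow_fn((x, y, z), t + k)
        x, y, z = x + dx, y + dy, z + dz
    return (x, y, z)


def constant_flow(point, t):
    """Stand-in for the scene-flow MLP: every point drifts at a fixed velocity."""
    return (1.0, 0.0, 0.5)


start = (0.0, 0.0, 0.0)
short_range = unroll_scene_flow(constant_flow, start, t=0, steps=1)
long_range = unroll_scene_flow(constant_flow, start, t=0, steps=4)
```

In the paper's setting the predicted positions would be compared against depth predictions in nearby and distant frames; that comparison is omitted here.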

THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers

Jun 17, 2021

Mihai Zanfir, Andrei Zanfir, Eduard Gabriel Bazavan, William T. Freeman, Rahul Sukthankar, Cristian Sminchisescu

Abstract:We present THUNDR, a transformer-based deep neural network methodology to reconstruct the 3D pose and shape of people, given monocular RGB images. Key to our methodology is an intermediate 3D marker representation, where we aim to combine the predictive power of model-free-output architectures and the regularizing, anthropometrically-preserving properties of a statistical human surface model like GHUM, a recently introduced, expressive full body statistical 3D human model, trained end-to-end. Our novel transformer-based prediction pipeline can focus on image regions relevant to the task, supports self-supervised regimes, and ensures that solutions are consistent with human anthropometry. We show state-of-the-art results on Human3.6M and 3DPW, for both the fully-supervised and the self-supervised models, for the task of inferring 3D human shape, joint positions, and global translation. Moreover, we observe very solid 3D reconstruction performance for difficult human poses collected in the wild.

Toward Automatic Interpretation of 3D Plots

Jun 14, 2021

Laura E. Brandt, William T. Freeman

Abstract:This paper explores the challenge of teaching a machine how to reverse-engineer the grid-marked surfaces used to represent data in 3D surface plots of two-variable functions. These are common in scientific and economic publications, and humans can often interpret them with ease, quickly gleaning general shape and curvature information from the simple collection of curves. While machines have no such visual intuition, they do have the potential to accurately extract the more detailed quantitative data that guided the surface's construction. We approach this problem by synthesizing a new dataset of 3D grid-marked surfaces (SurfaceGrid) and training a deep neural net to estimate their shape. Our algorithm successfully recovers shape information from synthetic 3D surface plots that have had axes and shading information removed, been rendered with a variety of grid types, and viewed from a range of viewpoints.

* 16 pages, 12 figures, accepted to the 16th International Conference on Document Analysis and Recognition (ICDAR21)

Light Field Networks: Neural Scene Representations with Single-Evaluation Rendering

Jun 04, 2021

Vincent Sitzmann, Semon Rezchikov, William T. Freeman, Joshua B. Tenenbaum, Fredo Durand

Abstract:Inferring representations of 3D scenes from 2D observations is a fundamental problem of computer graphics, computer vision, and artificial intelligence. Emerging 3D-structured neural scene representations are a promising approach to 3D scene understanding. In this work, we propose a novel neural scene representation, Light Field Networks or LFNs, which represent both geometry and appearance of the underlying 3D scene in a 360-degree, four-dimensional light field parameterized via a neural implicit representation. Rendering a ray from an LFN requires only a *single* network evaluation, as opposed to hundreds of evaluations per ray for ray-marching or volumetric based renderers in 3D-structured neural scene representations. In the setting of simple scenes, we leverage meta-learning to learn a prior over LFNs that enables multi-view consistent light field reconstruction from as little as a single image observation. This results in dramatic reductions in time and memory complexity, and enables real-time rendering. The cost of storing a 360-degree light field via an LFN is two orders of magnitude lower than conventional methods such as the Lumigraph. Utilizing the analytical differentiability of neural implicit representations and a novel parameterization of light space, we further demonstrate the extraction of sparse depth maps from LFNs.

* First two authors contributed equally. Project website: https://vsitzmann.github.io/lfns/
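
The cost difference the abstract above describes, one network evaluation per ray versus many, can be illustrated with a small counting experiment. This is a sketch under stated assumptions, not the LFN code: the constant-valued lambdas are hypothetical stand-ins for trained networks, the ray is passed as a plain origin and direction rather than the paper's light-field parameterization, and `CountingField` exists only to tally evaluations.

```python
import math


class CountingField:
    """Wraps a function and counts how often it is evaluated."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def render_by_ray_marching(field, origin, direction, num_samples, step):
    """Volume-render one ray: one field query per sample point."""
    color, transmittance = 0.0, 1.0
    for i in range(num_samples):
        t = (i + 0.5) * step
        point = tuple(o + t * d for o, d in zip(origin, direction))
        density, radiance = field(point)
        alpha = 1.0 - math.exp(-density * step)
        color += transmittance * alpha * radiance
        transmittance *= 1.0 - alpha
    return color


def render_by_light_field(field, origin, direction):
    """Light-field rendering: the field maps the ray itself to a color."""
    return field(origin, direction)


# Toy stand-ins for trained networks (hypothetical, constant outputs).
volume = CountingField(lambda point: (0.5, 1.0))  # returns (density, radiance)
light_field = CountingField(lambda origin, direction: 1.0)

ray = ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
marched = render_by_ray_marching(volume, *ray, num_samples=128, step=0.05)
direct = render_by_light_field(light_field, *ray)
```

After running this, `volume.calls` is 128 and `light_field.calls` is 1, which mirrors the per-ray evaluation counts the abstract contrasts.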

NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown Illumination

Jun 03, 2021

Xiuming Zhang, Pratul P. Srinivasan, Boyang Deng, Paul Debevec, William T. Freeman, Jonathan T. Barron

Abstract:We address the problem of recovering the shape and spatially-varying reflectance of an object from posed multi-view images of the object illuminated by one unknown lighting condition. This enables the rendering of novel views of the object under arbitrary environment lighting and editing of the object's material properties. The key to our approach, which we call Neural Radiance Factorization (NeRFactor), is to distill the volumetric geometry of a Neural Radiance Field (NeRF) [Mildenhall et al. 2020] representation of the object into a surface representation and then jointly refine the geometry while solving for the spatially-varying reflectance and the environment lighting. Specifically, NeRFactor recovers 3D neural fields of surface normals, light visibility, albedo, and Bidirectional Reflectance Distribution Functions (BRDFs) without any supervision, using only a re-rendering loss, simple smoothness priors, and a data-driven BRDF prior learned from real-world BRDF measurements. By explicitly modeling light visibility, NeRFactor is able to separate shadows from albedo and synthesize realistic soft or hard shadows under arbitrary lighting conditions. NeRFactor is able to recover convincing 3D models for free-viewpoint relighting in this challenging and underconstrained capture setup for both synthetic and real scenes. Qualitative and quantitative experiments show that NeRFactor outperforms classic and deep learning-based state of the art across various tasks. Our code and data are available at people.csail.mit.edu/xiuming/projects/nerfactor/.

* Project Page: https://people.csail.mit.edu/xiuming/projects/nerfactor/
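
The abstract above says that modeling light visibility explicitly lets shadows be separated from albedo. A minimal sketch of why: in a shading sum over incoming light directions, a shadow changes the visibility term and leaves the albedo untouched. The code assumes a purely Lambertian surface and a handful of discrete light samples, which is a simplification of NeRFactor's learned BRDFs and neural fields; `shade_lambertian` and its argument layout are hypothetical.

```python
import math


def shade_lambertian(albedo, normal, lights):
    """Sum light contributions at one surface point (Lambertian assumption).

    lights: iterable of (direction, radiance, solid_angle, visibility) with a
    unit-length direction and visibility in [0, 1].
    """
    total = 0.0
    for direction, radiance, solid_angle, visibility in lights:
        cosine = max(0.0, sum(n * d for n, d in zip(normal, direction)))
        total += visibility * radiance * cosine * solid_angle
    return albedo / math.pi * total


up = (0.0, 0.0, 1.0)
# Same surface point and same albedo, once lit and once fully occluded.
lit = shade_lambertian(0.5, up, [(up, math.pi, 1.0, 1.0)])
shadowed = shade_lambertian(0.5, up, [(up, math.pi, 1.0, 0.0)])
```

Because the two results differ only through visibility, a darker pixel can be attributed to occlusion instead of to a darker material.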

Omnimatte: Associating Objects and Their Effects in Video

May 14, 2021

Erika Lu, Forrester Cole, Tali Dekel, Andrew Zisserman, William T. Freeman, Michael Rubinstein

Abstract:Computer vision is increasingly effective at segmenting objects in images and videos; however, scene effects related to the objects (shadows, reflections, generated smoke, etc.) are typically overlooked. Identifying such scene effects and associating them with the objects producing them is important for improving our fundamental understanding of visual scenes, and can also assist a variety of applications such as removing, duplicating, or enhancing objects in video. In this work, we take a step towards solving this novel problem of automatically associating objects with their effects in video. Given an ordinary video and a rough segmentation mask over time of one or more subjects of interest, we estimate an omnimatte for each subject: an alpha matte and color image that includes the subject along with all its related time-varying scene elements. Our model is trained only on the input video in a self-supervised manner, without any manual labels, and is generic: it produces omnimattes automatically for arbitrary objects and a variety of effects. We show results on real-world videos containing interactions between different types of subjects (cars, animals, people) and complex effects, ranging from semi-transparent elements such as smoke and reflections, to fully opaque effects such as objects attached to the subject.

* Accepted to CVPR 2021 Oral. Project webpage: https://omnimatte.github.io/
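
Since the abstract above defines an omnimatte as an alpha matte plus a color image, a frame can in principle be rebuilt by stacking such layers over a background. The sketch below uses the standard back-to-front "over" operator on single-pixel intensities; treating recomposition this way is an assumption for illustration, and `composite_layers` is a hypothetical name, not part of the Omnimatte code.

```python
def composite_layers(background, layers):
    """Back-to-front "over" compositing of (color, alpha) layers for one pixel."""
    result = background
    for color, alpha in layers:
        result = alpha * color + (1.0 - alpha) * result
    return result


background = 0.8
# A semi-transparent shadow (dark color, partial alpha) cast on the background.
shadow_pixel = composite_layers(background, [(0.0, 0.4)])
# Where the layer has zero alpha, the background shows through unchanged.
untouched_pixel = composite_layers(background, [(0.0, 0.0)])
# Removing the subject's layer removes its shadow too: just omit the layer.
removed_pixel = composite_layers(background, [])
```

Keeping a subject and its effects in one layer is what makes edits such as object removal clean: dropping the layer drops the shadow with it.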

LASR: Learning Articulated Shape Reconstruction from a Monocular Video

May 06, 2021

Gengshan Yang, Deqing Sun, Varun Jampani, Daniel Vlasic, Forrester Cole, Huiwen Chang, Deva Ramanan, William T. Freeman, Ce Liu

Abstract:Remarkable progress has been made in 3D reconstruction of rigid structures from a video or a collection of images. However, it is still challenging to reconstruct nonrigid structures from RGB inputs, because the problem is under-constrained. While template-based approaches, such as parametric shape models, have achieved great success in modeling the "closed world" of known object categories, they do not handle the "open world" of novel object categories or outlier shapes well. In this work, we introduce a template-free approach to learn 3D shapes from a single video. It adopts an analysis-by-synthesis strategy that forward-renders object silhouettes, optical flow, and pixel values to compare with video observations, which generates gradients to adjust the camera, shape and motion parameters. Without using a category-specific shape template, our method faithfully reconstructs nonrigid 3D structures from videos of humans, animals, and objects of unknown classes. Code will be available at lasr-google.github.io.

* CVPR 2021. Project page: https://lasr-google.github.io/
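
The analysis-by-synthesis loop described in the abstract above (render from the current parameters, compare with the observations, follow the gradient) can be shown on a toy problem. Everything here is a stand-in: `render` is a two-parameter linear function instead of a differentiable renderer of silhouettes, flow, and pixels, and gradients come from central finite differences instead of backpropagation. All names and the step size are hypothetical.

```python
GRID = [0.0, 0.5, 1.0]


def render(params):
    """Toy 'renderer': maps (scale, shift) to three predicted observations."""
    scale, shift = params
    return [scale * x + shift for x in GRID]


def loss(params, observed):
    """Sum of squared differences between the rendering and the observations."""
    return sum((r - o) ** 2 for r, o in zip(render(params), observed))


def fit(observed, params, learning_rate=0.1, steps=500, eps=1e-4):
    """Gradient descent using central finite differences for the gradient."""
    params = list(params)
    for _ in range(steps):
        grad = []
        for i in range(len(params)):
            plus, minus = list(params), list(params)
            plus[i] += eps
            minus[i] -= eps
            grad.append((loss(plus, observed) - loss(minus, observed)) / (2 * eps))
        params = [p - learning_rate * g for p, g in zip(params, grad)]
    return params


observed = render([2.0, -1.0])  # synthetic stand-in for video observations
estimate = fit(observed, [0.0, 0.0])
```

Starting from zeros, the estimate moves toward the parameters that generated the observations, which is the same feedback structure the abstract describes at a much larger scale.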

AutoFlow: Learning a Better Training Set for Optical Flow

Apr 29, 2021

Deqing Sun, Daniel Vlasic, Charles Herrmann, Varun Jampani, Michael Krainin, Huiwen Chang, Ramin Zabih, William T. Freeman, Ce Liu

Abstract:Synthetic datasets play a critical role in pre-training CNN models for optical flow, but they are painstaking to generate and hard to adapt to new applications. To automate the process, we present AutoFlow, a simple and effective method to render training data for optical flow that optimizes the performance of a model on a target dataset. AutoFlow takes a layered approach to render synthetic data, where the motion, shape, and appearance of each layer are controlled by learnable hyperparameters. Experimental results show that AutoFlow achieves state-of-the-art accuracy in pre-training both PWC-Net and RAFT. Our code and data are available at https://autoflow-google.github.io.

* CVPR 2021
