Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ayush Tewari

Max Planck Institute for Informatics, Saarland Informatics Campus

Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold

May 18, 2023

Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, Christian Theobalt

Figure 1 for Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold

Figure 2 for Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold

Figure 3 for Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold

Figure 4 for Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold

Abstract:Synthesizing visual content that meets users' needs often requires flexible and precise controllability of the pose, shape, expression, and layout of the generated objects. Existing approaches gain controllability of generative adversarial networks (GANs) via manually annotated training data or a prior 3D model, which often lack flexibility, precision, and generality. In this work, we study a powerful yet much less explored way of controlling GANs, that is, to "drag" any points of the image to precisely reach target points in a user-interactive manner, as shown in Fig.1. To achieve this, we propose DragGAN, which consists of two main components: 1) a feature-based motion supervision that drives the handle point to move towards the target position, and 2) a new point tracking approach that leverages the discriminative generator features to keep localizing the position of the handle points. Through DragGAN, anyone can deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression, and layout of diverse categories such as animals, cars, humans, landscapes, etc. As these manipulations are performed on the learned generative image manifold of a GAN, they tend to produce realistic outputs even for challenging scenarios such as hallucinating occluded content and deforming shapes that consistently follow the object's rigidity. Both qualitative and quantitative comparisons demonstrate the advantage of DragGAN over prior approaches in the tasks of image manipulation and point tracking. We also showcase the manipulation of real images through GAN inversion.

* Accepted to SIGGRAPH 2023. Project page: https://vcai.mpi-inf.mpg.de/projects/DragGAN/

Via

Access Paper or Ask Questions

Learning to Render Novel Views from Wide-Baseline Stereo Pairs

Apr 17, 2023

Yilun Du, Cameron Smith, Ayush Tewari, Vincent Sitzmann

Figure 1 for Learning to Render Novel Views from Wide-Baseline Stereo Pairs

Figure 2 for Learning to Render Novel Views from Wide-Baseline Stereo Pairs

Figure 3 for Learning to Render Novel Views from Wide-Baseline Stereo Pairs

Figure 4 for Learning to Render Novel Views from Wide-Baseline Stereo Pairs

Abstract:We introduce a method for novel view synthesis given only a single wide-baseline stereo image pair. In this challenging regime, 3D scene points are regularly observed only once, requiring prior-based reconstruction of scene geometry and appearance. We find that existing approaches to novel view synthesis from sparse observations fail due to recovering incorrect 3D geometry and due to the high cost of differentiable rendering that precludes their scaling to large-scale training. We take a step towards resolving these shortcomings by formulating a multi-view transformer encoder, proposing an efficient, image-space epipolar line sampling scheme to assemble image features for a target ray, and a lightweight cross-attention-based renderer. Our contributions enable training of our method on a large-scale real-world dataset of indoor and outdoor scenes. We demonstrate that our method learns powerful multi-view geometry priors while reducing the rendering time. We conduct extensive comparisons on held-out test scenes across two real-world datasets, significantly outperforming prior work on novel view synthesis from sparse image observations and achieving multi-view-consistent novel view synthesis.

* CVPR 2023, Project Webpage: https://yilundu.github.io/wide_baseline/, Last Two Authors Equal Advising

Via

Access Paper or Ask Questions

ConceptFusion: Open-set Multimodal 3D Mapping

Feb 15, 2023

Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari(+6 more)

Abstract:Building 3D maps of the environment is central to robot navigation, planning, and interaction with objects in a scene. Most existing approaches that integrate semantic concepts with 3D maps largely remain confined to the closed-set setting: they can only reason about a finite set of concepts, pre-defined at training time. Further, these maps can only be queried using class labels, or in recent work, using text prompts. We address both these issues with ConceptFusion, a scene representation that is (1) fundamentally open-set, enabling reasoning beyond a closed set of concepts and (ii) inherently multimodal, enabling a diverse range of possible queries to the 3D map, from language, to images, to audio, to 3D geometry, all working in concert. ConceptFusion leverages the open-set capabilities of today's foundation models pre-trained on internet-scale data to reason about concepts across modalities such as natural language, images, and audio. We demonstrate that pixel-aligned open-set features can be fused into 3D maps via traditional SLAM and multi-view fusion approaches. This enables effective zero-shot spatial reasoning, not needing any additional training or finetuning, and retains long-tailed concepts better than supervised approaches, outperforming them by more than 40% margin on 3D IoU. We extensively evaluate ConceptFusion on a number of real-world datasets, simulated home environments, a real-world tabletop manipulation task, and an autonomous driving platform. We showcase new avenues for blending foundation models with 3D open-set multimodal mapping. For more information, visit our project page https://concept-fusion.github.io or watch our 5-minute explainer video https://www.youtube.com/watch?v=rkXgws8fiDs

Via

Access Paper or Ask Questions

gCoRF: Generative Compositional Radiance Fields

Oct 31, 2022

Mallikarjun BR, Ayush Tewari, Xingang Pan, Mohamed Elgharib, Christian Theobalt

Abstract:3D generative models of objects enable photorealistic image synthesis with 3D control. Existing methods model the scene as a global scene representation, ignoring the compositional aspect of the scene. Compositional reasoning can enable a wide variety of editing applications, in addition to enabling generalizable 3D reasoning. In this paper, we present a compositional generative model, where each semantic part of the object is represented as an independent 3D representation learned from only in-the-wild 2D data. We start with a global generative model (GAN) and learn to decompose it into different semantic parts using supervision from 2D segmentation masks. We then learn to composite independently sampled parts in order to create coherent global scenes. Different parts can be independently sampled while keeping the rest of the object fixed. We evaluate our method on a wide variety of objects and parts and demonstrate editing applications.

* https://vcai.mpi-inf.mpg.de/projects/gCoRF/

Via

Access Paper or Ask Questions

Neural Radiance Transfer Fields for Relightable Novel-view Synthesis with Global Illumination

Jul 27, 2022

Linjie Lyu, Ayush Tewari, Thomas Leimkuehler, Marc Habermann, Christian Theobalt

Figure 1 for Neural Radiance Transfer Fields for Relightable Novel-view Synthesis with Global Illumination

Abstract:Given a set of images of a scene, the re-rendering of this scene from novel views and lighting conditions is an important and challenging problem in Computer Vision and Graphics. On the one hand, most existing works in Computer Vision usually impose many assumptions regarding the image formation process, e.g. direct illumination and predefined materials, to make scene parameter estimation tractable. On the other hand, mature Computer Graphics tools allow modeling of complex photo-realistic light transport given all the scene parameters. Combining these approaches, we propose a method for scene relighting under novel views by learning a neural precomputed radiance transfer function, which implicitly handles global illumination effects using novel environment maps. Our method can be solely supervised on a set of real images of the scene under a single unknown lighting condition. To disambiguate the task during training, we tightly integrate a differentiable path tracer in the training process and propose a combination of a synthesized OLAT and a real image loss. Results show that the recovered disentanglement of scene parameters improves significantly over the current state of the art and, thus, also our re-rendering results are more realistic and accurate.

Via

Access Paper or Ask Questions

Seeing 3D Objects in a Single Image via Self-Supervised Static-Dynamic Disentanglement

Jul 22, 2022

Prafull Sharma, Ayush Tewari, Yilun Du, Sergey Zakharov, Rares Ambrus, Adrien Gaidon, William T. Freeman, Fredo Durand, Joshua B. Tenenbaum, Vincent Sitzmann

Figure 1 for Seeing 3D Objects in a Single Image via Self-Supervised Static-Dynamic Disentanglement

Figure 2 for Seeing 3D Objects in a Single Image via Self-Supervised Static-Dynamic Disentanglement

Figure 3 for Seeing 3D Objects in a Single Image via Self-Supervised Static-Dynamic Disentanglement

Figure 4 for Seeing 3D Objects in a Single Image via Self-Supervised Static-Dynamic Disentanglement

Abstract:Human perception reliably identifies movable and immovable parts of 3D scenes, and completes the 3D structure of objects and background from incomplete observations. We learn this skill not via labeled examples, but simply by observing objects move. In this work, we propose an approach that observes unlabeled multi-view videos at training time and learns to map a single image observation of a complex scene, such as a street with cars, to a 3D neural scene representation that is disentangled into movable and immovable parts while plausibly completing its 3D structure. We separately parameterize movable and immovable scene parts via 2D neural ground plans. These ground plans are 2D grids of features aligned with the ground plane that can be locally decoded into 3D neural radiance fields. Our model is trained self-supervised via neural rendering. We demonstrate that the structure inherent to our disentangled 3D representation enables a variety of downstream tasks in street-scale 3D scenes using simple heuristics, such as extraction of object-centric 3D representations, novel view synthesis, instance segmentation, and 3D bounding box prediction, highlighting its value as a backbone for data-efficient 3D scene understanding models. This disentanglement further enables scene editing via object manipulation such as deletion, insertion, and rigid-body motion.

* Project page: https://prafullsharma.net/see3d/

Via

Access Paper or Ask Questions

GAN2X: Non-Lambertian Inverse Rendering of Image GANs

Jun 18, 2022

Xingang Pan, Ayush Tewari, Lingjie Liu, Christian Theobalt

Figure 1 for GAN2X: Non-Lambertian Inverse Rendering of Image GANs

Figure 2 for GAN2X: Non-Lambertian Inverse Rendering of Image GANs

Figure 3 for GAN2X: Non-Lambertian Inverse Rendering of Image GANs

Figure 4 for GAN2X: Non-Lambertian Inverse Rendering of Image GANs

Abstract:2D images are observations of the 3D physical world depicted with the geometry, material, and illumination components. Recovering these underlying intrinsic components from 2D images, also known as inverse rendering, usually requires a supervised setting with paired images collected from multiple viewpoints and lighting conditions, which is resource-demanding. In this work, we present GAN2X, a new method for unsupervised inverse rendering that only uses unpaired images for training. Unlike previous Shape-from-GAN approaches that mainly focus on 3D shapes, we take the first attempt to also recover non-Lambertian material properties by exploiting the pseudo paired data generated by a GAN. To achieve precise inverse rendering, we devise a specularity-aware neural surface representation that continuously models the geometry and material properties. A shading-based refinement technique is adopted to further distill information in the target image and recover more fine details. Experiments demonstrate that GAN2X can accurately decompose 2D images to 3D shape, albedo, and specular properties for different object categories, and achieves the state-of-the-art performance for unsupervised single-view 3D face reconstruction. We also show its applications in downstream tasks including real image editing and lifting 2D GANs to decomposed 3D GANs.

* The video demo is available at the project page: https://people.mpi-inf.mpg.de/~xpan/GAN2X/

Via

Access Paper or Ask Questions

HDSDF: Hybrid Directional and Signed Distance Functions for Fast Inverse Rendering

Mar 30, 2022

Tarun Yenamandra, Ayush Tewari, Nan Yang, Florian Bernard, Christian Theobalt, Daniel Cremers

Figure 1 for HDSDF: Hybrid Directional and Signed Distance Functions for Fast Inverse Rendering

Figure 2 for HDSDF: Hybrid Directional and Signed Distance Functions for Fast Inverse Rendering

Figure 3 for HDSDF: Hybrid Directional and Signed Distance Functions for Fast Inverse Rendering

Figure 4 for HDSDF: Hybrid Directional and Signed Distance Functions for Fast Inverse Rendering

Abstract:Implicit neural representations of 3D shapes form strong priors that are useful for various applications, such as single and multiple view 3D reconstruction. A downside of existing neural representations is that they require multiple network evaluations for rendering, which leads to high computational costs. This limitation forms a bottleneck particularly in the context of inverse problems, such as image-based 3D reconstruction. To address this issue, in this paper (i) we propose a novel hybrid 3D object representation based on a signed distance function (SDF) that we augment with a directional distance function (DDF), so that we can predict distances to the object surface from any point on a sphere enclosing the object. Moreover, (ii) using the proposed hybrid representation we address the multi-view consistency problem common in existing DDF representations. We evaluate our novel hybrid representation on the task of single-view depth reconstruction and show that our method is several times faster compared to competing methods, while at the same time achieving better reconstruction accuracy.

Via

Access Paper or Ask Questions

Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images

Mar 29, 2022

Ayush Tewari, Mallikarjun B R, Xingang Pan, Ohad Fried, Maneesh Agrawala, Christian Theobalt

Figure 1 for Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images

Figure 2 for Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images

Figure 3 for Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images

Figure 4 for Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images

Abstract:Learning 3D generative models from a dataset of monocular images enables self-supervised 3D reasoning and controllable synthesis. State-of-the-art 3D generative models are GANs which use neural 3D volumetric representations for synthesis. Images are synthesized by rendering the volumes from a given camera. These models can disentangle the 3D scene from the camera viewpoint in any generated image. However, most models do not disentangle other factors of image formation, such as geometry and appearance. In this paper, we design a 3D GAN which can learn a disentangled model of objects, just from monocular observations. Our model can disentangle the geometry and appearance variations in the scene, i.e., we can independently sample from the geometry and appearance spaces of the generative model. This is achieved using a novel non-rigid deformable scene formulation. A 3D volume which represents an object instance is computed as a non-rigidly deformed canonical 3D volume. Our method learns the canonical volume, as well as its deformations, jointly during training. This formulation also helps us improve the disentanglement between the 3D scene and the camera viewpoints using a novel pose regularization loss defined on the 3D deformation field. In addition, we further model the inverse deformations, enabling the computation of dense correspondences between images generated by our model. Finally, we design an approach to embed real images into the latent space of our disentangled generative model, enabling editing of real images.

* CVPR 2022

Via

Access Paper or Ask Questions

Advances in Neural Rendering

Nov 10, 2021

Ayush Tewari, Justus Thies, Ben Mildenhall, Pratul Srinivasan, Edgar Tretschk, Yifan Wang, Christoph Lassner, Vincent Sitzmann, Ricardo Martin-Brualla, Stephen Lombardi(+7 more)

Abstract:Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take specifically defined representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanied textures (e.g., created by an artist), point clouds (e.g., from a depth sensor), volumetric grids (e.g., from a CT scan), or implicit surface functions (e.g., truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related, and combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward towards the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, often now referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects...

* 29 pages, 14 figures, 5 tables

Via

Access Paper or Ask Questions