Andrea Vedaldi

Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics

Aug 08, 2024

Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi

Abstract: We present Puppet-Master, an interactive video generative model that can serve as a motion prior for part-level dynamics. At test time, given a single image and a sparse set of motion trajectories (i.e., drags), Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given drag interactions. This is achieved by fine-tuning a large-scale pre-trained video diffusion model, for which we propose a new conditioning architecture to inject the dragging control effectively. More importantly, we introduce the all-to-first attention mechanism, a drop-in replacement for the widely adopted spatial attention modules, which significantly improves generation quality by addressing the appearance and background issues in existing models. Unlike other motion-conditioned video generators that are trained on in-the-wild videos and mostly move an entire object, Puppet-Master is learned from Objaverse-Animation-HQ, a new dataset of curated part-level motion clips. We propose a strategy to automatically filter out sub-optimal animations and augment the synthetic renderings with meaningful motion trajectories. Puppet-Master generalizes well to real images across various categories and outperforms existing methods in a zero-shot manner on a real-world benchmark. See our project page for more results: vgg-puppetmaster.github.io.

* Project page: https://vgg-puppetmaster.github.io/
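The abstract names all-to-first attention without spelling out its mechanics. A minimal sketch of one plausible reading follows, assuming that the queries of every frame attend to the keys and values of the first frame only. The function names, the list-of-lists layout (frames, then tokens, then channels), and the omission of learned projections are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def all_to_first_attention(frames_q, frames_k, frames_v):
    """Assumed reading: each frame's queries see only frame 0's keys/values."""
    first_k, first_v = frames_k[0], frames_v[0]
    return [attend(q, first_k, first_v) for q in frames_q]


# Toy input: 2 frames, with 1 query token and 2 key/value tokens per frame.
frames_q = [[[1.0, 0.0]], [[0.0, 1.0]]]
frames_k = [[[1.0, 0.0], [0.0, 1.0]], [[5.0, 5.0], [5.0, 5.0]]]
frames_v = [[[1.0, 0.0], [0.0, 1.0]], [[9.0, 9.0], [9.0, 9.0]]]
out = all_to_first_attention(frames_q, frames_k, frames_v)
```

Under this reading, the keys and values of later frames never influence the output, so the second frame's result is a blend of first-frame values only.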

SHIC: Shape-Image Correspondences with no Keypoint Supervision

Jul 26, 2024

Aleksandar Shtedritski, Christian Rupprecht, Andrea Vedaldi

Abstract: Canonical surface mapping generalizes keypoint detection by assigning each pixel of an object to a corresponding point in a 3D template. The concept was popularised by DensePose for the analysis of humans, and authors have since attempted to apply it to more categories, but with limited success due to the high cost of manual supervision. In this work, we introduce SHIC, a method to learn canonical maps without manual supervision which achieves better results than supervised methods for most categories. Our idea is to leverage foundation computer vision models such as DINO and Stable Diffusion that are open-ended and thus possess excellent priors over natural categories. SHIC reduces the problem of estimating image-to-template correspondences to predicting image-to-image correspondences using features from the foundation models. The reduction works by matching images of the object to non-photorealistic renders of the template, which emulates the process of collecting manual annotations for this task. These correspondences are then used to supervise high-quality canonical maps for any object of interest. We also show that image generators can further improve the realism of the template views, which provide an additional source of supervision for the model.

* ECCV 2024. Project website https://www.robots.ox.ac.uk/~vgg/research/shic/
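The image-to-image matching step described in the abstract can be illustrated with a small sketch. It assumes that correspondences are found by nearest-neighbour search under cosine similarity between per-pixel descriptors; the abstract does not state the matching rule, and the toy 2D vectors below merely stand in for features from models such as DINO or Stable Diffusion. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def match_features(image_feats, render_feats):
    """For each image descriptor, return the index of the most similar
    descriptor among those extracted from a render of the template."""
    return [
        max(range(len(render_feats)), key=lambda j: cosine(f, render_feats[j]))
        for f in image_feats
    ]


# Toy descriptors standing in for foundation-model features.
image_feats = [(1.0, 0.1), (0.0, 2.0)]
render_feats = [(0.0, 1.0), (3.0, 0.0)]
matches = match_features(image_feats, render_feats)
print(matches)  # [1, 0]
```

Because each rendered pixel has a known location on the template, a match of this kind could be turned into an image-to-template correspondence.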

Meta 3D AssetGen: Text-to-Mesh Generation with High-Quality Geometry, Texture, and PBR Materials

Jul 02, 2024

Yawar Siddiqui, Tom Monnier, Filippos Kokkinos, Mahendra Kariya, Yanir Kleiman, Emilien Garreau, Oran Gafni, Natalia Neverova, Andrea Vedaldi, Roman Shapovalov (+1 more)

Abstract: We present Meta 3D AssetGen (AssetGen), a significant advancement in text-to-3D generation which produces faithful, high-quality meshes with texture and material control. Compared to works that bake shading into the 3D object's appearance, AssetGen outputs physically-based rendering (PBR) materials, supporting realistic relighting. AssetGen first generates several views of the object with factored shaded and albedo appearance channels, and then reconstructs colours, metalness and roughness in 3D, using a deferred shading loss for efficient supervision. It also uses a signed distance function to represent 3D shape more reliably and introduces a corresponding loss for direct shape supervision. This is implemented using fused kernels for high memory efficiency. After mesh extraction, a texture refinement transformer operating in UV space significantly improves sharpness and details. AssetGen achieves 17% improvement in Chamfer Distance and 40% in LPIPS over the best concurrent work for few-view reconstruction, and a human preference of 72% over the best industry competitors of comparable speed, including those that support PBR. Project page with generated assets: https://assetgen.github.io

* Project Page: https://assetgen.github.io
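For readers unfamiliar with the Chamfer Distance metric quoted in the abstract, here is a minimal sketch of one common variant: the mean nearest-neighbour Euclidean distance, summed over both directions. Conventions differ (squared distances, sums instead of means), and the abstract does not say which one AssetGen reports, so the exact form below is an assumption.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed convention: mean nearest-neighbour Euclidean distance,
    added over both directions.
    """
    def mean_nearest(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


# Two points, and the same two points shifted by 0.5 along z.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5)]
print(chamfer_distance(original, shifted))  # 1.0
```

This brute-force version costs O(N·M) distance evaluations; evaluation code for dense point clouds would normally use a spatial index instead.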

Meta 3D Gen

Jul 02, 2024

Raphael Bensadoun, Tom Monnier, Yanir Kleiman, Filippos Kokkinos, Yawar Siddiqui, Mahendra Kariya, Omri Harosh, Roman Shapovalov, Benjamin Graham, Emilien Garreau (+10 more)

Abstract: We introduce Meta 3D Gen (3DGen), a new state-of-the-art, fast pipeline for text-to-3D asset generation. 3DGen offers 3D asset creation with high prompt fidelity and high-quality 3D shapes and textures in under a minute. It supports physically-based rendering (PBR), necessary for 3D asset relighting in real-world applications. Additionally, 3DGen supports generative retexturing of previously generated (or artist-created) 3D shapes using additional textual inputs provided by the user. 3DGen integrates key technical components, Meta 3D AssetGen and Meta 3D TextureGen, that we developed for text-to-3D and text-to-texture generation, respectively. By combining their strengths, 3DGen represents 3D objects simultaneously in three ways: in view space, in volumetric space, and in UV (or texture) space. The integration of these two techniques achieves a win rate of 68% with respect to the single-stage model. We compare 3DGen to numerous industry baselines, and show that it outperforms them in terms of prompt fidelity and visual quality for complex textual prompts, while being significantly faster.

Meta 3D TextureGen: Fast and Consistent Texture Generation for 3D Objects

Jul 02, 2024

Raphael Bensadoun, Yanir Kleiman, Idan Azuri, Omri Harosh, Andrea Vedaldi, Natalia Neverova, Oran Gafni

Abstract: The recent availability and adaptability of text-to-image models has sparked a new era in many related domains that benefit from the learned text priors as well as high-quality and fast generation capabilities, one of which is texture generation for 3D objects. Although recent texture generation methods achieve impressive results by using text-to-image networks, the combination of global consistency, quality, and speed, which is crucial for advancing texture generation to real-world applications, remains elusive. To that end, we introduce Meta 3D TextureGen: a new feedforward method composed of two sequential networks aimed at generating high-quality and globally consistent textures for arbitrary geometries of any degree of complexity in less than 20 seconds. Our method achieves state-of-the-art results in quality and speed by conditioning a text-to-image model on 3D semantics in 2D space and fusing them into a complete and high-resolution UV texture map, as demonstrated by extensive qualitative and quantitative evaluations. In addition, we introduce a texture enhancement network that is capable of up-scaling any texture by an arbitrary ratio, producing 4k pixel resolution textures.

Flash3D: Feed-Forward Generalisable 3D Scene Reconstruction from a Single Image

Jun 06, 2024

Stanislaw Szymanowicz, Eldar Insafutdinov, Chuanxia Zheng, Dylan Campbell, João F. Henriques, Christian Rupprecht, Andrea Vedaldi

Abstract: In this paper, we propose Flash3D, a method for scene reconstruction and novel view synthesis from a single image which is both very generalisable and efficient. For generalisability, we start from a "foundation" model for monocular depth estimation and extend it to a full 3D shape and appearance reconstructor. For efficiency, we base this extension on feed-forward Gaussian Splatting. Specifically, we predict a first layer of 3D Gaussians at the predicted depth, and then add additional layers of Gaussians that are offset in space, allowing the model to complete the reconstruction behind occlusions and truncations. Flash3D is very efficient, trainable on a single GPU in a day, and thus accessible to most researchers. It achieves state-of-the-art results when trained and tested on RealEstate10k. When transferred to unseen datasets like NYU it outperforms competitors by a large margin. More impressively, when transferred to KITTI, Flash3D achieves better PSNR than methods trained specifically on that dataset. In some instances, it even outperforms recent methods that use multiple views as input. Code, models, demo, and more results are available at https://www.robots.ox.ac.uk/~vgg/research/flash3d/.

* Project page: https://www.robots.ox.ac.uk/~vgg/research/flash3d/
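The PSNR figure mentioned in the abstract follows the standard definition, 10·log10(MAX² / MSE). A minimal sketch is given below, assuming images are flattened to lists of intensities in [0, 1]; the function name and input layout are illustrative and not taken from the Flash3D code.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 0.1, so the MSE is about 0.01 and the PSNR about 20 dB.
reference = [0.0, 0.5, 1.0, 0.5]
estimate = [0.1, 0.6, 0.9, 0.4]
score = psnr(reference, estimate)
```

Higher values mean the rendered view is closer to the ground-truth image.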

Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting

Apr 30, 2024

Paul Engstler, Andrea Vedaldi, Iro Laina, Christian Rupprecht

Abstract: 3D scene generation has quickly become a challenging new research direction, fueled by consistent improvements of 2D generative diffusion models. Most prior work in this area generates scenes by iteratively stitching newly generated frames with existing geometry. These works often depend on pre-trained monocular depth estimators to lift the generated images into 3D, fusing them with the existing scene representation. These approaches are then often evaluated via a text metric, measuring the similarity between the generated images and a given text prompt. In this work, we make two fundamental contributions to the field of 3D scene generation. First, we note that lifting images to 3D with a monocular depth estimation model is suboptimal as it ignores the geometry of the existing scene. We thus introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process, resulting in improved geometric coherence of the scene. Second, we introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry, and thus measures the quality of the structure of the scene.

* Project page: https://research.paulengstler.com/invisible-stitch/

Lightplane: Highly-Scalable Components for Neural 3D Fields

Apr 30, 2024

Ang Cao, Justin Johnson, Andrea Vedaldi, David Novotny

Abstract: Contemporary 3D research, particularly in reconstruction and generation, heavily relies on 2D images for inputs or supervision. However, current designs for these 2D-3D mappings are memory-intensive, posing a significant bottleneck for existing methods and hindering new applications. In response, we propose a pair of highly scalable components for 3D neural fields: Lightplane Render and Splatter, which significantly reduce memory usage in 2D-3D mapping. These innovations enable the processing of vastly more and higher resolution images with small memory and computational costs. We demonstrate their utility in various applications, from benefiting single-scene optimization with image-level losses to realizing a versatile pipeline for dramatically scaling 3D reconstruction and generation. Code: https://github.com/facebookresearch/lightplane.

* Project Page: https://lightplane.github.io/ Code: https://github.com/facebookresearch/lightplane

DGE: Direct Gaussian 3D Editing by Consistent Multi-view Editing

Apr 29, 2024

Minghao Chen, Iro Laina, Andrea Vedaldi

Abstract: We consider the problem of editing 3D objects and scenes based on open-ended language instructions. The established paradigm to solve this problem is to use a 2D image generator or editor to guide the 3D editing process. However, this is often slow as it requires updating a computationally expensive 3D representation, such as a neural radiance field, and doing so using contradictory guidance from a 2D model which is inherently not multi-view consistent. We thus introduce the Direct Gaussian Editor (DGE), a method that addresses these issues in two ways. First, we modify a given high-quality image editor like InstructPix2Pix to be multi-view consistent. We do so by utilizing a training-free approach which integrates cues from the underlying 3D geometry of the scene. Second, given a multi-view consistent edited sequence of images of the object, we directly and efficiently optimize the 3D object representation, which is based on 3D Gaussian Splatting. Because it does not need to apply edits incrementally and iteratively, DGE is significantly more efficient than existing approaches, and comes with other perks such as allowing selective editing of parts of the scene.

* Project Page: https://silent-chen.github.io/DGE/

DragAPart: Learning a Part-Level Motion Prior for Articulated Objects

Mar 22, 2024

Ruining Li, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi

Abstract: We introduce DragAPart, a method that, given an image and a set of drags as input, can generate a new image of the same object in a new state, compatible with the action of the drags. Unlike prior works that focused on repositioning objects, DragAPart predicts part-level interactions, such as opening and closing a drawer. We study this problem as a proxy for learning a generalist motion model, not restricted to a specific kinematic structure or object category. To this end, we start from a pre-trained image generator and fine-tune it on a new synthetic dataset, Drag-a-Move, which we introduce. Combined with a new encoding for the drags and dataset randomization, the new model generalizes well to real images and different categories. Compared to prior motion-controlled generators, we demonstrate much better part-level motion understanding.

* Project page: https://dragapart.github.io/
