Andrea Vedaldi

Novel-View Acoustic Synthesis

Jan 23, 2023

Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi

Abstract: We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach, the Visually-Guided Acoustic Synthesis (ViGAS) network, which learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benchmark this task, we collect two first-of-their-kind large-scale multi-view audio-visual datasets, one synthetic and one real. We show that our model successfully reasons about the spatial cues and synthesizes faithful audio on both datasets. To our knowledge, this work represents the very first formulation, dataset, and approach to solve the novel-view acoustic synthesis task, which has exciting potential applications ranging from AR/VR to art and design. Unlocked by this work, we believe that the future of novel-view synthesis is in multi-modal learning from videos.

* Project page: https://vision.cs.utexas.edu/projects/nvas

Self-Supervised Correspondence Estimation via Multiview Registration

Dec 06, 2022

Mohamed El Banani, Ignacio Rocco, David Novotny, Andrea Vedaldi, Natalia Neverova, Justin Johnson, Benjamin Graham

Abstract: Video provides us with the spatio-temporal consistency needed for visual learning. Recent approaches have utilized this signal to learn correspondence estimation from close-by frame pairs. However, by relying only on close-by frame pairs, those approaches miss out on the richer long-range consistency between distant overlapping frames. To address this, we propose a self-supervised approach for correspondence estimation that learns from multiview consistency in short RGB-D video sequences. Our approach combines pairwise correspondence estimation and registration with a novel SE(3) transformation synchronization algorithm. Our key insight is that self-supervised multiview registration allows us to obtain correspondences over longer time frames, increasing both the diversity and difficulty of sampled pairs. We evaluate our approach on indoor scenes for correspondence estimation and RGB-D pointcloud registration and find that we perform on par with supervised approaches.

* Accepted to WACV 2023. Project page: https://mbanani.github.io/syncmatch/

MagicPony: Learning Articulated 3D Animals in the Wild

Nov 22, 2022

Shangzhe Wu, Ruining Li, Tomas Jakab, Christian Rupprecht, Andrea Vedaldi

Abstract: We consider the problem of learning a function that can estimate the 3D shape, articulation, viewpoint, texture, and lighting of an articulated animal like a horse, given a single test image. We present a new method, dubbed MagicPony, that learns this function purely from in-the-wild single-view images of the object category, with minimal assumptions about the topology of deformation. At its core is an implicit-explicit representation of articulated shape and appearance, combining the strengths of neural fields and meshes. In order to help the model understand an object's shape and pose, we distil the knowledge captured by an off-the-shelf self-supervised vision transformer and fuse it into the 3D model. To overcome common local optima in viewpoint estimation, we further introduce a new viewpoint sampling scheme that comes at no added training cost. Compared to prior works, we show significant quantitative and qualitative improvements on this challenging task. The model also demonstrates excellent generalisation in reconstructing abstract drawings and artefacts, despite the fact that it is only trained on real images.

* Project Page: https://3dmagicpony.github.io/

Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories

Nov 07, 2022

Samarth Sinha, Roman Shapovalov, Jeremy Reizenstein, Ignacio Rocco, Natalia Neverova, Andrea Vedaldi, David Novotny

Abstract: Obtaining photorealistic reconstructions of objects from sparse views is inherently ambiguous and can only be achieved by learning suitable reconstruction priors. Earlier works on sparse rigid object reconstruction successfully learned such priors from large datasets such as CO3D. In this paper, we extend this approach to dynamic objects. We use cats and dogs as a representative example and introduce Common Pets in 3D (CoP3D), a collection of crowd-sourced videos showing around 4,200 distinct pets. CoP3D is one of the first large-scale datasets for benchmarking non-rigid 3D reconstruction "in the wild". We also propose Tracker-NeRF, a method for learning 4D reconstruction from our dataset. At test time, given a small number of video frames of an unseen object, Tracker-NeRF predicts the trajectories of its 3D points and generates new views, interpolating viewpoint and time. Results on CoP3D reveal significantly better non-rigid new-view synthesis performance than existing baselines.

Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations

Sep 07, 2022

Vadim Tschernezki, Iro Laina, Diane Larlus, Andrea Vedaldi

Abstract: We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image feature extractors when the latter are applied to the analysis of multiple images reconstructible as a 3D scene. Given an image feature extractor, for example pre-trained using self-supervision, N3F uses it as a teacher to learn a student network defined in 3D space. The 3D student network is similar to a neural radiance field that distills said features and can be trained with the usual differentiable rendering machinery. As a consequence, N3F is readily applicable to most neural rendering formulations, including vanilla NeRF and its extensions to complex dynamic scenes. We show that our method not only enables semantic understanding in the context of scene-specific neural fields without the use of manual labels, but also consistently improves over the self-supervised 2D baselines. This is demonstrated by considering various tasks, such as 2D object retrieval, 3D segmentation, and scene editing, in diverse sequences, including long egocentric videos in the EPIC-KITCHENS benchmark.

* 3DV2022, Oral. Project page: https://www.robots.ox.ac.uk/~vadim/n3f/

Measuring the Interpretability of Unsupervised Representations via Quantized Reverse Probing

Sep 07, 2022

Iro Laina, Yuki M. Asano, Andrea Vedaldi

Abstract: Self-supervised visual representation learning has recently attracted significant research interest. While a common way to evaluate self-supervised representations is through transfer to various downstream tasks, we instead investigate the problem of measuring their interpretability, i.e. understanding the semantics encoded in raw representations. We formulate the latter as estimating the mutual information between the representation and a space of manually labelled concepts. To quantify this we introduce a decoding bottleneck: information must be captured by simple predictors, mapping concepts to clusters in representation space. This approach, which we call reverse linear probing, provides a single number sensitive to the semanticity of the representation. This measure is also able to detect when the representation contains combinations of concepts (e.g., "red apple") instead of just individual attributes ("red" and "apple" independently). Finally, we propose to use supervised classifiers to automatically label large datasets in order to enrich the space of concepts used for probing. We use our method to evaluate a large number of self-supervised representations, ranking them by interpretability, highlight the differences that emerge compared to the standard evaluation with linear probes, and discuss several qualitative insights. Code: https://github.com/iro-cp/ssl-qrp

* Published at ICLR 2022. Appendix included, 26 pages
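
The abstract above frames interpretability as mutual information between a quantized representation and labelled concepts. As a rough illustration only, the sketch below computes a plain plug-in (empirical) mutual information between two discrete labelings. It is not the paper's estimator: the function name `mutual_information`, the toy cluster ids, and the toy concept labels are all assumptions made for this example, and the step that would quantize a real representation into clusters (for instance with k-means) is left out.

```python
import numpy as np

def mutual_information(clusters, concepts):
    """Plug-in estimate of mutual information (in nats) between two discrete labelings."""
    clusters = np.asarray(clusters)
    concepts = np.asarray(concepts)
    # Map arbitrary labels to consecutive integer indices.
    _, c_idx = np.unique(clusters, return_inverse=True)
    _, y_idx = np.unique(concepts, return_inverse=True)
    # Joint histogram of (cluster, concept) pairs, normalised to a probability table.
    joint = np.zeros((c_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (c_idx, y_idx), 1.0)
    joint /= joint.sum()
    # Marginals, kept 2-D so their product has the same shape as the joint.
    p_cluster = joint.sum(axis=1, keepdims=True)
    p_concept = joint.sum(axis=0, keepdims=True)
    independent = p_cluster @ p_concept
    # Sum only over non-empty cells to avoid log(0).
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / independent[nz])))

# Hypothetical toy data: cluster ids that align perfectly with the concepts,
# and cluster ids that carry no information about them.
aligned = mutual_information([0, 0, 1, 1], ["red", "red", "apple", "apple"])
unrelated = mutual_information([0, 1, 0, 1], ["red", "red", "apple", "apple"])
```

In this toy setting the aligned labeling reaches the maximum possible value for two balanced classes (log 2 nats), while the unrelated labeling gives zero, which is the kind of single-number contrast the abstract describes.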

SNeS: Learning Probably Symmetric Neural Surfaces from Incomplete Data

Jun 13, 2022

Eldar Insafutdinov, Dylan Campbell, João F. Henriques, Andrea Vedaldi

Abstract: We present a method for the accurate 3D reconstruction of partly-symmetric objects. We build on the strengths of recent advances in neural reconstruction and rendering such as Neural Radiance Fields (NeRF). A major shortcoming of such approaches is that they fail to reconstruct any part of the object which is not clearly visible in the training images, which is often the case for in-the-wild images and videos. When evidence is lacking, structural priors such as symmetry can be used to complete the missing information. However, exploiting such priors in neural rendering is highly non-trivial: while geometry and non-reflective materials may be symmetric, shadows and reflections from the ambient scene are not symmetric in general. To address this, we apply a soft symmetry constraint to the 3D geometry and material properties, having factored appearance into lighting, albedo colour and reflectivity. We evaluate our method on the recently introduced CO3D dataset, focusing on the car category due to the challenge of reconstructing highly-reflective materials. We show that it can reconstruct unobserved regions with high fidelity and render high-quality novel view images.

* First two authors contributed equally
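
To make the idea of a soft symmetry constraint concrete, here is a minimal sketch of one way such a penalty could look: sample 3D points, mirror them across a plane, and penalise the squared difference between a scalar field evaluated at each point and at its mirror image. The abstract does not give the actual loss, so the function names `reflect` and `soft_symmetry_penalty`, the plane parameterisation, the `weight` argument, and the toy fields are all assumptions for illustration.

```python
import numpy as np

def reflect(points, normal, offset=0.0):
    """Mirror points of shape (N, 3) across the plane {x : n . x = offset}."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    # Signed distance of every point to the plane.
    dist = points @ n - offset
    return points - 2.0 * dist[:, None] * n

def soft_symmetry_penalty(field, points, normal, offset=0.0, weight=1.0):
    """Weighted mean squared difference between a field and its mirrored copy.

    The penalty is 'soft' because it is added to a loss with a finite weight
    rather than enforced exactly, so asymmetric evidence can still win.
    """
    diff = field(points) - field(reflect(points, normal, offset))
    return weight * float(np.mean(diff ** 2))

# Hypothetical toy fields standing in for a learned geometry or albedo network.
pts = np.array([[1.0, 2.0, 3.0], [-0.5, 0.2, 1.0]])
symmetric_field = lambda p: np.linalg.norm(p, axis=1)   # mirror-symmetric about x = 0
asymmetric_field = lambda p: p[:, 0]                     # changes sign under the mirror

penalty_sym = soft_symmetry_penalty(symmetric_field, pts, normal=[1.0, 0.0, 0.0])
penalty_asym = soft_symmetry_penalty(asymmetric_field, pts, normal=[1.0, 0.0, 0.0])
```

The symmetric toy field incurs no penalty, while the asymmetric one does. In the setting the abstract describes, such a term would apply to geometry and material properties only, leaving lighting-dependent effects such as shadows and reflections unconstrained.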

Guess What Moves: Unsupervised Video and Image Segmentation by Anticipating Motion

May 16, 2022

Subhabrata Choudhury, Laurynas Karazija, Iro Laina, Andrea Vedaldi, Christian Rupprecht

Abstract: Motion, measured via optical flow, provides a powerful cue to discover and learn objects in images and videos. However, compared to using appearance, it has some blind spots, such as the fact that objects become invisible if they do not move. In this work, we propose an approach that combines the strengths of motion-based and appearance-based segmentation. We propose to supervise an image segmentation network, tasking it with predicting regions that are likely to contain simple motion patterns, and thus likely to correspond to objects. We apply this network in two modes. In the unsupervised video segmentation mode, the network is trained on a collection of unlabelled videos, using the learning process itself as an algorithm to segment these videos. In the unsupervised image segmentation mode, the network is learned using videos and applied to segment independent still images. With this, we obtain strong empirical results in unsupervised video and image segmentation, significantly outperforming the state of the art on benchmarks such as DAVIS, sometimes with a 5% IoU gap.

Deep Spectral Methods: A Surprisingly Strong Baseline for Unsupervised Semantic Segmentation and Localization

May 16, 2022

Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, Andrea Vedaldi

Abstract: Unsupervised localization and segmentation are long-standing computer vision challenges that involve decomposing an image into semantically meaningful segments without any labeled data. These tasks are particularly interesting in an unsupervised setting due to the difficulty and cost of obtaining dense image annotations, but existing unsupervised approaches struggle with complex scenes containing multiple objects. Unlike existing methods, which are purely based on deep learning, we take inspiration from traditional spectral segmentation methods by reframing image decomposition as a graph partitioning problem. Specifically, we examine the eigenvectors of the Laplacian of a feature affinity matrix from self-supervised networks. We find that these eigenvectors already decompose an image into meaningful segments, and can be readily used to localize objects in a scene. Furthermore, by clustering the features associated with these segments across a dataset, we can obtain well-delineated, nameable regions, i.e. semantic segmentations. Experiments on complex datasets (Pascal VOC, MS-COCO) demonstrate that our simple spectral method outperforms the state of the art in unsupervised localization and segmentation by a significant margin. Furthermore, our method can be readily used for a variety of complex image editing tasks, such as background removal and compositing.

* Published at CVPR 2022. Project Page: https://lukemelas.github.io/deep-spectral-segmentation
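
The graph-partitioning view in the abstract above can be illustrated with a small, generic spectral bipartition. This is a sketch under assumptions, not the paper's pipeline: the function name `spectral_bipartition`, the cosine-similarity affinity, the unnormalised Laplacian, and the hand-made 2D "features" are choices made for this example, whereas the paper works with features from self-supervised networks.

```python
import numpy as np

def spectral_bipartition(features):
    """Split N items into two groups using the Fiedler vector of a feature-affinity graph."""
    # Normalise rows so that dot products are cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Affinity matrix: cosine similarity with negative values clipped to zero.
    W = np.clip(f @ f.T, 0.0, None)
    np.fill_diagonal(W, 0.0)
    # Unnormalised graph Laplacian L = D - W, with D the diagonal degree matrix.
    L = np.diag(W.sum(axis=1)) - W
    # eigh returns eigenvalues in ascending order; column 1 is the eigenvector of the
    # second-smallest eigenvalue (the Fiedler vector).
    _, eigvecs = np.linalg.eigh(L)
    fiedler = eigvecs[:, 1]
    # The sign pattern of the Fiedler vector gives a two-way partition.
    return fiedler > 0

# Hypothetical toy features: three items pointing roughly along one axis,
# three along the other, standing in for per-patch descriptors.
feats = np.array([
    [1.0, 0.10], [1.0, 0.05], [0.9, 0.10],
    [0.1, 1.00], [0.05, 1.0], [0.1, 0.90],
])
segments = spectral_bipartition(feats)
```

The overall sign of an eigenvector is arbitrary, so which group is marked `True` carries no meaning; only the grouping itself does. Further eigenvectors would be needed to separate more than two regions.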

End-to-End Visual Editing with a Generatively Pre-Trained Artist

May 03, 2022

Andrew Brown, Cheng-Yang Fu, Omkar Parkhi, Tamara L. Berg, Andrea Vedaldi

Abstract: We consider the targeted image editing problem: blending a region in a source image with a driver image that specifies the desired change. Unlike prior works, we solve this problem by learning a conditional probability distribution of the edits, end-to-end. Training such a model requires addressing a fundamental technical challenge: the lack of example edits for training. To this end, we propose a self-supervised approach that simulates edits by augmenting off-the-shelf images in a target domain. The benefits are remarkable: implemented as a state-of-the-art auto-regressive transformer, our approach is simple, sidesteps difficulties with previous methods based on GAN-like priors, obtains significantly better edits, and is efficient. Furthermore, we show that different blending effects can be learned by an intuitive control of the augmentation process, with no other changes required to the model architecture. We demonstrate the superiority of this approach across several datasets in extensive quantitative and qualitative experiments, including human studies, significantly outperforming prior work.
