Natalia Neverova

DynamicStereo: Consistent Dynamic Depth from Stereo Videos

May 03, 2023

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht

Abstract: We consider the problem of reconstructing a dynamic scene observed from a stereo camera. Most existing methods for depth from stereo treat different stereo frames independently, leading to temporally inconsistent depth predictions. Temporal consistency is especially important for immersive AR or VR scenarios, where flickering greatly diminishes the user experience. We propose DynamicStereo, a novel transformer-based architecture to estimate disparity for stereo videos. The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions. Our architecture is designed to process stereo videos efficiently through divided attention layers. We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments, which provides complementary training and evaluation data for dynamic stereo closer to real applications than existing datasets. Training with this dataset further improves the quality of predictions of our proposed DynamicStereo as well as prior methods. Finally, it acts as a benchmark for consistent stereo methods.

* CVPR 2023; project page available at https://dynamic-stereo.github.io/
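
The DynamicStereo abstract speaks of estimating disparity in order to recover depth. For a rectified stereo pair the two are related by the standard pinhole formula depth = focal length x baseline / disparity. The sketch below illustrates only that conversion; the function name, parameter names, and the handling of near-zero disparity are assumptions made for this example and are not taken from the paper.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m, min_disparity=1e-6):
    """Convert one disparity value (pixels) to depth (meters).

    Assumes a rectified stereo pair with a known focal length (pixels)
    and camera baseline (meters). All names here are illustrative.
    """
    if disparity_px <= min_disparity:
        # Zero or negative disparity corresponds to a point at infinity
        # (or an invalid match); report it as infinitely far away.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Example: a 50-pixel disparity seen by a camera with a 1000-pixel focal
# length and a 10 cm baseline.
depth_m = disparity_to_depth(50.0, 1000.0, 0.1)
```

Because depth is inversely proportional to disparity, small frame-to-frame disparity errors on distant points turn into large depth changes, which is one way to see why temporal consistency matters.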

Real-time volumetric rendering of dynamic humans

Mar 21, 2023

Ignacio Rocco, Iurii Makarov, Filippos Kokkinos, David Novotny, Benjamin Graham, Natalia Neverova, Andrea Vedaldi

Abstract: We present a method for fast 3D reconstruction and real-time rendering of dynamic humans from monocular videos with accompanying parametric body fits. Our method can reconstruct a dynamic human in less than 3h using a single GPU, compared to recent state-of-the-art alternatives that take up to 72h. These speedups are obtained by using a lightweight deformation model solely based on linear blend skinning, and an efficient factorized volumetric representation for modeling the shape and color of the person in canonical pose. Moreover, we propose a novel local ray marching rendering which, by exploiting standard GPU hardware and without any baking or conversion of the radiance field, allows visualizing the neural human on a mobile VR device at 40 frames per second with minimal loss of visual quality. Our experimental evaluation shows superior or competitive results with state-of-the-art methods while obtaining a large training speedup, using a simple model, and achieving real-time rendering.

* Project page: https://real-time-humans.github.io/
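
The abstract above names linear blend skinning as its deformation model. As a minimal sketch of the general technique (not the authors' implementation), a canonical-pose point is moved by each bone's rigid transform and the results are averaged with per-point skinning weights. The helper names, the two-bone setup, and the use of plain Python lists are assumptions chosen for illustration.

```python
import math


def rotation_z(angle_rad):
    """3x3 rotation matrix about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def apply_rigid(rotation, translation, point):
    """Apply a rigid transform: rotation @ point + translation."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def linear_blend_skinning(point, weights, bone_transforms):
    """Blend the per-bone rigid motions of one canonical-pose point.

    weights: one skinning weight per bone, expected to sum to 1.
    bone_transforms: list of (rotation, translation) pairs, one per bone.
    """
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("skinning weights must sum to 1")
    posed = [0.0, 0.0, 0.0]
    for weight, (rotation, translation) in zip(weights, bone_transforms):
        moved = apply_rigid(rotation, translation, point)
        posed = [p + weight * m for p, m in zip(posed, moved)]
    return posed


# Hypothetical two-bone example: one bone stays put, the other shifts by
# 2 units along x. A point influenced equally by both moves by 1 unit.
identity = rotation_z(0.0)
bones = [(identity, [0.0, 0.0, 0.0]), (identity, [2.0, 0.0, 0.0])]
posed_point = linear_blend_skinning([1.0, 1.0, 1.0], [0.5, 0.5], bones)
```

The deformation is a weighted sum of a few rigid transforms per point, which is why it is cheap to evaluate compared with learned deformation networks.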

Novel-View Acoustic Synthesis

Jan 23, 2023

Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi

Abstract: We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benchmark this task, we collect two first-of-their-kind large-scale multi-view audio-visual datasets, one synthetic and one real. We show that our model successfully reasons about the spatial cues and synthesizes faithful audio on both datasets. To our knowledge, this work represents the very first formulation, dataset, and approach to solve the novel-view acoustic synthesis task, which has exciting potential applications ranging from AR/VR to art and design. Unlocked by this work, we believe that the future of novel-view synthesis is in multi-modal learning from videos.

* Project page: https://vision.cs.utexas.edu/projects/nvas

Self-Supervised Correspondence Estimation via Multiview Registration

Dec 06, 2022

Mohamed El Banani, Ignacio Rocco, David Novotny, Andrea Vedaldi, Natalia Neverova, Justin Johnson, Benjamin Graham

Abstract: Video provides us with the spatio-temporal consistency needed for visual learning. Recent approaches have utilized this signal to learn correspondence estimation from close-by frame pairs. However, by only relying on close-by frame pairs, those approaches miss out on the richer long-range consistency between distant overlapping frames. To address this, we propose a self-supervised approach for correspondence estimation that learns from multiview consistency in short RGB-D video sequences. Our approach combines pairwise correspondence estimation and registration with a novel SE(3) transformation synchronization algorithm. Our key insight is that self-supervised multiview registration allows us to obtain correspondences over longer time frames, increasing both the diversity and difficulty of sampled pairs. We evaluate our approach on indoor scenes for correspondence estimation and RGB-D pointcloud registration and find that we perform on par with supervised approaches.

* Accepted to WACV 2023. Project page: https://mbanani.github.io/syncmatch/

Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories

Nov 07, 2022

Samarth Sinha, Roman Shapovalov, Jeremy Reizenstein, Ignacio Rocco, Natalia Neverova, Andrea Vedaldi, David Novotny

Abstract: Obtaining photorealistic reconstructions of objects from sparse views is inherently ambiguous and can only be achieved by learning suitable reconstruction priors. Earlier works on sparse rigid object reconstruction successfully learned such priors from large datasets such as CO3D. In this paper, we extend this approach to dynamic objects. We use cats and dogs as a representative example and introduce Common Pets in 3D (CoP3D), a collection of crowd-sourced videos showing around 4,200 distinct pets. CoP3D is one of the first large-scale datasets for benchmarking non-rigid 3D reconstruction "in the wild". We also propose Tracker-NeRF, a method for learning 4D reconstruction from our dataset. At test time, given a small number of video frames of an unseen object, Tracker-NeRF predicts the trajectories of its 3D points and generates new views, interpolating viewpoint and time. Results on CoP3D reveal significantly better non-rigid new-view synthesis performance than existing baselines.

Filtered-CoPhy: Unsupervised Learning of Counterfactual Physics in Pixel Space

Feb 01, 2022

Steeven Janny, Fabien Baradel, Natalia Neverova, Madiha Nadri, Greg Mori, Christian Wolf

Abstract: Learning causal relationships in high-dimensional data (images, videos) is a hard task, as they are often defined on low dimensional manifolds and must be extracted from complex signals dominated by appearance, lighting, textures and also spurious correlations in the data. We present a method for learning counterfactual reasoning of physical processes in pixel space, which requires the prediction of the impact of interventions on initial conditions. Going beyond the identification of structural relationships, we deal with the challenging problem of forecasting raw video over long horizons. Our method does not require the knowledge or supervision of any ground truth positions or other object or scene properties. Our model learns and acts on a suitable hybrid latent representation based on a combination of dense features, sets of 2D keypoints and an additional latent vector per keypoint. We show that this better captures the dynamics of physical processes than purely dense or sparse representations. We introduce a new challenging and carefully designed counterfactual benchmark for predictions in pixel space and outperform strong baselines in physics-inspired ML and video prediction.

BANMo: Building Animatable 3D Neural Models from Many Casual Videos

Dec 24, 2021

Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi, Hanbyul Joo

Abstract: Prior work for articulated 3D shape reconstruction often relies on specialized sensors (e.g., synchronized multi-camera systems), or pre-built 3D deformable models (e.g., SMAL or SMPL). Such methods are not able to scale to diverse sets of objects in the wild. We present BANMo, a method that requires neither a specialized sensor nor a pre-defined template shape. BANMo builds high-fidelity, articulated 3D models (including shape and animatable skinning weights) from many monocular casual videos in a differentiable rendering framework. While the use of many videos provides more coverage of camera views and object articulations, they introduce significant challenges in establishing correspondence across scenes with different backgrounds, illumination conditions, etc. Our key insight is to merge three schools of thought: (1) classic deformable shape models that make use of articulated bones and blend skinning, (2) volumetric neural radiance fields (NeRFs) that are amenable to gradient-based optimization, and (3) canonical embeddings that generate correspondences between pixels and an articulated model. We introduce neural blend skinning models that allow for differentiable and invertible articulated deformations. When combined with canonical embeddings, such models allow us to establish dense correspondences across videos that can be self-supervised with cycle consistency. On real and synthetic datasets, BANMo shows higher-fidelity 3D reconstructions than prior works for humans and animals, with the ability to render realistic images from novel viewpoints and poses. Project webpage: banmo-www.github.io.

* Modified Sec. 3.2 deformation model and Sec. 3.4 active sampling

XCiT: Cross-Covariance Image Transformers

Jun 18, 2021

Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek (+1 more)

Abstract: Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e., words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
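
To make the "transposed" attention idea concrete, here is a rough single-head sketch in plain Python. It builds a d x d channel-to-channel attention map from the keys and queries instead of an N x N token-to-token map, so the cost grows linearly with the number of tokens N. The L2 normalization over tokens, the temperature parameter, the softmax axis, and all function names are assumptions for illustration; the abstract does not specify them, and this is not the official XCiT code.

```python
import math


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def matmul(a, b):
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


def l2_normalize_rows(matrix, eps=1e-12):
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row)) + eps
        normalized.append([x / norm for x in row])
    return normalized


def softmax_rows(matrix):
    result = []
    for row in matrix:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def cross_covariance_attention(q, k, v, temperature=1.0):
    """q, k, v: N tokens, each a list of d channels.

    Returns (output with shape N x d, attention map with shape d x d).
    """
    # Channel-major layout: each row holds one channel across all tokens.
    q_c = l2_normalize_rows(transpose(q))  # d x N
    k_c = l2_normalize_rows(transpose(k))  # d x N
    v_c = transpose(v)                     # d x N
    # Cross-covariance between channels: d x d, independent of N.
    scores = matmul(q_c, transpose(k_c))
    attn = softmax_rows([[s / temperature for s in row] for row in scores])
    # Mix the value channels with the d x d map, then go back to N x d.
    return transpose(matmul(attn, v_c)), attn


# Toy input: N = 4 tokens with d = 2 channels each.
tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
output, attention = cross_covariance_attention(tokens, tokens, tokens)
```

Doubling the number of tokens leaves the attention map at d x d and only lengthens the sums inside the matrix products, which is the source of the linear scaling the abstract describes.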

Discovering Relationships between Object Categories via Universal Canonical Maps

Jun 17, 2021

Natalia Neverova, Artsiom Sanakoyeu, Patrick Labatut, David Novotny, Andrea Vedaldi

Abstract: We tackle the problem of learning the geometry of multiple categories of deformable objects jointly. Recent work has shown that it is possible to learn a unified dense pose predictor for several categories of related objects. However, training such models requires initializing inter-category correspondences by hand. This is suboptimal and the resulting models fail to maintain correct correspondences as individual categories are learned. In this paper, we show that improved correspondences can be learned automatically as a natural byproduct of learning category-specific dense pose predictors. To do this, we express correspondences between different categories and between images and categories using a unified embedding. Then, we use the latter to enforce two constraints: symmetric inter-category cycle consistency and a new asymmetric image-to-category cycle consistency. Without any manual annotations for the inter-category correspondences, we obtain state-of-the-art alignment results, outperforming dedicated methods for matching 3D shapes. Moreover, the new model is also better at the task of dense pose prediction than prior work.

* Accepted at CVPR 2021; Project page: https://gdude.de/discovering-3d-obj-rel

NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go

Jun 17, 2021

Marvin Eisenberger, David Novotny, Gael Kerchenbaum, Patrick Labatut, Natalia Neverova, Daniel Cremers, Andrea Vedaldi

Abstract: We present NeuroMorph, a new neural network architecture that takes as input two 3D shapes and produces in one go, i.e., in a single feed-forward pass, a smooth interpolation and point-to-point correspondences between them. The interpolation, expressed as a deformation field, changes the pose of the source shape to resemble the target, but leaves the object identity unchanged. NeuroMorph uses an elegant architecture combining graph convolutions with global feature pooling to extract local features. During training, the model is incentivized to create realistic deformations by approximating geodesics on the underlying shape space manifold. This strong geometric prior allows us to train our model end-to-end and in a fully unsupervised manner without requiring any manual correspondence annotations. NeuroMorph works well for a large variety of input shapes, including non-isometric pairs from different object categories. It obtains state-of-the-art results for both shape correspondence and interpolation tasks, matching or surpassing the performance of recent unsupervised and supervised methods on multiple benchmarks.

* Published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021
