Deva Ramanan

RelPose: Predicting Probabilistic Relative Rotation for Single Objects in the Wild

Aug 11, 2022

Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani

Abstract: We describe a data-driven method for inferring the camera viewpoints given multiple images of an arbitrary object. This task is a core component of classic geometric pipelines such as SfM and SLAM, and also serves as a vital pre-processing requirement for contemporary neural approaches (e.g. NeRF) to object reconstruction and view synthesis. In contrast to existing correspondence-driven methods that do not perform well given sparse views, we propose a top-down prediction based approach for estimating camera viewpoints. Our key technical insight is the use of an energy-based formulation for representing distributions over relative camera rotations, thus allowing us to explicitly represent multiple camera modes arising from object symmetries or views. Leveraging these relative predictions, we jointly estimate a consistent set of camera rotations from multiple images. We show that our approach outperforms state-of-the-art SfM and SLAM methods given sparse images on both seen and unseen categories. Further, our probabilistic approach significantly outperforms directly regressing relative poses, suggesting that modeling multimodality is important for coherent joint reconstruction. We demonstrate that our system can be a stepping stone toward in-the-wild reconstruction from multi-view datasets. The project page with code and videos can be found at https://jasonyzhang.com/relpose.

* In ECCV 2022
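
The energy-based idea in the abstract above can be illustrated with a toy sketch. This is not the authors' implementation: rotations are reduced to a single angle about one axis, and `toy_score` is a hypothetical stand-in for a learned network that scores a candidate relative rotation. The point is only that normalizing scores over a set of candidates gives a distribution that can keep several modes, which a single regressed rotation cannot.

```python
import math


def softmax(scores):
    """Turn unnormalized scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_score(angle_deg, modes=(30.0, 210.0), sharpness=8.0):
    """Hypothetical score: high near either of two symmetric modes.

    In a learned system this would be a network evaluated on an image
    pair and a candidate rotation; here it is a hand-written function.
    """
    def circular_distance(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    return max(
        sharpness * math.cos(math.radians(circular_distance(angle_deg, m)))
        for m in modes
    )


# Candidate relative rotations, simplified to angles about one axis.
candidates = [float(a) for a in range(0, 360, 10)]
probs = softmax([toy_score(a) for a in candidates])

# The two most probable candidates correspond to the two symmetric modes.
top_two = sorted(zip(probs, candidates), reverse=True)[:2]
```

Under these assumptions both modes receive equal probability, which mirrors the ambiguity an object symmetry would create.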

Differentiable Soft-Masked Attention

Jun 01, 2022

Ali Athar, Jonathon Luiten, Alexander Hermans, Deva Ramanan, Bastian Leibe

Abstract: Transformers have become prevalent in computer vision due to their performance and flexibility in modelling complex operations. Of particular significance is the 'cross-attention' operation, which allows a vector representation (e.g. of an object in an image) to be learned by attending to an arbitrarily sized set of input features. Recently, "Masked Attention" was proposed in which a given object representation only attends to those image pixel features for which the segmentation mask of that object is active. This specialization of attention proved beneficial for various image and video segmentation tasks. In this paper, we propose another specialization of attention which enables attending over 'soft-masks' (those with continuous mask probabilities instead of binary values), and is also differentiable through these mask probabilities, thus allowing the mask used for attention to be learned within the network without requiring direct loss supervision. This can be useful for several applications. Specifically, we employ our "Differentiable Soft-Masked Attention" for the task of Weakly-Supervised Video Object Segmentation (VOS), where we develop a transformer-based network for VOS which only requires a single annotated image frame for training, but can also benefit from cycle consistency training on a video with just one annotated frame. Although there is no loss for masks in unlabeled frames, the network is still able to segment objects in those frames due to our novel attention formulation.

* arXiv admin note: text overlap with arXiv:2112.09131
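
One way to picture attention over soft masks is the sketch below. It is an assumption for illustration, not the formulation from the paper: the log of each mask probability is added to the attention logit, so a probability near 1 leaves the logit unchanged and a probability near 0 suppresses that feature. Because the log term is a smooth function of the mask probability, gradients could in principle flow back into the mask. The function name and the `eps` constant are hypothetical.

```python
import math


def soft_masked_attention(query, keys, values, mask_probs, eps=1e-9):
    """Single-query attention whose logits are biased by log mask probabilities.

    query: list of floats (length d)
    keys, values: lists of vectors, one per input feature
    mask_probs: one probability in [0, 1] per input feature
    """
    d = len(query)
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) + math.log(m + eps)
        for key, m in zip(keys, mask_probs)
    ]
    # Numerically stable softmax over the biased logits.
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of value vectors.
    output = [
        sum(w * v[j] for w, v in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 1.0]

# A mask of all ones behaves like plain attention.
_, plain_weights = soft_masked_attention(query, keys, values, [1.0, 1.0])
# A mask probability of zero on the second feature removes it almost entirely.
_, masked_weights = soft_masked_attention(query, keys, values, [1.0, 0.0])
```

With a binary mask this sketch reduces to hard masked attention, which is why it is a convenient mental model for the soft case.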

Robust Modeling and Controls for Racing on the Edge

May 22, 2022

Joshua Spisak, Andrew Saba, Nayana Suvarna, Brian Mao, Chuan Tian Zhang, Chris Chang, Sebastian Scherer, Deva Ramanan

Abstract: Race cars are routinely driven to the edge of their handling limits in dynamic scenarios well above 200 mph. Similar challenges are posed in autonomous racing, where a software stack, instead of a human driver, interacts within a multi-agent environment. For an Autonomous Racing Vehicle (ARV), operating at the edge of handling limits and acting safely in these dynamic environments is still an unsolved problem. In this paper, we present a baseline controls stack for an ARV capable of operating safely up to 140 mph. Additionally, limitations in the current approach are discussed to highlight the need for improved dynamics modeling and learning.

Forecasting from LiDAR via Future Object Detection

Mar 31, 2022

Neehar Peri, Jonathon Luiten, Mengtian Li, Aljoša Ošep, Laura Leal-Taixé, Deva Ramanan

Abstract: Object detection and forecasting are fundamental components of embodied perception. These two problems, however, are largely studied in isolation by the community. In this paper, we propose an end-to-end approach for detection and motion forecasting based on raw sensor measurement as opposed to ground truth tracks. Instead of predicting the current frame locations and forecasting forward in time, we directly predict future object locations and backcast to determine where each trajectory began. Our approach not only improves overall accuracy compared to other modular or end-to-end baselines, it also prompts us to rethink the role of explicit tracking for embodied perception. Additionally, by linking future and current locations in a many-to-one manner, our approach is able to reason about multiple futures, a capability that was previously considered difficult for end-to-end approaches. We conduct extensive experiments on the popular nuScenes dataset and demonstrate the empirical effectiveness of our approach. In addition, we investigate the appropriateness of reusing standard forecasting metrics for an end-to-end setup, and find a number of limitations which allow us to build simple baselines to game these metrics. We address this issue with a novel set of joint forecasting and detection metrics that extend the commonly used AP metrics from the detection community to measuring forecasting accuracy. Our code is available at https://github.com/neeharperi/FutureDet

* This work has been accepted to Computer Vision and Pattern Recognition (CVPR) 2022
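
The backcasting and many-to-one linking described in the abstract above can be sketched as follows. This is a simplified illustration, not the FutureDet code: detections are 2D centers, each future detection carries a hypothetical `backcast_offset` pointing to where its trajectory is predicted to have started, and `max_dist` is an assumed matching threshold.

```python
import math


def link_futures_to_current(future_dets, current_centers, max_dist=2.0):
    """Link each future detection to the current detection it backcasts to.

    Several future detections may map to the same current detection,
    which is how one object can carry multiple candidate futures.
    """
    links = {i: [] for i in range(len(current_centers))}
    for fut in future_dets:
        fx, fy = fut["center"]
        dx, dy = fut["backcast_offset"]
        origin = (fx + dx, fy + dy)  # predicted starting position
        nearest = min(
            range(len(current_centers)),
            key=lambda i: math.dist(origin, current_centers[i]),
        )
        if math.dist(origin, current_centers[nearest]) <= max_dist:
            links[nearest].append(fut["center"])
    return links


current_centers = [(0.0, 0.0), (10.0, 0.0)]
future_dets = [
    {"center": (3.0, 0.0), "backcast_offset": (-3.0, 0.0)},   # object 0 goes straight
    {"center": (0.0, 3.0), "backcast_offset": (0.0, -3.0)},   # object 0 turns
    {"center": (13.0, 0.0), "backcast_offset": (-3.5, 0.0)},  # object 1 goes straight
]
links = link_futures_to_current(future_dets, current_centers)
```

Here the first current detection ends up with two linked futures, a small example of the many-to-one association.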

Long-Tailed Recognition via Weight Balancing

Mar 27, 2022

Shaden Alshammari, Yu-Xiong Wang, Deva Ramanan, Shu Kong

Abstract: In the real open world, data tends to follow long-tailed class distributions, motivating the well-studied long-tailed recognition (LTR) problem. Naive training produces models that are biased toward common classes in terms of higher accuracy. The key to addressing LTR is to balance various aspects including data distribution, training losses, and gradients in learning. We explore an orthogonal direction, weight balancing, motivated by the empirical observation that the naively trained classifier has "artificially" larger weights in norm for common classes (because there exists abundant data to train them, unlike the rare classes). We investigate three techniques to balance weights: L2-normalization, weight decay, and MaxNorm. We first point out that L2-normalization "perfectly" balances per-class weights to be unit norm, but such a hard constraint might prevent classes from learning better classifiers. In contrast, weight decay penalizes larger weights more heavily and so learns small balanced weights; the MaxNorm constraint encourages growing small weights within a norm ball but caps all the weights by the radius. Our extensive study shows that both help learn balanced weights and greatly improve the LTR accuracy. Surprisingly, weight decay, although underexplored in LTR, significantly improves over prior work. Therefore, we adopt a two-stage training paradigm and propose a simple approach to LTR: (1) learning features using the cross-entropy loss by tuning weight decay, and (2) learning classifiers using class-balanced loss by tuning weight decay and MaxNorm. Our approach achieves the state-of-the-art accuracy on five standard benchmarks, serving as a future baseline for long-tailed recognition.

* CVPR 2022 paper; github: https://github.com/ShadeAlsha/LTR-weight-balancing
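
The three weight-balancing techniques named in the abstract above can be compared on a toy pair of per-class weight vectors. The sketch below is a minimal illustration written for this listing, not code from the linked repository; the function names, learning rate, and radius are assumptions.

```python
import math


def l2_norm(w):
    return math.sqrt(sum(x * x for x in w))


def l2_normalize(w, eps=1e-12):
    """Hard constraint: every class weight vector is rescaled to unit norm."""
    n = l2_norm(w)
    return [x / (n + eps) for x in w]


def sgd_step_with_weight_decay(w, grad, lr, weight_decay):
    """Soft penalty: the decay term shrinks large weights more in absolute terms."""
    return [x - lr * (g + weight_decay * x) for x, g in zip(w, grad)]


def max_norm_project(w, radius):
    """Cap: vectors inside the norm ball are untouched, larger ones are scaled back."""
    n = l2_norm(w)
    if n <= radius:
        return list(w)
    return [x * radius / n for x in w]


common_class = [3.0, 4.0]  # norm 5.0, typical of a frequent class
rare_class = [0.3, 0.4]    # norm 0.5, typical of a rare class

capped_common = max_norm_project(common_class, radius=1.0)
capped_rare = max_norm_project(rare_class, radius=1.0)
unit_rare = l2_normalize(rare_class)
decayed_common = sgd_step_with_weight_decay(
    common_class, grad=[0.0, 0.0], lr=0.1, weight_decay=0.5
)
```

In this toy case MaxNorm caps the common-class vector at the radius and leaves the rare-class vector free to grow, whereas L2-normalization forces both to norm 1.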

The CLEAR Benchmark: Continual LEArning on Real-World Imagery

Jan 17, 2022

Zhiqiu Lin, Jia Shi, Deepak Pathak, Deva Ramanan

Abstract: Continual learning (CL) is widely regarded as a crucial challenge for lifelong AI. However, existing CL benchmarks, e.g. Permuted-MNIST and Split-CIFAR, make use of artificial temporal variation and do not align with or generalize to the real world. In this paper, we introduce CLEAR, the first continual image classification benchmark dataset with a natural temporal evolution of visual concepts in the real world that spans a decade (2004-2014). We build CLEAR from existing large-scale image collections (YFCC100M) through a novel and scalable low-cost approach to visio-linguistic dataset curation. Our pipeline makes use of pretrained vision-language models (e.g. CLIP) to interactively build labeled datasets, which are further validated with crowd-sourcing to remove errors and even inappropriate images (hidden in original YFCC100M). The major strength of CLEAR over prior CL benchmarks is the smooth temporal evolution of visual concepts with real-world imagery, including both high-quality labeled data along with abundant unlabeled samples per time period for continual semi-supervised learning. We find that a simple unsupervised pre-training step can already boost state-of-the-art CL algorithms that only utilize fully-supervised data. Our analysis also reveals that mainstream CL evaluation protocols that train and test on iid data artificially inflate performance of CL systems. To address this, we propose novel "streaming" protocols for CL that always test on the (near) future. Interestingly, streaming protocols (a) can simplify dataset curation since today's testset can be repurposed for tomorrow's trainset and (b) can produce more generalizable models with more accurate estimates of performance since all labeled data from each time-period is used for both training and testing (unlike classic iid train-test splits).
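
The "streaming" evaluation idea above, where a model is always tested on the next time period and that test set later becomes training data, can be sketched as a split generator. This is an illustrative reading of the abstract, not the benchmark's official protocol code; in particular, training on all past buckets cumulatively is an assumption.

```python
def streaming_splits(buckets):
    """Yield (train, test) pairs where the test set is always the next time bucket.

    buckets: list of lists of samples, ordered by time period.
    Each bucket is used for testing first and for training afterwards.
    """
    for t in range(len(buckets) - 1):
        train = [sample for bucket in buckets[: t + 1] for sample in bucket]
        test = list(buckets[t + 1])
        yield train, test


# Three hypothetical time periods with placeholder sample names.
buckets = [["img_2004"], ["img_2005"], ["img_2006"]]
splits = list(streaming_splits(buckets))
```

Every labeled sample after the first bucket serves as test data once and as training data from then on, which is the reuse the abstract points to.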

BANMo: Building Animatable 3D Neural Models from Many Casual Videos

Dec 24, 2021

Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea Vedaldi, Hanbyul Joo

Abstract: Prior work for articulated 3D shape reconstruction often relies on specialized sensors (e.g., synchronized multi-camera systems), or pre-built 3D deformable models (e.g., SMAL or SMPL). Such methods are not able to scale to diverse sets of objects in the wild. We present BANMo, a method that requires neither a specialized sensor nor a pre-defined template shape. BANMo builds high-fidelity, articulated 3D models (including shape and animatable skinning weights) from many monocular casual videos in a differentiable rendering framework. While the use of many videos provides more coverage of camera views and object articulations, they introduce significant challenges in establishing correspondence across scenes with different backgrounds, illumination conditions, etc. Our key insight is to merge three schools of thought: (1) classic deformable shape models that make use of articulated bones and blend skinning, (2) volumetric neural radiance fields (NeRFs) that are amenable to gradient-based optimization, and (3) canonical embeddings that generate correspondences between pixels and an articulated model. We introduce neural blend skinning models that allow for differentiable and invertible articulated deformations. When combined with canonical embeddings, such models allow us to establish dense correspondences across videos that can be self-supervised with cycle consistency. On real and synthetic datasets, BANMo shows higher-fidelity 3D reconstructions than prior works for humans and animals, with the ability to render realistic images from novel viewpoints and poses. Project webpage: banmo-www.github.io.

* Modified Sec. 3.2 deformation model and Sec. 3.4 active sampling

Mega-NeRF: Scalable Construction of Large-Scale NeRFs for Virtual Fly-Throughs

Dec 20, 2021

Haithem Turki, Deva Ramanan, Mahadev Satyanarayanan

Abstract: We explore how to leverage neural radiance fields (NeRFs) to build interactive 3D environments from large-scale visual captures spanning buildings or even multiple city blocks collected primarily from drone data. In contrast to the single object scenes against which NeRFs have been traditionally evaluated, this setting poses multiple challenges including (1) the need to incorporate thousands of images with varying lighting conditions, all of which capture only a small subset of the scene, (2) prohibitively high model capacity and ray sampling requirements beyond what can be naively trained on a single GPU, and (3) an arbitrarily large number of possible viewpoints that make it infeasible to precompute all relevant information beforehand (as real-time NeRF renderers typically do). To address these challenges, we begin by analyzing visibility statistics for large-scale scenes, motivating a sparse network structure where parameters are specialized to different regions of the scene. We introduce a simple geometric clustering algorithm that partitions training images (or rather pixels) into different NeRF submodules that can be trained in parallel. We evaluate our approach across scenes taken from the Quad 6k and UrbanScene3D datasets as well as against our own drone footage and show a 3x training speedup while improving PSNR by over 11% on average. We subsequently perform an empirical evaluation of recent NeRF fast renderers on top of Mega-NeRF and introduce a novel method that exploits temporal coherence. Our technique achieves a 40x speedup over conventional NeRF rendering while remaining within 0.5 dB in PSNR quality, exceeding the fidelity of existing fast renderers.

* Project page: https://meganerf.cmusatyalab.org/ GitHub: https://github.com/cmusatyalab/mega-nerf
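
The geometric partitioning mentioned in the abstract above can be pictured with a nearest-centroid assignment. The sketch below is a simplification for illustration and not the Mega-NeRF implementation: it assigns 3D sample points (rather than pixels or rays) to the closest of a few hypothetical submodule centroids, so that each group could be handed to a separate model trained in parallel.

```python
import math


def assign_to_submodules(points, centroids):
    """Group 3D points by their nearest centroid (one centroid per submodule)."""
    groups = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: math.dist(p, centroids[i]),
        )
        groups[nearest].append(p)
    return groups


# Two hypothetical submodule centroids covering the west and east of a scene.
centroids = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
points = [(10.0, 5.0, 2.0), (90.0, -3.0, 1.0), (40.0, 0.0, 0.0)]
groups = assign_to_submodules(points, centroids)
```

Each group depends only on its own points, which is what makes independent, parallel training of the submodules possible in this simplified picture.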

HODOR: High-level Object Descriptors for Object Re-segmentation in Video Learned from Static Images

Dec 16, 2021

Ali Athar, Jonathon Luiten, Alexander Hermans, Deva Ramanan, Bastian Leibe

Abstract: Existing state-of-the-art methods for Video Object Segmentation (VOS) learn low-level pixel-to-pixel correspondences between frames to propagate object masks across video. This requires a large amount of densely annotated video data, which is costly to annotate, and largely redundant since frames within a video are highly correlated. In light of this, we propose HODOR: a novel method that tackles VOS by effectively leveraging annotated static images for understanding object appearance and scene context. We encode object instances and scene information from an image frame into robust high-level descriptors which can then be used to re-segment those objects in different frames. As a result, HODOR achieves state-of-the-art performance on the DAVIS and YouTube-VOS benchmarks compared to existing methods trained without video annotations. Without any architectural modification, HODOR can also learn from video context around single annotated video frames by utilizing cyclic consistency, whereas other methods rely on dense, temporally consistent annotations.

NeRS: Neural Reflectance Surfaces for Sparse-view 3D Reconstruction in the Wild

Oct 18, 2021

Jason Y. Zhang, Gengshan Yang, Shubham Tulsiani, Deva Ramanan

Abstract: Recent history has seen a tremendous growth of work exploring implicit representations of geometry and radiance, popularized through Neural Radiance Fields (NeRF). Such works are fundamentally based on an (implicit) volumetric representation of occupancy, allowing them to model diverse scene structure including translucent objects and atmospheric obscurants. But because the vast majority of real-world scenes are composed of well-defined surfaces, we introduce a surface analog of such implicit models called Neural Reflectance Surfaces (NeRS). NeRS learns a neural shape representation of a closed surface that is diffeomorphic to a sphere, guaranteeing water-tight reconstructions. Even more importantly, surface parameterizations allow NeRS to learn (neural) bidirectional surface reflectance functions (BRDFs) that factorize view-dependent appearance into environmental illumination, diffuse color (albedo), and specular "shininess." Finally, rather than illustrating our results on synthetic scenes or controlled in-the-lab capture, we assemble a novel dataset of multi-view images from online marketplaces for selling goods. Such "in-the-wild" multi-view image sets pose a number of challenges, including a small number of views with unknown/rough camera estimates. We demonstrate that surface-based neural reconstructions enable learning from such data, outperforming volumetric neural rendering-based reconstructions. We hope that NeRS serves as a first step toward building scalable, high-quality libraries of real-world shape, materials, and illumination. The project page with code and video visualizations can be found at https://jasonyzhang.com/ners.

* In NeurIPS 2021. v2-3: Fixed minor typos
