Srinath Sridhar

Strata-NeRF : Neural Radiance Fields for Stratified Scenes

Aug 20, 2023
Ankit Dhiman, Srinath R, Harsh Rangwani, Rishubh Parihar, Lokesh R Boregowda, Srinath Sridhar, R Venkatesh Babu

Neural Radiance Field (NeRF) approaches learn the underlying 3D representation of a scene and generate photo-realistic novel views with high fidelity. However, most existing settings concentrate on modelling a single object or a single level of a scene. In the real world, however, we may capture a scene at multiple levels, resulting in a layered capture. For example, tourists usually capture a monument's exterior structure before capturing the inner structure. Modelling such scenes in 3D with seamless switching between levels can drastically improve immersive experiences, yet most existing techniques struggle to model them. We propose Strata-NeRF, a single neural radiance field that implicitly captures a scene with multiple levels. Strata-NeRF achieves this by conditioning the NeRFs on Vector Quantized (VQ) latent representations, which allow sudden changes in scene structure. We evaluate the effectiveness of our approach on a multi-layered synthetic dataset comprising diverse scenes and then further validate its generalization on the real-world RealEstate10K dataset. We find that Strata-NeRF effectively captures stratified scenes, minimizes artifacts, and synthesizes high-fidelity views compared to existing approaches.
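
The level-conditioning idea can be illustrated with a short sketch: a vector-quantized codebook snaps a continuous level embedding to a discrete code, which is concatenated with the encoded sample position before the radiance MLP, so appearance can change abruptly between levels. This is a minimal PyTorch illustration, not the authors' implementation; the module names (LevelCodebook, StrataMLP) and all dimensions are assumptions.

```python
# Minimal sketch (not the authors' code): conditioning a NeRF MLP on a
# vector-quantized per-level latent. Names and sizes are illustrative.
import torch
import torch.nn as nn


class LevelCodebook(nn.Module):
    """Maps a continuous level embedding to its nearest codebook vector (VQ)."""
    def __init__(self, num_codes=64, dim=128):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)

    def forward(self, z):                      # z: (B, dim)
        d = torch.cdist(z, self.codes.weight)  # distances to all codes
        idx = d.argmin(dim=-1)                 # nearest code per sample
        z_q = self.codes(idx)
        # straight-through estimator so gradients flow to the encoder
        return z + (z_q - z).detach()


class StrataMLP(nn.Module):
    """NeRF-style MLP that also consumes the quantized level latent."""
    def __init__(self, pos_dim=63, latent_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pos_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                  # (r, g, b, sigma)
        )

    def forward(self, x_enc, z_q):
        return self.net(torch.cat([x_enc, z_q], dim=-1))


codebook, field = LevelCodebook(), StrataMLP()
z = torch.randn(1024, 128)          # per-ray level embedding (assumed)
x = torch.randn(1024, 63)           # positionally encoded sample points
rgb_sigma = field(x, codebook(z))   # (1024, 4)
```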

* ICCV 2023, Project Page: https://ankitatiisc.github.io/Strata-NeRF/ 

DiVA-360: The Dynamic Visuo-Audio Dataset for Immersive Neural Fields

Jul 31, 2023
Cheng-You Lu, Peisen Zhou, Angela Xing, Chandradeep Pokhariya, Arnab Dey, Ishaan Shah, Rugved Mavidipalli, Dylan Hu, Andrew Comport, Kefan Chen, Srinath Sridhar

Advances in neural fields are enabling high-fidelity capture of the shape and appearance of static and dynamic scenes. However, their capabilities lag behind those offered by representations such as pixels or meshes due to algorithmic challenges and the lack of large-scale real-world datasets. We address the dataset limitation with DiVA-360, a real-world 360 dynamic visual-audio dataset with synchronized multimodal visual, audio, and textual information about table-scale scenes. It contains 46 dynamic scenes, 30 static scenes, and 95 static objects spanning 11 categories, captured with a new hardware system using 53 RGB cameras at 120 FPS and 6 microphones, for a total of 8.6M image frames and 1360 s of dynamic data. We provide detailed text descriptions for all scenes, foreground-background segmentation masks, category-specific 3D pose alignment for static objects, as well as metrics for comparison. Our data, hardware and software, and code are available at https://diva360.github.io/.
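
As a quick sanity check on the reported capture volume, the frame count follows directly from the camera count, frame rate, and recording time (assuming all 53 cameras record the full 1360 s of dynamic data):

```python
# Back-of-the-envelope check of the dynamic-capture volume
# (assumes all 53 RGB cameras record the full 1360 s at 120 FPS).
cameras, fps, seconds = 53, 120, 1360
print(f"{cameras * fps * seconds:,} frames")   # 8,649,600 -> roughly the reported 8.6M
```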


HyP-NeRF: Learning Improved NeRF Priors using a HyperNetwork

Jun 09, 2023
Bipasha Sen, Gaurav Singh, Aditya Agarwal, Rohith Agaram, K Madhava Krishna, Srinath Sridhar

Neural Radiance Fields (NeRF) have become an increasingly popular representation for capturing high-quality appearance and shape of scenes and objects. However, learning generalizable NeRF priors over categories of scenes or objects has been challenging due to the high dimensionality of the network weight space. To address the limitations of existing work in generalization and multi-view consistency, and to improve quality, we propose HyP-NeRF, a latent conditioning method for learning generalizable category-level NeRF priors using hypernetworks. Rather than using hypernetworks to estimate only the weights of a NeRF, we estimate both the weights and the multi-resolution hash encodings, resulting in significant quality gains. To improve quality even further, we incorporate a denoise-and-finetune strategy that denoises images rendered from a NeRF estimated by the hypernetwork and finetunes that NeRF while retaining multi-view consistency. These improvements enable us to use HyP-NeRF as a generalizable prior for multiple downstream tasks, including NeRF reconstruction from single-view or cluttered scenes and text-to-NeRF. We provide qualitative comparisons and evaluate HyP-NeRF on three tasks: generalization, compression, and retrieval, demonstrating state-of-the-art results.
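
A hypernetwork that emits both MLP weights and a hash-encoding table can be sketched as below. This is an illustrative PyTorch outline under assumed sizes, not the authors' implementation; the hash lookup and interpolation step is omitted, and HyperNet, run_nerf, and all dimensions are assumptions.

```python
# Minimal sketch: a hypernetwork mapping a latent code to the weights of a
# small NeRF MLP plus a multi-resolution hash-encoding table.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyperNet(nn.Module):
    def __init__(self, code_dim=128, hidden=64, table_size=2**12, feat=2, levels=8):
        super().__init__()
        self.shapes = {                       # target NeRF MLP parameters
            "w1": (hidden, 16), "b1": (hidden,),
            "w2": (4, hidden),  "b2": (4,),   # outputs (r, g, b, sigma)
        }
        n_mlp = sum(torch.Size(s).numel() for s in self.shapes.values())
        n_hash = levels * table_size * feat
        self.head = nn.Sequential(nn.Linear(code_dim, 256), nn.ReLU(),
                                  nn.Linear(256, n_mlp + n_hash))
        self.n_mlp, self.hash_shape = n_mlp, (levels, table_size, feat)

    def forward(self, code):
        flat = self.head(code)
        mlp_flat, hash_flat = flat[: self.n_mlp], flat[self.n_mlp:]
        params, i = {}, 0
        for k, s in self.shapes.items():
            n = torch.Size(s).numel()
            params[k] = mlp_flat[i:i + n].view(s)
            i += n
        return params, hash_flat.view(self.hash_shape)


def run_nerf(params, feats):                  # feats: hash-encoded points (N, 16)
    h = F.relu(F.linear(feats, params["w1"], params["b1"]))
    return F.linear(h, params["w2"], params["b2"])


hyper = HyperNet()
params, hash_table = hyper(torch.randn(128))    # per-object latent code
out = run_nerf(params, torch.randn(4096, 16))   # (4096, 4)
```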


Semantic Attention Flow Fields for Dynamic Scene Decomposition

Mar 02, 2023
Yiqing Liang, Eliot Laidlaw, Alexander Meyerowitz, Srinath Sridhar, James Tompkin

We present SAFF: a dynamic neural volume reconstruction of a casual monocular video that consists of time-varying color, density, scene flow, semantics, and attention information. The semantics and attention let us identify salient foreground objects separately from the background in arbitrary spacetime views. We add two network heads to represent the semantic and attention information. For optimization, we design semantic attention pyramids from DINO-ViT outputs that trade off detail against whole-image context. After optimization, we perform saliency-aware clustering to decompose the scene. For evaluation of real-world dynamic scene decomposition across spacetime, we annotate object masks in the NVIDIA Dynamic Scene Dataset. We demonstrate that SAFF can decompose dynamic scenes without affecting RGB or depth reconstruction quality, that volume-integrated SAFF outperforms 2D baselines, and that SAFF improves foreground/background segmentation over recent static/dynamic split methods. Project Webpage: https://visual.cs.brown.edu/saff
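
The two extra heads can be sketched as a small addition to a dynamic radiance-field MLP. This is a minimal PyTorch illustration, not the authors' code; SAFFField and all dimensions are assumptions, and volume rendering is only described in the closing comment.

```python
# Minimal sketch: a radiance-field MLP with extra heads for semantic features
# and attention/saliency, to be supervised by 2D DINO-ViT-derived maps.
import torch
import torch.nn as nn


class SAFFField(nn.Module):
    def __init__(self, in_dim=84, hidden=256, sem_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.rgb_sigma = nn.Linear(hidden, 4)        # colour + density
        self.flow = nn.Linear(hidden, 3)             # 3D scene flow
        self.semantic = nn.Linear(hidden, sem_dim)   # semantic feature head
        self.attention = nn.Linear(hidden, 1)        # saliency/attention head

    def forward(self, x):                            # x: encoded (position, time)
        h = self.trunk(x)
        return (self.rgb_sigma(h), self.flow(h),
                self.semantic(h), torch.sigmoid(self.attention(h)))


field = SAFFField()
rgb_sigma, flow, sem, attn = field(torch.randn(2048, 84))
# Volume-rendering each head along rays yields 2D maps that can be compared
# against DINO-ViT semantic-attention pyramids during optimization.
```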


LEGO-Net: Learning Regular Rearrangements of Objects in Rooms

Jan 23, 2023
Qiuhong Anna Wei, Sijie Ding, Jeong Joon Park, Rahul Sajnani, Adrien Poulenard, Srinath Sridhar, Leonidas Guibas

Humans universally dislike the task of cleaning up a messy room. If machines are to help us with this task, they must understand human criteria for regular arrangements, such as several types of symmetry, co-linearity or co-circularity, spacing uniformity in linear or circular patterns, and further inter-object relationships that relate to style and functionality. Previous approaches to this task relied on human input to explicitly specify the goal state, or synthesized scenes from scratch; such methods do not address the rearrangement of existing messy scenes when no goal state is provided. In this paper, we present LEGO-Net, a data-driven, transformer-based iterative method for learning regular rearrangement of objects in messy rooms. LEGO-Net is partly inspired by diffusion models: it starts with an initial messy state and iteratively "de-noises" the position and orientation of objects towards a regular state while reducing the distance traveled. Given randomly perturbed object positions and orientations in an existing dataset of professionally arranged scenes, our method is trained to recover a regular rearrangement. Results demonstrate that our method reliably rearranges room scenes and outperforms other methods. We additionally propose a metric for evaluating regularity in room arrangements using number-theoretic machinery.
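
The iterative "de-noising" loop can be illustrated with a short sketch: a transformer predicts a small pose update for every object, applied repeatedly until the arrangement stops changing. This is a minimal PyTorch outline, not the authors' implementation; DenoiseStep, rearrange, and all hyperparameters are assumptions.

```python
# Minimal sketch of iterative pose de-noising toward a regular arrangement.
import torch
import torch.nn as nn


class DenoiseStep(nn.Module):
    """Predicts per-object (dx, dy, dtheta) from the current arrangement."""
    def __init__(self, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(3, d_model)           # (x, y, theta) per object
        enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, n_layers)
        self.out = nn.Linear(d_model, 3)

    def forward(self, poses):                        # poses: (B, N, 3)
        return self.out(self.encoder(self.embed(poses)))


@torch.no_grad()
def rearrange(model, poses, steps=50, step_size=0.1, tol=1e-3):
    for _ in range(steps):
        delta = model(poses)
        poses = poses + step_size * delta            # small move toward regularity
        if delta.abs().max() < tol:                  # converged to a regular state
            break
    return poses


model = DenoiseStep()
messy = torch.rand(1, 12, 3)                         # 12 objects, random poses
clean = rearrange(model, messy)
```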

* Project page: https://ivl.cs.brown.edu/projects/lego-net 

SCARP: 3D Shape Completion in ARbitrary Poses for Improved Grasping

Jan 17, 2023
Bipasha Sen, Aditya Agarwal, Gaurav Singh, Brojeshwar B., Srinath Sridhar, Madhava Krishna

Recovering full 3D shapes from partial observations is a challenging task that has been extensively addressed in the computer vision community. Many deep learning methods tackle this problem by training 3D shape generation networks to learn a prior over the full 3D shapes. In this training regime, the methods expect the inputs to be in a fixed canonical form, without which they fail to learn a valid prior over the 3D shapes. We propose SCARP, a model that performs Shape Completion in ARbitrary Poses. Given a partial point cloud of an object, SCARP learns a disentangled feature representation of pose and shape by relying on rotationally equivariant pose features and geometric shape features trained using a multi-tasking objective. Unlike existing methods that depend on an external canonicalization, SCARP performs canonicalization, pose estimation, and shape completion in a single network, improving performance by 45% over existing baselines. In this work, we use SCARP to improve grasp proposals on tabletop objects. By completing partial tabletop objects directly in their observed poses, SCARP enables a state-of-the-art grasp proposal network to improve its proposals by 71.2% on partial shapes. Project page: https://bipashasen.github.io/scarp
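
The disentangled pose/shape branches trained with a multi-task objective can be sketched as below. This is an illustrative PyTorch outline, not the authors' code: SCARP's rotation-equivariant backbone is replaced by a plain PointNet-style encoder, the Chamfer distance by an MSE stand-in, and ShapePoseNet and all dimensions are assumptions.

```python
# Minimal sketch: one network with pose and shape-completion heads,
# trained with a combined (multi-task) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShapePoseNet(nn.Module):
    def __init__(self, feat=256, n_out=1024):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                     nn.Linear(128, feat))
        self.pose_head = nn.Linear(feat, 6)            # 6D rotation representation
        self.shape_head = nn.Linear(feat, n_out * 3)   # completed point cloud
        self.n_out = n_out

    def forward(self, pts):                            # pts: (B, N, 3) partial cloud
        f = self.encoder(pts).max(dim=1).values        # permutation-invariant pooling
        return self.pose_head(f), self.shape_head(f).view(-1, self.n_out, 3)


def multitask_loss(pose_pred, pose_gt, shape_pred, shape_gt):
    pose_loss = F.mse_loss(pose_pred, pose_gt)
    shape_loss = F.mse_loss(shape_pred, shape_gt)      # stand-in for Chamfer distance
    return pose_loss + shape_loss


net = ShapePoseNet()
pose, shape = net(torch.randn(4, 512, 3))              # (4, 6), (4, 1024, 3)
```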

* Accepted at ICRA 2023 

Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields

Dec 05, 2022
Rohith Agaram, Shaurya Dewan, Rahul Sajnani, Adrien Poulenard, Madhava Krishna, Srinath Sridhar

Coordinate-based implicit neural networks, or neural fields, have emerged as useful representations of shape and appearance in 3D computer vision. Despite these advances, however, it remains challenging to build neural fields for categories of objects without datasets like ShapeNet that provide canonicalized object instances consistently aligned in 3D position and orientation (pose). We present Canonical Field Network (CaFi-Net), a self-supervised method to canonicalize the 3D pose of instances from an object category represented as neural fields, specifically neural radiance fields (NeRFs). CaFi-Net learns directly from continuous and noisy radiance fields using a Siamese network architecture designed to extract equivariant field features for category-level canonicalization. During inference, our method takes pre-trained neural radiance fields of novel object instances at arbitrary 3D poses and estimates a canonical field with consistent 3D pose across the entire category. Extensive experiments on a new dataset of 1300 NeRF models across 13 object categories show that our method matches or exceeds the performance of 3D point cloud-based methods.
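
The Siamese self-supervision idea can be sketched as follows: two randomly rotated copies of the same field are encoded, and the predicted canonicalizing rotations must agree once the known random rotations are undone. This is a heavily simplified illustration, not CaFi-Net itself; the plain MLP encoder, the z-axis-only rotations, and all names (FieldEncoder, rot_z) are assumptions.

```python
# Minimal sketch of Siamese consistency for self-supervised canonicalization
# of a sampled radiance field.
import math
import torch
import torch.nn as nn


def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])


class FieldEncoder(nn.Module):
    """Maps density samples at grid points to a 3x3 'canonicalizing' matrix."""
    def __init__(self, n_pts=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_pts * 4, 256), nn.ReLU(),
                                 nn.Linear(256, 9))

    def forward(self, pts, density):               # (N, 3), (N, 1)
        x = torch.cat([pts, density], dim=-1).flatten()
        return self.net(x).view(3, 3)


encoder = FieldEncoder()
pts = torch.rand(512, 3) - 0.5                     # query grid
density = torch.rand(512, 1)                       # sampled from a pretrained NeRF
R1, R2 = rot_z(1.1), rot_z(2.3)                    # two known random rotations
C1 = encoder(pts @ R1.T, density)                  # predicted canonicalizers
C2 = encoder(pts @ R2.T, density)
# Both branches should map their rotated inputs to the same canonical frame.
loss = ((C1 @ R1 - C2 @ R2) ** 2).mean()
loss.backward()
```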


TextCraft: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Text

Nov 04, 2022
Aditya Sanghi, Rao Fu, Vivian Liu, Karl Willis, Hooman Shayani, Amir Hosein Khasahmadi, Srinath Sridhar, Daniel Ritchie

Language is one of the primary means by which we describe the 3D world around us. While rapid progress has been made in text-to-2D-image synthesis, similar progress in text-to-3D-shape synthesis has been hindered by the lack of paired (text, shape) data. Moreover, extant methods for text-to-shape generation have limited shape diversity and fidelity. We introduce TextCraft, a method that addresses these limitations by producing high-fidelity and diverse 3D shapes without requiring (text, shape) pairs for training. TextCraft achieves this by leveraging CLIP and a multi-resolution approach: it first generates in a low-dimensional latent space and then upscales to a higher resolution, improving the fidelity of the generated shape. To improve shape diversity, we use a discrete latent space modelled with a bidirectional transformer conditioned on the interchangeable image-text embedding space induced by CLIP. Moreover, we present a novel variant of classifier-free guidance, which further improves the accuracy-diversity trade-off. Finally, we perform extensive experiments demonstrating that TextCraft outperforms state-of-the-art baselines.
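
Classifier-free guidance over discrete shape-token logits can be sketched as below. This shows the standard formulation, not TextCraft's novel variant; TokenPredictor, the token counts, and the guidance scale are illustrative assumptions, and the CLIP embedding is mocked with random values.

```python
# Minimal sketch: classifier-free guidance mixing conditional and
# unconditional logits over a discrete shape-token vocabulary.
import torch
import torch.nn as nn


class TokenPredictor(nn.Module):
    """Predicts logits over shape tokens, optionally conditioned on a CLIP
    embedding (None selects the learned unconditional branch)."""
    def __init__(self, vocab=512, clip_dim=512, hidden=256, n_tokens=64):
        super().__init__()
        self.cond = nn.Linear(clip_dim, hidden)
        self.null = nn.Parameter(torch.zeros(hidden))   # learned "no text" token
        self.out = nn.Linear(hidden, n_tokens * vocab)
        self.n_tokens, self.vocab = n_tokens, vocab

    def forward(self, clip_emb=None):
        h = self.null if clip_emb is None else self.cond(clip_emb)
        return self.out(h).view(self.n_tokens, self.vocab)


model = TokenPredictor()
clip_emb = torch.randn(512)                 # stand-in for a CLIP text embedding
guidance_scale = 3.0
logits_cond = model(clip_emb)
logits_uncond = model(None)
# Classifier-free guidance: push predictions toward the conditional branch.
logits = logits_uncond + guidance_scale * (logits_cond - logits_uncond)
tokens = logits.argmax(dim=-1)              # one greedy step over shape tokens
```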


ShapeCrafter: A Recursive Text-Conditioned 3D Shape Generation Model

Jul 19, 2022
Rao Fu, Xiao Zhan, Yiwen Chen, Daniel Ritchie, Srinath Sridhar

We present ShapeCrafter, a neural network for recursive text-conditioned 3D shape generation. Existing methods for generating text-conditioned 3D shapes consume an entire text prompt to generate a 3D shape in a single step. However, humans tend to describe shapes recursively: we may start with an initial description and progressively add details based on intermediate results. To capture this recursive process, we introduce a method to generate a 3D shape distribution, conditioned on an initial phrase, that gradually evolves as more phrases are added. Since existing datasets are insufficient for training this approach, we present Text2Shape++, a large dataset of 369K shape-text pairs that supports recursive shape generation. To capture local details that are often used to refine shape descriptions, we build on vector-quantized deep implicit functions that generate a distribution of high-quality shapes. Results show that our method can generate shapes consistent with text descriptions, and that shapes evolve gradually as more phrases are added. Our method supports shape editing and extrapolation, and can enable new applications in human-machine collaboration for creative design.
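
The recursive conditioning loop can be sketched as a running state that each new phrase refines, decoded at every step to a distribution over vector-quantized shape tokens. This is a minimal PyTorch illustration, not the authors' architecture; the GRU update, the mocked phrase embeddings, and all names (RecursiveShapeState) are assumptions.

```python
# Minimal sketch: a shape-distribution state that evolves as phrases are added.
import torch
import torch.nn as nn


class RecursiveShapeState(nn.Module):
    def __init__(self, text_dim=256, state_dim=512, n_tokens=64, vocab=512):
        super().__init__()
        self.update = nn.GRUCell(text_dim, state_dim)
        self.decode = nn.Linear(state_dim, n_tokens * vocab)
        self.n_tokens, self.vocab = n_tokens, vocab

    def forward(self, phrase_embs):                   # list of (text_dim,) tensors
        state = torch.zeros(1, self.update.hidden_size)
        per_step = []
        for emb in phrase_embs:                       # refine after every phrase
            state = self.update(emb.unsqueeze(0), state)
            logits = self.decode(state).view(self.n_tokens, self.vocab)
            per_step.append(logits)                   # a shape distribution per step
        return per_step


model = RecursiveShapeState()
phrases = [torch.randn(256) for _ in range(3)]        # "a chair", "with armrests", ...
shape_dists = model(phrases)                          # evolves as phrases are added
```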


Unsupervised Kinematic Motion Detection for Part-segmented 3D Shape Collections

Jun 17, 2022
Xianghao Xu, Yifan Ruan, Srinath Sridhar, Daniel Ritchie

3D models of manufactured objects are important for populating virtual worlds and for synthetic data generation for vision and robotics. To be most useful, such objects should be articulated: their parts should move when interacted with. While articulated object datasets exist, creating them is labor-intensive. Learning-based prediction of part motions can help, but all existing methods require annotated training data. In this paper, we present an unsupervised approach for discovering articulated motions in a part-segmented 3D shape collection. Our approach is based on a concept we call category closure: any valid articulation of an object's parts should keep the object in the same semantic category (e.g. a chair stays a chair). We operationalize this concept with an algorithm that optimizes a shape's part motion parameters such that it can transform into other shapes in the collection. We evaluate our approach by using it to re-discover part motions from the PartNet-Mobility dataset. For almost all shape categories, our method's predicted motion parameters have low error with respect to ground truth annotations, outperforming two supervised motion prediction methods.
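
The category-closure optimization can be sketched as below: a part's motion parameter is optimized so that the articulated shape moves toward another shape from the same collection. This is an illustrative outline, not the authors' algorithm; the symmetric Chamfer stand-in, the z-axis hinge, the random point clouds, and all names are assumptions.

```python
# Minimal sketch: optimize a revolute-joint angle so the articulated shape
# stays close to (can transform into) another shape in the category.
import torch


def rot_z(theta):
    c, s = torch.cos(theta), torch.sin(theta)
    row0 = torch.stack([c, -s, torch.zeros(())])
    row1 = torch.stack([s, c, torch.zeros(())])
    row2 = torch.tensor([0., 0., 1.])
    return torch.stack([row0, row1, row2])


def chamfer(a, b):                                   # symmetric Chamfer distance
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()


base = torch.rand(256, 3)                            # static part of the shape
part = torch.rand(256, 3)                            # movable part (e.g. a lid)
pivot = part.mean(dim=0)
target = torch.rand(512, 3)                          # another shape in the category

theta = torch.zeros((), requires_grad=True)          # joint angle to discover
opt = torch.optim.Adam([theta], lr=0.05)
for _ in range(200):
    moved = (part - pivot) @ rot_z(theta).T + pivot  # articulate about the pivot
    loss = chamfer(torch.cat([base, moved]), target) # stay within the category
    opt.zero_grad(); loss.backward(); opt.step()
print(f"discovered angle: {theta.item():.2f} rad")
```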

* SIGGRAPH 2022 