Srinath Sridhar

GeoDiffuser: Geometry-Based Image Editing with Diffusion Models

Apr 22, 2024

Rahul Sajnani, Jeroen Vanbaar, Jie Min, Kapil Katyal, Srinath Sridhar

Abstract: The success of image generative models has enabled us to build methods that can edit images based on text or other user input. However, these methods are bespoke, imprecise, require additional information, or are limited to only 2D image edits. We present GeoDiffuser, a zero-shot optimization-based method that unifies common 2D and 3D image-based object editing capabilities into a single method. Our key insight is to view image editing operations as geometric transformations. We show that these transformations can be directly incorporated into the attention layers in diffusion models to implicitly perform editing operations. Our training-free optimization method uses an objective function that seeks to preserve object style but generate plausible images, for instance with accurate lighting and shadows. It also inpaints disoccluded parts of the image where the object was originally located. Given a natural image and user input, we segment the foreground object using SAM and estimate a corresponding transform which is used by our optimization approach for editing. GeoDiffuser can perform common 2D and 3D edits like object translation, 3D rotation, and removal. We present quantitative results, including a perceptual study, that show how our approach is better than existing methods. Visit https://ivl.cs.brown.edu/research/geodiffuser.html for more information.
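
To make the "edits as geometric transformations" idea concrete, the following is a minimal sketch of a 2D object translation written as a homogeneous transform applied to the pixel coordinates of an object mask. It is not GeoDiffuser's implementation: the abstract says the transforms are incorporated into the attention layers of a diffusion model, which this sketch does not attempt. The function names and the sample coordinates are hypothetical.

```python
# Illustrative only: represents an edit (translation) as a 3x3 homogeneous
# matrix and applies it to pixel coordinates. All names are hypothetical.

def translation_matrix(dx, dy):
    """Return a 3x3 homogeneous matrix that shifts points by (dx, dy)."""
    return [
        [1.0, 0.0, float(dx)],
        [0.0, 1.0, float(dy)],
        [0.0, 0.0, 1.0],
    ]

def apply_transform(matrix, point):
    """Apply a 3x3 homogeneous transform to a 2D point (x, y)."""
    x, y = point
    vec = (float(x), float(y), 1.0)
    hx, hy, hw = (sum(m * v for m, v in zip(row, vec)) for row in matrix)
    # Divide by the homogeneous coordinate so projective transforms also work.
    return (hx / hw, hy / hw)

# Hypothetical pixel coordinates belonging to a segmented foreground object.
mask_pixels = [(10, 20), (11, 20), (10, 21)]

# Move the object 5 pixels right and 3 pixels up (image y axis points down).
edit = translation_matrix(5, -3)
moved_pixels = [apply_transform(edit, p) for p in mask_pixels]
print(moved_pixels)
```

The same `apply_transform` helper accepts any 3x3 matrix, so rotations or scalings could be expressed the same way; a 3D edit would need depth and camera parameters that are outside the scope of this sketch.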

GHNeRF: Learning Generalizable Human Features with Efficient Neural Radiance Fields

Apr 09, 2024

Arnab Dey, Di Yang, Rohith Agaram, Antitza Dantcheva, Andrew I. Comport, Srinath Sridhar, Jean Martinet

Abstract: Recent advances in Neural Radiance Fields (NeRF) have demonstrated promising results in 3D scene representations, including 3D human representations. However, these representations often lack information on the underlying human pose and structure, which is crucial for AR/VR applications and games. In this paper, we introduce a novel approach, termed GHNeRF, designed to address these limitations by learning 2D/3D joint locations of human subjects with NeRF representation. GHNeRF uses a pre-trained 2D encoder streamlined to extract essential human features from 2D images, which are then incorporated into the NeRF framework in order to encode human biomechanic features. This allows our network to simultaneously learn biomechanic features, such as joint locations, along with human geometry and texture. To assess the effectiveness of our method, we conduct a comprehensive comparison with state-of-the-art human NeRF techniques and joint estimation algorithms. Our results show that GHNeRF can achieve state-of-the-art results in near real-time.

Constrained 6-DoF Grasp Generation on Complex Shapes for Improved Dual-Arm Manipulation

Apr 06, 2024

Gaurav Singh, Sanket Kalwar, Md Faizal Karim, Bipasha Sen, Nagamanikandan Govindan, Srinath Sridhar, K Madhava Krishna

Abstract: Efficiently generating grasp poses tailored to specific regions of an object is vital for various robotic manipulation tasks, especially in a dual-arm setup. This scenario presents a significant challenge due to the complex geometries involved, requiring a deep understanding of the local geometry to generate grasps efficiently on the specified constrained regions. Existing methods only explore settings involving table-top/small objects and require augmented datasets to train, limiting their performance on complex objects. We propose CGDF: Constrained Grasp Diffusion Fields, a diffusion-based grasp generative model that generalizes to objects with arbitrary geometries and generates dense grasps on the target regions. CGDF uses a part-guided diffusion approach that enables it to achieve high sample efficiency in constrained grasping without explicitly training on massive constraint-augmented datasets. We provide qualitative and quantitative comparisons using analytical metrics and in simulation, in both unconstrained and constrained settings, to show that our method can generalize to generate stable grasps on complex objects, which is especially useful for dual-arm manipulation settings, while existing methods struggle to do so.

* Project Page: https://constrained-grasp-diffusion.github.io/

AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes

Dec 11, 2023

Zehao Wen, Zichen Liu, Srinath Sridhar, Rao Fu

Abstract: We introduce AnyHome, a framework that translates open-vocabulary descriptions, ranging from simple labels to elaborate paragraphs, into well-structured and textured 3D indoor scenes at house scale. Inspired by cognition theories, AnyHome employs an amodal structured representation to capture 3D spatial cues from textual narratives and then uses egocentric inpainting to enrich these scenes. To this end, we begin by using specially designed template prompts for Large Language Models (LLMs), which enable precise control over the textual input. We then utilize intermediate representations to maintain the spatial structure's consistency, ensuring that the 3D scenes align closely with the textual description. Then, we apply a Score Distillation Sampling process to refine the placement of objects. Lastly, an egocentric inpainting process is incorporated to enhance the realism and appearance of the scenes. AnyHome stands out due to its hierarchical structured representation combined with the versatility of open-vocabulary text interpretation. This allows for extensive customization of indoor scenes at various levels of granularity. We demonstrate that AnyHome can reliably generate a range of diverse indoor scenes, characterized by their detailed spatial structures and textures, all corresponding to the free-form textual inputs.

MANUS: Markerless Hand-Object Grasp Capture using Articulated 3D Gaussians

Dec 04, 2023

Chandradeep Pokhariya, Ishaan N Shah, Angela Xing, Zekun Li, Kefan Chen, Avinash Sharma, Srinath Sridhar

Abstract: Understanding how we grasp objects with our hands has important applications in areas like robotics and mixed reality. However, this challenging problem requires accurate modeling of the contact between hands and objects. To capture grasps, existing methods use skeletons, meshes, or parametric models that can cause misalignments resulting in inaccurate contacts. We present MANUS, a method for Markerless Hand-Object Grasp Capture using Articulated 3D Gaussians. We build a novel articulated 3D Gaussians representation that extends 3D Gaussian splatting for high-fidelity representation of articulating hands. Since our representation uses Gaussian primitives, it enables us to efficiently and accurately estimate contacts between the hand and the object. For the most accurate results, our method requires tens of camera views that current datasets do not provide. We therefore build MANUS-Grasps, a new dataset that contains hand-object grasps viewed from 53 cameras across 30+ scenes and 3 subjects, comprising over 7M frames. In addition to extensive qualitative results, we also show that our method outperforms others on a quantitative contact evaluation method that uses paint transfer from the object to the hand.

Strata-NeRF : Neural Radiance Fields for Stratified Scenes

Aug 20, 2023

Ankit Dhiman, Srinath R, Harsh Rangwani, Rishubh Parihar, Lokesh R Boregowda, Srinath Sridhar, R Venkatesh Babu

Abstract: Neural Radiance Field (NeRF) approaches learn the underlying 3D representation of a scene and generate photo-realistic novel views with high fidelity. However, most proposed settings concentrate on modelling a single object or a single level of a scene. In the real world, we may capture a scene at multiple levels, resulting in a layered capture. For example, tourists usually capture a monument's exterior structure before capturing the inner structure. Modelling such scenes in 3D with seamless switching between levels can drastically improve immersive experiences, yet most existing techniques struggle to model them. We propose Strata-NeRF, a single neural radiance field that implicitly captures a scene with multiple levels. Strata-NeRF achieves this by conditioning the NeRFs on Vector Quantized (VQ) latent representations which allow sudden changes in scene structure. We evaluate the effectiveness of our approach on a multi-layered synthetic dataset comprising diverse scenes and then further validate its generalization on the real-world RealEstate10K dataset. We find that Strata-NeRF effectively captures stratified scenes, minimizes artifacts, and synthesizes high-fidelity views compared to existing approaches.

* ICCV 2023, Project Page: https://ankitatiisc.github.io/Strata-NeRF/
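
The vector quantization mentioned in the Strata-NeRF abstract can be illustrated with a generic nearest-codebook lookup. This is a minimal sketch of standard VQ, not the paper's architecture: the codebook values, the latent vectors, and the function names are all hypothetical, and a real system would learn the codebook rather than hard-code it.

```python
# Generic vector quantization: snap a continuous latent to its nearest
# codebook entry. Values and names are hypothetical, for illustration only.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def vector_quantize(latent, codebook):
    """Return (index, entry) of the codebook vector closest to `latent`."""
    index = min(
        range(len(codebook)),
        key=lambda k: squared_distance(latent, codebook[k]),
    )
    return index, codebook[index]

# A toy codebook with three discrete codes, e.g. one per scene level.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]

# Continuous latents that drift smoothly still map to discrete codes.
latents = [(0.1, -0.2), (0.9, 1.2), (3.5, 4.1)]
indices = [vector_quantize(z, codebook)[0] for z in latents]
print(indices)
```

Because the output is always one of a finite set of codes, a small change in the input latent can switch the conditioning abruptly, which matches the abstract's remark that VQ latents "allow sudden changes in scene structure".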

DiVA-360: The Dynamic Visuo-Audio Dataset for Immersive Neural Fields

Jul 31, 2023

Cheng-You Lu, Peisen Zhou, Angela Xing, Chandradeep Pokhariya, Arnab Dey, Ishaan Shah, Rugved Mavidipalli, Dylan Hu, Andrew Comport, Kefan Chen (+1 more)

Abstract: Advances in neural fields are enabling high-fidelity capture of the shape and appearance of static and dynamic scenes. However, their capabilities lag behind those offered by representations such as pixels or meshes due to algorithmic challenges and the lack of large-scale real-world datasets. We address the dataset limitation with DiVA-360, a real-world 360 dynamic visual-audio dataset with synchronized multimodal visual, audio, and textual information about table-scale scenes. It contains 46 dynamic scenes, 30 static scenes, and 95 static objects spanning 11 categories, captured with a new hardware system that uses 53 RGB cameras at 120 FPS and 6 microphones, for a total of 8.6M image frames and 1360 s of dynamic data. We provide detailed text descriptions for all scenes, foreground-background segmentation masks, category-specific 3D pose alignment for static objects, as well as metrics for comparison. Our data, hardware and software, and code are available at https://diva360.github.io/.
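
The figures quoted in the DiVA-360 abstract can be cross-checked with simple arithmetic. The sketch below assumes (this is not stated in the abstract) that all 53 cameras record at the full 120 FPS for the entire 1360 s of dynamic data; under that assumption the frame count lands close to the reported 8.6M.

```python
# Consistency check on the numbers quoted in the abstract.
# Assumption (not from the source): every camera records every second
# of dynamic data at the full frame rate.

CAMERAS = 53            # RGB cameras, from the abstract
FPS = 120               # frames per second, from the abstract
DYNAMIC_SECONDS = 1360  # seconds of dynamic data, from the abstract

def dynamic_frame_count(cameras, fps, seconds):
    """Total frames if every camera runs at `fps` for `seconds`."""
    return cameras * fps * seconds

total_frames = dynamic_frame_count(CAMERAS, FPS, DYNAMIC_SECONDS)
print(f"{total_frames:,} frames (~{total_frames / 1e6:.2f}M)")
# prints "8,649,600 frames (~8.65M)"
```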

HyP-NeRF: Learning Improved NeRF Priors using a HyperNetwork

Jun 09, 2023

Bipasha Sen, Gaurav Singh, Aditya Agarwal, Rohith Agaram, K Madhava Krishna, Srinath Sridhar

Abstract: Neural Radiance Fields (NeRF) have become an increasingly popular representation to capture high-quality appearance and shape of scenes and objects. However, learning generalizable NeRF priors over categories of scenes or objects has been challenging due to the high dimensionality of network weight space. To address the limitations of existing work on generalization and multi-view consistency, and to improve quality, we propose HyP-NeRF, a latent conditioning method for learning generalizable category-level NeRF priors using hypernetworks. Rather than using hypernetworks to estimate only the weights of a NeRF, we estimate both the weights and the multi-resolution hash encodings, resulting in significant quality gains. To improve quality even further, we incorporate a denoise and finetune strategy that denoises images rendered from NeRFs estimated by the hypernetwork and finetunes it while retaining multiview consistency. These improvements enable us to use HyP-NeRF as a generalizable prior for multiple downstream tasks including NeRF reconstruction from single-view or cluttered scenes and text-to-NeRF. We provide qualitative comparisons and evaluate HyP-NeRF on three tasks: generalization, compression, and retrieval, demonstrating our state-of-the-art results.

Semantic Attention Flow Fields for Dynamic Scene Decomposition

Mar 02, 2023

Yiqing Liang, Eliot Laidlaw, Alexander Meyerowitz, Srinath Sridhar, James Tompkin

Abstract: We present SAFF: a dynamic neural volume reconstruction of a casual monocular video that consists of time-varying color, density, scene flow, semantics, and attention information. The semantics and attention let us identify salient foreground objects separately from the background in arbitrary spacetime views. We add two network heads to represent the semantic and attention information. For optimization, we design semantic attention pyramids from DINO-ViT outputs that trade detail with whole-image context. After optimization, we perform a saliency-aware clustering to decompose the scene. For evaluation on real-world dynamic scene decomposition across spacetime, we annotate object masks in the NVIDIA Dynamic Scene Dataset. We demonstrate that SAFF can decompose dynamic scenes without affecting RGB or depth reconstruction quality, that volume-integrated SAFF outperforms 2D baselines, and that SAFF improves foreground/background segmentation over recent static/dynamic split methods. Project Webpage: https://visual.cs.brown.edu/saff

LEGO-Net: Learning Regular Rearrangements of Objects in Rooms

Jan 23, 2023

Qiuhong Anna Wei, Sijie Ding, Jeong Joon Park, Rahul Sajnani, Adrien Poulenard, Srinath Sridhar, Leonidas Guibas

Abstract: Humans universally dislike the task of cleaning up a messy room. If machines were to help us with this task, they must understand human criteria for regular arrangements, such as several types of symmetry, co-linearity or co-circularity, spacing uniformity in linear or circular patterns, and further inter-object relationships that relate to style and functionality. Previous approaches for this task relied on human input to explicitly specify goal state, or synthesized scenes from scratch -- but such methods do not address the rearrangement of existing messy scenes without providing a goal state. In this paper, we present LEGO-Net, a data-driven transformer-based iterative method for learning regular rearrangement of objects in messy rooms. LEGO-Net is partly inspired by diffusion models -- it starts with an initial messy state and iteratively "de-noises" the position and orientation of objects to a regular state while reducing the distance traveled. Given randomly perturbed object positions and orientations in an existing dataset of professionally-arranged scenes, our method is trained to recover a regular re-arrangement. Results demonstrate that our method is able to reliably rearrange room scenes and outperform other methods. We additionally propose a metric for evaluating regularity in room arrangements using number-theoretic machinery.

* Project page: https://ivl.cs.brown.edu/projects/lego-net
