Sifei Liu

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model

Jun 03, 2024

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, Sifei Liu

Abstract: Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (2) a flexible plugin module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, we propose SpatialRGBT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments, for evaluating 3D spatial cognition in VLMs. Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks. Code, dataset, and benchmark will be released at https://www.anjiecheng.me/SpatialRGPT

* Project Page: https://www.anjiecheng.me/SpatialRGPT

Compositional Text-to-Image Generation with Dense Blob Representations

May 14, 2024

Weili Nie, Sifei Liu, Morteza Mardani, Chao Liu, Benjamin Eckart, Arash Vahdat

Abstract: Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives, denoted as dense blob representations, that contain fine-grained details of the scene while being modular, human-interpretable, and easy to construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. In particular, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks. Project page: https://blobgen-2d.github.io.

* ICML 2024

HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

Mar 18, 2024

Mengqi Zhang, Yang Fu, Zheng Ding, Sifei Liu, Zhuowen Tu, Xiaolong Wang

Abstract: 3D hand-object interaction data is scarce due to the hardware constraints in scaling up the data collection process. In this paper, we propose HOIDiffusion for generating realistic and diverse 3D hand-object interaction data. Our model is a conditional diffusion model that takes both the 3D hand-object geometric structure and text description as inputs for image synthesis. This offers a more controllable and realistic synthesis as we can specify the structure and style inputs in a disentangled manner. HOIDiffusion is trained by leveraging a diffusion model pre-trained on large-scale natural images and a few 3D human demonstrations. Beyond controllable image synthesis, we adopt the generated 3D data for learning 6D object pose estimation and show its effectiveness in improving perception systems. Project page: https://mq-zhang1.github.io/HOIDiffusion

* Project page: https://mq-zhang1.github.io/HOIDiffusion

RegionGPT: Towards Region Understanding Vision Language Model

Mar 04, 2024

Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, Sifei Liu

Abstract: Vision language models (VLMs) have experienced rapid advancements through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding due to the limited spatial awareness of the vision encoder and the use of coarse-grained training data that lacks detailed, region-specific captions. To address this, we introduce RegionGPT (RGPT for short), a novel framework designed for complex region-level captioning and understanding. RGPT enhances the spatial awareness of regional representation with simple yet effective modifications to existing visual encoders in VLMs. We further improve performance on tasks requiring a specific output scope by integrating task-guided instruction prompts during both training and inference phases, while maintaining the model's versatility for general-purpose tasks. Additionally, we develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions. We demonstrate that a universal RGPT model can be effectively applied to, and significantly enhances performance across, a range of region-level tasks, including but not limited to complex region descriptions, reasoning, object classification, and referring expression comprehension.

* Accepted by CVPR 2024

RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos

Jan 24, 2024

Hongchi Xia, Yang Fu, Sifei Liu, Xiaolong Wang

Abstract: We introduce a new RGB-D object dataset captured in the wild called WildRGB-D. Unlike most existing real-world object-centric datasets, which only come with RGB capturing, the direct capture of the depth channel allows better 3D annotations and broader downstream applications. WildRGB-D comprises large-scale category-level RGB-D object videos, which are taken using an iPhone to go around the objects in 360 degrees. It contains around 8500 recorded objects and nearly 20000 RGB-D videos across 46 common object categories. These videos are taken with diverse cluttered backgrounds with three setups to cover as many real-world scenarios as possible: (i) a single object in one video; (ii) multiple objects in one video; and (iii) an object with a static hand in one video. The dataset is annotated with object masks, real-world scale camera poses, and reconstructed aggregated point clouds from RGB-D videos. We benchmark four tasks with WildRGB-D including novel view synthesis, camera pose estimation, object 6D pose estimation, and object surface reconstruction. Our experiments show that the large-scale capture of RGB-D objects provides a large potential to advance 3D object learning. Our project page is https://wildrgbd.github.io/.

* Our project page: https://wildrgbd.github.io/

AGG: Amortized Generative 3D Gaussians for Single Image to 3D

Jan 08, 2024

Dejia Xu, Ye Yuan, Morteza Mardani, Sifei Liu, Jiaming Song, Zhangyang Wang, Arash Vahdat

Abstract: Given the growing need for automatic 3D content creation pipelines, various 3D representations have been studied to generate 3D objects from a single image. Due to their superior rendering efficiency, 3D Gaussian splatting-based models have recently excelled in both 3D reconstruction and generation. 3D Gaussian splatting approaches for image-to-3D generation are often optimization-based, requiring many computationally expensive score-distillation steps. To overcome these challenges, we introduce an Amortized Generative 3D Gaussian framework (AGG) that instantly produces 3D Gaussians from a single image, eliminating the need for per-instance optimization. Utilizing an intermediate hybrid representation, AGG decomposes the generation of 3D Gaussian locations and other appearance attributes for joint optimization. Moreover, we propose a cascaded pipeline that first generates a coarse representation of the 3D data and later upsamples it with a 3D Gaussian super-resolution module. Our method is evaluated against existing optimization-based 3D Gaussian frameworks and sampling-based pipelines utilizing other 3D representations, where AGG showcases competitive generation abilities both qualitatively and quantitatively while being several orders of magnitude faster. Project page: https://ir1d.github.io/AGG/

* Project page: https://ir1d.github.io/AGG/

COLMAP-Free 3D Gaussian Splatting

Dec 12, 2023

Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros, Xiaolong Wang

Abstract: While neural rendering has led to impressive advances in scene reconstruction and novel view synthesis, it relies heavily on accurately pre-computed camera poses. To relax this constraint, multiple efforts have been made to train Neural Radiance Fields (NeRFs) without pre-processed camera poses. However, the implicit representations of NeRFs make it harder to optimize the 3D structure and camera poses at the same time. On the other hand, the recently proposed 3D Gaussian Splatting provides new opportunities given its explicit point cloud representations. This paper leverages both the explicit geometric representation and the continuity of the input video stream to perform novel view synthesis without any SfM preprocessing. We process the input frames in a sequential manner and progressively grow the set of 3D Gaussians by taking one input frame at a time, without the need to pre-compute the camera poses. Our method significantly improves over previous approaches in view synthesis and camera pose estimation under large motion changes. Our project page is https://oasisyang.github.io/colmap-free-3dgs

* Project Page: https://oasisyang.github.io/colmap-free-3dgs

A Unified Approach for Text- and Image-guided 4D Scene Generation

Nov 29, 2023

Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Karsten Kreis, Otmar Hilliges, Shalini De Mello

Abstract: Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.

* Project page: https://research.nvidia.com/labs/nxp/dream-in-4d/

3D Reconstruction with Generalizable Neural Fields using Scene Priors

Sep 29, 2023

Yang Fu, Shalini De Mello, Xueting Li, Amey Kulkarni, Jan Kautz, Xiaolong Wang, Sifei Liu

Abstract: High-fidelity 3D scene reconstruction has been substantially advanced by recent progress in neural fields. However, most existing methods train a separate network from scratch for each individual scene. This approach is not scalable, is inefficient, and fails to yield good results given limited views. While learning-based multi-view stereo methods alleviate this issue to some extent, their multi-view setting makes them less flexible to scale up and to apply broadly. Instead, we introduce training generalizable Neural Fields incorporating scene Priors (NFPs). The NFP network maps any single-view RGB-D image into signed distance and radiance values. A complete scene can be reconstructed by merging individual frames in the volumetric space without a fusion module, which provides better flexibility. The scene priors can be trained on large-scale datasets, allowing for fast adaptation to the reconstruction of a new scene with fewer views. NFP not only demonstrates SOTA scene reconstruction performance and efficiency, but it also supports single-image novel-view synthesis, which is underexplored in neural fields. More qualitative results are available at: https://oasisyang.github.io/neural-prior

* Project Page: https://oasisyang.github.io/neural-prior

Generalizable One-shot Neural Head Avatar

Jun 14, 2023

Xueting Li, Shalini De Mello, Sifei Liu, Koki Nagano, Umar Iqbal, Jan Kautz

Abstract: We present a method that reconstructs and animates a 3D head avatar from a single-view portrait image. Existing methods either involve time-consuming optimization for a specific person with multiple images, or they struggle to synthesize intricate appearance details beyond the facial region. To address these limitations, we propose a framework that not only generalizes to unseen identities based on a single-view image without requiring person-specific optimization, but also captures characteristic details within and beyond the face area (e.g., hairstyle and accessories). At the core of our method are three branches that produce three tri-planes representing the coarse 3D geometry, the detailed appearance of a source image, and the expression of a target image. By applying volumetric rendering to the combination of the three tri-planes followed by a super-resolution module, our method yields a high-fidelity image of the desired identity, expression and pose. Once trained, our model enables efficient 3D head avatar reconstruction and animation via a single forward pass through a network. Experiments show that the proposed approach generalizes well to unseen validation datasets, surpassing SOTA baseline methods by a large margin on head avatar reconstruction and animation.
