Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

CJ Taylor

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Apr 28, 2024

Xiaolong Li, Jiawei Mo, Ying Wang, Chethan Parameshwara, Xiaohan Fei, Ashwin Swaminathan, CJ Taylor, Zhuowen Tu, Paolo Favaro, Stefano Soatto

Figure 1 for Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Figure 2 for Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Figure 3 for Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Figure 4 for Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

Abstract:In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts, and may often entirely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the necessity to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D from the same text prompt.

* 9 pages, 10 figures

Via

Access Paper or Ask Questions

A Quantitative Evaluation of Score Distillation Sampling Based Text-to-3D

Feb 29, 2024

Xiaohan Fei, Chethan Parameshwara, Jiawei Mo, Xiaolong Li, Ashwin Swaminathan, CJ Taylor, Paolo Favaro, Stefano Soatto

Abstract:The development of generative models that create 3D content from a text prompt has made considerable strides thanks to the use of the score distillation sampling (SDS) method on pre-trained diffusion models for image generation. However, the SDS method is also the source of several artifacts, such as the Janus problem, the misalignment between the text prompt and the generated 3D model, and 3D model inaccuracies. While existing methods heavily rely on the qualitative assessment of these artifacts through visual inspection of a limited set of samples, in this work we propose more objective quantitative evaluation metrics, which we cross-validate via human ratings, and show analysis of the failure cases of the SDS technique. We demonstrate the effectiveness of this analysis by designing a novel computationally efficient baseline model that achieves state-of-the-art performance on the proposed metrics while addressing all the above-mentioned artifacts.

Via

Access Paper or Ask Questions

Towards Visual Foundational Models of Physical Scenes

Jun 06, 2023

Chethan Parameshwara, Alessandro Achille, Matthew Trager, Xiaolong Li, Jiawei Mo, Ashwin Swaminathan, CJ Taylor, Dheera Venkatraman, Xiaohan Fei, Stefano Soatto

Figure 1 for Towards Visual Foundational Models of Physical Scenes

Figure 2 for Towards Visual Foundational Models of Physical Scenes

Figure 3 for Towards Visual Foundational Models of Physical Scenes

Figure 4 for Towards Visual Foundational Models of Physical Scenes

Abstract:We describe a first step towards learning general-purpose visual representations of physical scenes using only image prediction as a training criterion. To do so, we first define "physical scene" and show that, even though different agents may maintain different representations of the same scene, the underlying physical scene that can be inferred is unique. Then, we show that NeRFs cannot represent the physical scene, as they lack extrapolation mechanisms. Those, however, could be provided by Diffusion Models, at least in theory. To test this hypothesis empirically, NeRFs can be combined with Diffusion Models, a process we refer to as NeRF Diffusion, used as unsupervised representations of the physical scene. Our analysis is limited to visual data, without external grounding mechanisms that can be provided by independent sensory modalities.

* TLDR: Physical scenes are equivalence classes of sufficient statistics, and can be inferred uniquely by any agent measuring the same finite data; We formalize and implement an approach to representation learning that overturns "naive realism" in favor of an analytical approach of Russell and Koenderink. NeRFs cannot capture the physical scenes, but combined with Diffusion Models they can

Via

Access Paper or Ask Questions