Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ping Tan

Simon Fraser University

CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner

May 23, 2024

Weiyu Li, Jiarui Liu, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, Xiaoxiao Long

Abstract:We present a novel generative 3D modeling system, coined CraftsMan, which can generate high-fidelity 3D geometries with highly varied shapes, regular mesh topologies, and detailed surfaces, and, notably, allows for refining the geometry in an interactive manner. Despite the significant advancements in 3D generation, existing methods still struggle with lengthy optimization processes, irregular mesh topologies, noisy surfaces, and difficulties in accommodating user edits, consequently impeding their widespread adoption and implementation in 3D modeling software. Our work is inspired by the craftsman, who usually roughs out the holistic figure of the work first and elaborates the surface details subsequently. Specifically, we employ a 3D native diffusion model, which operates on latent space learned from latent set-based 3D representations, to generate coarse geometries with regular mesh topology in seconds. In particular, this process takes as input a text prompt or a reference image and leverages a powerful multi-view (MV) diffusion model to generate multiple views of the coarse geometry, which are fed into our MV-conditioned 3D diffusion model for generating the 3D geometry, significantly improving robustness and generalizability. Following that, a normal-based geometry refiner is used to significantly enhance the surface details. This refinement can be performed automatically, or interactively with user-supplied edits. Extensive experiments demonstrate that our method achieves high efficacy in producing superior-quality 3D assets compared to existing methods. HomePage: https://craftsman3d.github.io/, Code: https://github.com/wyysf-98/CraftsMan

* HomePage: https://craftsman3d.github.io/, Code: https://github.com/wyysf-98/CraftsMan

Via

Access Paper or Ask Questions

Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

May 19, 2024

Peng Li, Yuan Liu, Xiaoxiao Long, Feihu Zhang, Cheng Lin, Mengfei Li, Xingqun Qi, Shanghang Zhang, Wenhan Luo, Ping Tan(+3 more)

Figure 1 for Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

Figure 2 for Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

Figure 3 for Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

Figure 4 for Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

Abstract:In this paper, we introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image. Despite significant advancements in multiview generation, existing methods still suffer from camera prior mismatch, inefficacy, and low resolution, resulting in poor-quality multiview images. Specifically, these methods assume that the input images should comply with a predefined camera type, e.g. a perspective camera with a fixed focal length, leading to distorted shapes when the assumption fails. Moreover, the full-image or dense multiview attention they employ leads to an exponential explosion of computational complexity as image resolution increases, resulting in prohibitively expensive training costs. To bridge the gap between assumption and reality, Era3D first proposes a diffusion-based camera prediction module to estimate the focal length and elevation of the input image, which allows our method to generate images without shape distortions. Furthermore, a simple but efficient attention layer, named row-wise attention, is used to enforce epipolar priors in the multiview diffusion, facilitating efficient cross-view information fusion. Consequently, compared with state-of-the-art methods, Era3D generates high-quality multiview images with up to a 512*512 resolution while reducing computation complexity by 12x times. Comprehensive experiments demonstrate that Era3D can reconstruct high-quality and detailed 3D meshes from diverse single-view input images, significantly outperforming baseline multiview diffusion methods.

Via

Access Paper or Ask Questions

Taming Diffusion Probabilistic Models for Character Control

Apr 23, 2024

Rui Chen, Mingyi Shi, Shaoli Huang, Ping Tan, Taku Komura, Xuelin Chen

Figure 1 for Taming Diffusion Probabilistic Models for Character Control

Figure 2 for Taming Diffusion Probabilistic Models for Character Control

Figure 3 for Taming Diffusion Probabilistic Models for Character Control

Figure 4 for Taming Diffusion Probabilistic Models for Character Control

Abstract:We present a novel character control framework that effectively utilizes motion diffusion probabilistic models to generate high-quality and diverse character animations, responding in real-time to a variety of dynamic user-supplied control signals. At the heart of our method lies a transformer-based Conditional Autoregressive Motion Diffusion Model (CAMDM), which takes as input the character's historical motion and can generate a range of diverse potential future motions conditioned on high-level, coarse user control. To meet the demands for diversity, controllability, and computational efficiency required by a real-time controller, we incorporate several key algorithmic designs. These include separate condition tokenization, classifier-free guidance on past motion, and heuristic future trajectory extension, all designed to address the challenges associated with taming motion diffusion probabilistic models for character control. As a result, our work represents the first model that enables real-time generation of high-quality, diverse character animations based on user interactive control, supporting animating the character in multiple styles with a single unified model. We evaluate our method on a diverse set of locomotion skills, demonstrating the merits of our method over existing character controllers. Project page and source codes: https://aiganimation.github.io/CAMDM/

* Accepted by SIGGRAPH 2024 (Conference Track). Project page and source codes: https://aiganimation.github.io/CAMDM/

Via

Access Paper or Ask Questions

GenN2N: Generative NeRF2NeRF Translation

Apr 03, 2024

Xiangyue Liu, Han Xue, Kunming Luo, Ping Tan, Li Yi

Figure 1 for GenN2N: Generative NeRF2NeRF Translation

Figure 2 for GenN2N: Generative NeRF2NeRF Translation

Figure 3 for GenN2N: Generative NeRF2NeRF Translation

Figure 4 for GenN2N: Generative NeRF2NeRF Translation

Abstract:We present GenN2N, a unified NeRF-to-NeRF translation framework for various NeRF translation tasks such as text-driven NeRF editing, colorization, super-resolution, inpainting, etc. Unlike previous methods designed for individual translation tasks with task-specific schemes, GenN2N achieves all these NeRF editing tasks by employing a plug-and-play image-to-image translator to perform editing in the 2D domain and lifting 2D edits into the 3D NeRF space. Since the 3D consistency of 2D edits may not be assured, we propose to model the distribution of the underlying 3D edits through a generative model that can cover all possible edited NeRFs. To model the distribution of 3D edited NeRFs from 2D edited images, we carefully design a VAE-GAN that encodes images while decoding NeRFs. The latent space is trained to align with a Gaussian distribution and the NeRFs are supervised through an adversarial loss on its renderings. To ensure the latent code does not depend on 2D viewpoints but truly reflects the 3D edits, we also regularize the latent code through a contrastive learning scheme. Extensive experiments on various editing tasks show GenN2N, as a universal framework, performs as well or better than task-specific specialists while possessing flexible generative power. More results on our project page: https://xiangyueliu.github.io/GenN2N/

* Accepted to CVPR 2024. Project page: https://xiangyueliu.github.io/GenN2N/

Via

Access Paper or Ask Questions

Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes

Apr 02, 2024

Ziqian Bai, Feitong Tan, Sean Fanello, Rohit Pandey, Mingsong Dou, Shichen Liu, Ping Tan, Yinda Zhang

Abstract:3D head avatars built with neural implicit volumetric representations have achieved unprecedented levels of photorealism. However, the computational cost of these methods remains a significant barrier to their widespread adoption, particularly in real-time applications such as virtual reality and teleconferencing. While attempts have been made to develop fast neural rendering approaches for static scenes, these methods cannot be simply employed to support realistic facial expressions, such as in the case of a dynamic facial performance. To address these challenges, we propose a novel fast 3D neural implicit head avatar model that achieves real-time rendering while maintaining fine-grained controllability and high rendering quality. Our key idea lies in the introduction of local hash table blendshapes, which are learned and attached to the vertices of an underlying face parametric model. These per-vertex hash-tables are linearly merged with weights predicted via a CNN, resulting in expression dependent embeddings. Our novel representation enables efficient density and color predictions using a lightweight MLP, which is further accelerated by a hierarchical nearest neighbor search method. Extensive experiments show that our approach runs in real-time while achieving comparable rendering quality to state-of-the-arts and decent results on challenging expressions.

* In CVPR2024. Project page: https://augmentedperception.github.io/monoavatar-plus

Via

Access Paper or Ask Questions

Bilateral Propagation Network for Depth Completion

Apr 01, 2024

Jie Tang, Fei-Peng Tian, Boshi An, Jian Li, Ping Tan

Figure 1 for Bilateral Propagation Network for Depth Completion

Figure 2 for Bilateral Propagation Network for Depth Completion

Figure 3 for Bilateral Propagation Network for Depth Completion

Figure 4 for Bilateral Propagation Network for Depth Completion

Abstract:Depth completion aims to derive a dense depth map from sparse depth measurements with a synchronized color image. Current state-of-the-art (SOTA) methods are predominantly propagation-based, which work as an iterative refinement on the initial estimated dense depth. However, the initial depth estimations mostly result from direct applications of convolutional layers on the sparse depth map. In this paper, we present a Bilateral Propagation Network (BP-Net), that propagates depth at the earliest stage to avoid directly convolving on sparse data. Specifically, our approach propagates the target depth from nearby depth measurements via a non-linear model, whose coefficients are generated through a multi-layer perceptron conditioned on both \emph{radiometric difference} and \emph{spatial distance}. By integrating bilateral propagation with multi-modal fusion and depth refinement in a multi-scale framework, our BP-Net demonstrates outstanding performance on both indoor and outdoor scenes. It achieves SOTA on the NYUv2 dataset and ranks 1st on the KITTI depth completion benchmark at the time of submission. Experimental results not only show the effectiveness of bilateral propagation but also emphasize the significance of early-stage propagation in contrast to the refinement stage. Our code and trained models will be available on the project page.

* Accepted by CVPR 2024

Via

Access Paper or Ask Questions

GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image

Mar 18, 2024

Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, Xiaoxiao Long

Figure 1 for GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image

Figure 2 for GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image

Figure 3 for GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image

Figure 4 for GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image

Abstract:We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes, e.g., depth and normals, from single images. While significant research has already been conducted in this area, the progress has been substantially limited by the low diversity and poor quality of publicly available datasets. As a result, the prior works either are constrained to limited scenarios or suffer from the inability to capture geometric details. In this paper, we demonstrate that generative models, as opposed to traditional discriminative models (e.g., CNNs and Transformers), can effectively address the inherently ill-posed problem. We further show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage. Specifically, we extend the original stable diffusion model to jointly predict depth and normal, allowing mutual information exchange and high consistency between the two representations. More importantly, we propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions. This strategy enables our model to recognize different scene layouts, capturing 3D geometry with remarkable fidelity. GeoWizard sets new benchmarks for zero-shot depth and normal prediction, significantly enhancing many downstream applications such as 3D reconstruction, 2D content creation, and novel viewpoint synthesis.

* Project page: https://fuxiao0719.github.io/projects/geowizard/

Via

Access Paper or Ask Questions

Edge Detectors Can Make Deep Convolutional Neural Networks More Robust

Feb 26, 2024

Jin Ding, Jie-Chao Zhao, Yong-Zhi Sun, Ping Tan, Jia-Wei Wang, Ji-En Ma, You-Tong Fang

Abstract:Deep convolutional neural networks (DCNN for short) are vulnerable to examples with small perturbations. Improving DCNN's robustness is of great significance to the safety-critical applications, such as autonomous driving and industry automation. Inspired by the principal way that human eyes recognize objects, i.e., largely relying on the shape features, this paper first employs the edge detectors as layer kernels and designs a binary edge feature branch (BEFB for short) to learn the binary edge features, which can be easily integrated into any popular backbone. The four edge detectors can learn the horizontal, vertical, positive diagonal, and negative diagonal edge features, respectively, and the branch is stacked by multiple Sobel layers (using edge detectors as kernels) and one threshold layer. The binary edge features learned by the branch, concatenated with the texture features learned by the backbone, are fed into the fully connected layers for classification. We integrate the proposed branch into VGG16 and ResNet34, respectively, and conduct experiments on multiple datasets. Experimental results demonstrate the BEFB is lightweight and has no side effects on training. And the accuracy of the BEFB integrated models is better than the original ones on all datasets when facing FGSM, PGD, and C\&W attacks. Besides, BEFB integrated models equipped with the robustness enhancing techniques can achieve better classification accuracy compared to the original models. The work in this paper for the first time shows it is feasible to enhance the robustness of DCNNs through combining both shape-like features and texture features.

* 26 pages, 18 figures, 7 tables. submitted to Neural Networks, under review

Via

Access Paper or Ask Questions

Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints

Oct 09, 2023

Chuan Fang, Xiaotao Hu, Kunming Luo, Ping Tan

Figure 1 for Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints

Figure 2 for Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints

Figure 3 for Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints

Figure 4 for Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints

Abstract:Text-driven 3D indoor scene generation could be useful for gaming, film industry, and AR/VR applications. However, existing methods cannot faithfully capture the room layout, nor do they allow flexible editing of individual objects in the room. To address these problems, we present Ctrl-Room, which is able to generate convincing 3D rooms with designer-style layouts and high-fidelity textures from just a text prompt. Moreover, Ctrl-Room enables versatile interactive editing operations such as resizing or moving individual furniture items. Our key insight is to separate the modeling of layouts and appearance. %how to model the room that takes into account both scene texture and geometry at the same time. To this end, Our proposed method consists of two stages, a `Layout Generation Stage' and an `Appearance Generation Stage'. The `Layout Generation Stage' trains a text-conditional diffusion model to learn the layout distribution with our holistic scene code parameterization. Next, the `Appearance Generation Stage' employs a fine-tuned ControlNet to produce a vivid panoramic image of the room guided by the 3D scene layout and text prompt. In this way, we achieve a high-quality 3D room with convincing layouts and lively textures. Benefiting from the scene code parameterization, we can easily edit the generated room model through our mask-guided editing module, without expensive editing-specific training. Extensive experiments on the Structured3D dataset demonstrate that our method outperforms existing methods in producing more reasonable, view-consistent, and editable 3D rooms from natural language prompts.

Via

Access Paper or Ask Questions

SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D

Oct 04, 2023

Weiyu Li, Rui Chen, Xuelin Chen, Ping Tan

Abstract:It is inherently ambiguous to lift 2D results from pre-trained diffusion models to a 3D world for text-to-3D generation. 2D diffusion models solely learn view-agnostic priors and thus lack 3D knowledge during the lifting, leading to the multi-view inconsistency problem. We find that this problem primarily stems from geometric inconsistency, and avoiding misplaced geometric structures substantially mitigates the problem in the final outputs. Therefore, we improve the consistency by aligning the 2D geometric priors in diffusion models with well-defined 3D shapes during the lifting, addressing the vast majority of the problem. This is achieved by fine-tuning the 2D diffusion model to be viewpoint-aware and to produce view-specific coordinate maps of canonically oriented 3D objects. In our process, only coarse 3D information is used for aligning. This "coarse" alignment not only resolves the multi-view inconsistency in geometries but also retains the ability in 2D diffusion models to generate detailed and diversified high-quality objects unseen in the 3D datasets. Furthermore, our aligned geometric priors (AGP) are generic and can be seamlessly integrated into various state-of-the-art pipelines, obtaining high generalizability in terms of unseen shapes and visual appearance while greatly alleviating the multi-view inconsistency problem. Our method represents a new state-of-the-art performance with an 85+% consistency rate by human evaluation, while many previous methods are around 30%. Our project page is https://sweetdreamer3d.github.io/

* Project page: https://sweetdreamer3d.github.io/

Via

Access Paper or Ask Questions