Jiale Xu

InstructP2P: Learning to Edit 3D Point Clouds with Text Instructions

Jun 12, 2023
Jiale Xu, Xintao Wang, Yan-Pei Cao, Weihao Cheng, Ying Shan, Shenghua Gao

Enabling AI systems to perform tasks by following human instructions can significantly boost productivity. In this paper, we present InstructP2P, an end-to-end framework for editing 3D point clouds guided by high-level textual instructions. InstructP2P extends the capabilities of existing methods by combining the strengths of a text-conditioned point cloud diffusion model, Point-E, with powerful language models, enabling both color and geometry editing through language instructions. To train InstructP2P, we introduce a new shape editing dataset constructed by integrating a shape segmentation dataset, off-the-shelf shape programs, and diverse edit instructions generated by a large language model, ChatGPT. Our method edits both the color and the geometry of specific regions in a single forward pass while leaving other regions unaffected. In our experiments, InstructP2P generalizes to novel shape categories and instructions, despite being trained on a limited amount of data.
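
The editing data described above pairs a source point cloud with an instruction and an edited result in which only the targeted region changes. Below is a minimal, hypothetical sketch of what one color-editing training triple could look like, built from a part-segmented point cloud; the part names, the instruction template, and the random colors are illustrative assumptions rather than the paper's actual dataset pipeline.

```python
import numpy as np

# Hypothetical illustration of a color-editing training triple:
# (source point cloud, text instruction, edited point cloud).
# Part labels and the instruction template are assumptions for this sketch,
# not the actual InstructP2P dataset construction.

N = 2048
rng = np.random.default_rng(0)

# Source point cloud: xyz in [-1, 1]^3, rgb in [0, 1], one part id per point.
xyz = rng.uniform(-1.0, 1.0, size=(N, 3)).astype(np.float32)
rgb = rng.uniform(0.0, 1.0, size=(N, 3)).astype(np.float32)
part_ids = rng.integers(0, 4, size=N)          # e.g. 0=seat, 1=back, 2=legs, 3=armrests
part_names = {0: "seat", 1: "back", 2: "legs", 3: "armrests"}

def make_color_edit_sample(xyz, rgb, part_ids, target_part, color_name, color_rgb):
    """Recolor a single part; all other points keep their colors and positions."""
    edited_rgb = rgb.copy()
    mask = part_ids == target_part
    edited_rgb[mask] = color_rgb
    instruction = f"make the {part_names[target_part]} {color_name}"
    source = np.concatenate([xyz, rgb], axis=1)          # (N, 6)
    target = np.concatenate([xyz, edited_rgb], axis=1)   # (N, 6)
    return source, instruction, target

src, instr, tgt = make_color_edit_sample(
    xyz, rgb, part_ids, target_part=0, color_name="red",
    color_rgb=np.array([1.0, 0.0, 0.0]),
)
print(instr, src.shape, tgt.shape)  # "make the seat red" (2048, 6) (2048, 6)
```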


Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models

Dec 28, 2022
Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, Shenghua Gao

Recent CLIP-guided 3D optimization methods, e.g., DreamFields and PureCLIPNeRF, have achieved great success in zero-shot text-guided 3D synthesis. However, because they are trained from scratch with random initialization and no prior knowledge, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce an explicit 3D shape prior into CLIP-guided 3D optimization methods. Specifically, we first generate a high-quality 3D shape from the input text in a text-to-shape stage as the 3D shape prior. We then use it to initialize a neural radiance field, which we optimize with the full prompt. For text-to-shape generation, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between images synthesized by the text-to-image model and the shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, Dream3D, is capable of generating imaginative 3D content with better visual quality and shape accuracy than state-of-the-art methods.

* 20 pages, 15 figures. Project page: https://bluestyle97.github.io/dream3d/ 

UNIF: United Neural Implicit Functions for Clothed Human Reconstruction and Animation

Jul 20, 2022
Shenhan Qian, Jiale Xu, Ziwei Liu, Liqian Ma, Shenghua Gao

We propose United Neural Implicit Functions (UNIF), a part-based method for clothed human reconstruction and animation that takes raw scans and skeletons as input. Previous part-based methods for human reconstruction rely on ground-truth part labels from SMPL and are thus limited to minimally clothed humans. In contrast, our method learns to separate parts from body motions instead of part supervision, so it can be extended to clothed humans and other articulated objects. Our Partition-from-Motion is achieved with a bone-centered initialization, a bone limit loss, and a section normal loss, which ensure stable part division even when the training poses are limited. We also present a minimal perimeter loss for the SDF to suppress extraneous surfaces and part overlap. Another core component of our method is an adjacent part seaming algorithm that produces non-rigid deformations to maintain the connections between parts, which significantly alleviates part-based artifacts. Within this algorithm, we further propose "Competing Parts", a method that defines blending weights by the relative position of a point to the bones instead of its absolute position, avoiding the generalization problem of neural implicit functions with inverse LBS (linear blend skinning). We demonstrate the effectiveness of our method through clothed human body reconstruction and animation on the CAPE and ClothSeq datasets.
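
The "Competing Parts" idea assigns a point its blending weights from where it sits relative to the bones rather than from a weight field over absolute coordinates. The sketch below illustrates that general idea with distance-to-bone softmax weights feeding a standard linear blend skinning step; the softmax weighting and its temperature are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

# Hedged sketch of relative-position blend weights: a point's weight for each
# bone comes from its distance to that bone segment, not from a weight field
# defined over absolute coordinates. The softmax weighting and temperature
# below are illustrative assumptions, not UNIF's exact "Competing Parts" rule.

def point_to_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (one bone)."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / (np.dot(ab, ab) + 1e-8), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def relative_blend_weights(p, bones, temperature=0.05):
    """Softmax over negative bone distances: nearer bones 'compete' for the point."""
    d = np.array([point_to_segment_distance(p, a, b) for a, b in bones])
    logits = -d / temperature
    w = np.exp(logits - logits.max())
    return w / w.sum()

def lbs_transform(p, bones, transforms):
    """Blend per-bone rigid transforms (4x4) using the relative-position weights."""
    w = relative_blend_weights(p, bones)
    p_h = np.append(p, 1.0)
    blended = sum(wi * (T @ p_h) for wi, T in zip(w, transforms))
    return blended[:3]

# Toy example: a two-bone "arm"; the second bone carries a small translation.
bones = [(np.zeros(3), np.array([0.0, 1.0, 0.0])),
         (np.array([0.0, 1.0, 0.0]), np.array([0.0, 2.0, 0.0]))]
T0 = np.eye(4)
T1 = np.eye(4); T1[:3, 3] = [0.2, 0.0, 0.0]
print(lbs_transform(np.array([0.05, 1.5, 0.0]), bones, [T0, T1]))
```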

* Accepted to ECCV 2022 

TaylorImNet for Fast 3D Shape Reconstruction Based on Implicit Surface Function

Jan 18, 2022
Yuting Xiao, Jiale Xu, Shenghua Gao

Benefiting from their continuous representation ability, deep implicit functions can extract the iso-surface of a shape at arbitrary resolution. However, using a neural network with a large number of parameters as the implicit function limits the generation speed of high-resolution topology, because a large number of query points must be forwarded through the network. In this work, we propose TaylorImNet, inspired by the Taylor series, for implicit 3D shape representation. TaylorImNet exploits a set of discrete expansion points and their corresponding Taylor series to model a continuous implicit shape field. Once the expansion points and corresponding coefficients are obtained, our model only needs to evaluate the Taylor series at each query point, and the number of expansion points is independent of the generation resolution. Based on this representation, TaylorImNet achieves a significantly faster generation speed than other baselines. We evaluate our approach on reconstruction tasks with various types of input, and the experimental results demonstrate that it achieves slightly better performance than existing state-of-the-art baselines while improving inference speed by a large margin.
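
Since evaluation reduces to computing a Taylor series at each query point, the per-point cost no longer involves a full network forward pass. The sketch below evaluates an implicit field from precomputed expansion points with a second-order expansion; the expansion order, the nearest-point lookup, and the random placeholder coefficients (which the paper's network would predict) are assumptions made for illustration.

```python
import numpy as np

# Minimal sketch of evaluating an implicit field from precomputed expansion
# points via a second-order Taylor series. In TaylorImNet the values, gradients,
# and Hessians would be predicted by a network; here they are random placeholders,
# and the nearest-expansion-point lookup and second order are assumptions.

rng = np.random.default_rng(0)
K = 64                                     # number of expansion points
centers = rng.uniform(-1, 1, size=(K, 3))  # expansion points p_k
values = rng.normal(size=K)                # f(p_k)
grads = rng.normal(size=(K, 3))            # grad f(p_k)
hessians = rng.normal(size=(K, 3, 3))
hessians = 0.5 * (hessians + hessians.transpose(0, 2, 1))  # symmetrize H_k

def taylor_eval(queries):
    """f(x) ~= f(p) + g.(x-p) + 0.5 (x-p)^T H (x-p), with p the nearest expansion point."""
    # Nearest expansion point per query: no network forward pass is needed here,
    # so the per-point cost is independent of the output resolution.
    d2 = ((queries[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    k = d2.argmin(axis=1)
    delta = queries - centers[k]                                  # (Q, 3)
    first = np.einsum("qi,qi->q", grads[k], delta)
    second = 0.5 * np.einsum("qi,qij,qj->q", delta, hessians[k], delta)
    return values[k] + first + second

queries = rng.uniform(-1, 1, size=(100_000, 3))
field = taylor_eval(queries)               # evaluate a dense set of query points
print(field.shape)                         # (100000,)
```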

* 10 pages, 7 figures 

OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification

Sep 23, 2021
Xianing Chen, Jialang Xu, Jiale Xu, Shenghua Gao

Transformers have shown strong performance on many vision tasks. However, for person re-identification (ReID), vanilla transformers leave the rich context of high-order feature relations under-exploited and deteriorate local feature details, which is insufficient given the dramatic appearance variations of pedestrians. In this work, we propose an Omni-Relational High-Order Transformer (OH-Former) to model omni-relational features for ReID. First, to strengthen the capacity of the visual representation, instead of computing the attention matrix from pairs of queries and isolated keys at each spatial location, we go a step further and model high-order statistics for the non-local mechanism. We share the attention weights of the corresponding layer of each order through a prior mixing mechanism to reduce the computational cost. Then, a convolution-based local relation perception module is proposed to extract local relations and 2D positional information. The experimental results are promising, showing state-of-the-art performance on the Market-1501, DukeMTMC, MSMT17 and Occluded-Duke datasets.


Layout-Guided Novel View Synthesis from a Single Indoor Panorama

Mar 31, 2021
Jiale Xu, Jia Zheng, Yanyu Xu, Rui Tang, Shenghua Gao

Existing view synthesis methods mainly focus on perspective images and have shown promising results. However, due to the limited field of view of the pinhole camera, their performance degrades quickly under large camera movements. In this paper, we make the first attempt to generate novel views from a single indoor panorama while taking large camera translations into consideration. To tackle this challenging problem, we first use convolutional neural networks (CNNs) to extract deep features and estimate the depth map of the source-view image. We then leverage the room layout prior, a strong structural constraint of indoor scenes, to guide the generation of target views. More concretely, we estimate the room layout in the source view and transform it into the target viewpoint as guidance. Meanwhile, we also constrain the room layout of the generated target-view images to enforce geometric consistency. To validate the effectiveness of our method, we further build a large-scale photo-realistic dataset containing both small and large camera translations. The experimental results on this challenging dataset demonstrate that our method achieves state-of-the-art performance. The project page is at https://github.com/bluestyle97/PNVS.
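
The core geometric step behind synthesizing a new panorama under a large translation is to lift each source pixel to 3D using its estimated depth, move the camera, and re-project onto the target panorama. The sketch below shows that depth-based equirectangular warp; the spherical convention and the naive forward splatting are assumptions, and the paper's CNNs, layout guidance, and refinement are not modeled here.

```python
import numpy as np

# Hedged sketch of the geometric warp underlying panoramic novel view synthesis:
# lift every equirectangular pixel to 3D with its estimated depth, translate the
# camera, and re-project to the target panorama. The spherical convention and the
# naive forward splatting are assumptions made for illustration.

def warp_panorama(rgb, depth, translation):
    """rgb: (H, W, 3), depth: (H, W) metric depth, translation: (3,) camera move."""
    H, W, _ = rgb.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    theta = (u + 0.5) / W * 2 * np.pi - np.pi        # longitude in [-pi, pi)
    phi = np.pi / 2 - (v + 0.5) / H * np.pi          # latitude in (-pi/2, pi/2)

    # Unit ray directions, then 3D points shifted into the target camera frame.
    dirs = np.stack([np.cos(phi) * np.sin(theta),
                     np.sin(phi),
                     np.cos(phi) * np.cos(theta)], axis=-1)
    points = dirs * depth[..., None] - translation

    # Back to spherical coordinates and target pixel locations.
    r = np.linalg.norm(points, axis=-1)
    theta_t = np.arctan2(points[..., 0], points[..., 2])
    phi_t = np.arcsin(np.clip(points[..., 1] / np.maximum(r, 1e-8), -1, 1))
    u_t = ((theta_t + np.pi) / (2 * np.pi) * W).astype(int) % W
    v_t = ((np.pi / 2 - phi_t) / np.pi * H).astype(int).clip(0, H - 1)

    # Naive nearest-pixel splat (no z-buffering), leaving holes where nothing lands.
    target = np.zeros_like(rgb)
    target[v_t, u_t] = rgb
    return target

pano = np.random.rand(256, 512, 3).astype(np.float32)
depth = np.full((256, 512), 3.0, dtype=np.float32)
print(warp_panorama(pano, depth, np.array([0.5, 0.0, 0.0])).shape)  # (256, 512, 3)
```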

* To appear in CVPR 2021 