Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Hujun Bao

Zhejiang University

Generating Human Motion in 3D Scenes from Text Descriptions

May 13, 2024

Zhi Cen, Huaijin Pi, Sida Peng, Zehong Shen, Minghui Yang, Shuai Zhu, Hujun Bao, Xiaowei Zhou

Figure 1 for Generating Human Motion in 3D Scenes from Text Descriptions

Figure 2 for Generating Human Motion in 3D Scenes from Text Descriptions

Figure 3 for Generating Human Motion in 3D Scenes from Text Descriptions

Figure 4 for Generating Human Motion in 3D Scenes from Text Descriptions

Abstract:Generating human motions from textual descriptions has gained growing research interest due to its wide range of applications. However, only a few works consider human-scene interactions together with text conditions, which is crucial for visual and physical realism. This paper focuses on the task of generating human motions in 3D indoor scenes given text descriptions of the human-scene interactions. This task presents challenges due to the multi-modality nature of text, scene, and motion, as well as the need for spatial reasoning. To address these challenges, we propose a new approach that decomposes the complex problem into two more manageable sub-problems: (1) language grounding of the target object and (2) object-centric motion generation. For language grounding of the target object, we leverage the power of large language models. For motion generation, we design an object-centric scene representation for the generative model to focus on the target object, thereby reducing the scene complexity and facilitating the modeling of the relationship between human motions and the object. Experiments demonstrate the better motion quality of our approach compared to baselines and validate our design choices.

* Project page: https://zju3dv.github.io/text_scene_motion

Via

Access Paper or Ask Questions

MaPa: Text-driven Photorealistic Material Painting for 3D Shapes

Apr 26, 2024

Shangzhan Zhang, Sida Peng, Tao Xu, Yuanbo Yang, Tianrun Chen, Nan Xue, Yujun Shen, Hujun Bao, Ruizhen Hu, Xiaowei Zhou

Figure 1 for MaPa: Text-driven Photorealistic Material Painting for 3D Shapes

Figure 2 for MaPa: Text-driven Photorealistic Material Painting for 3D Shapes

Figure 3 for MaPa: Text-driven Photorealistic Material Painting for 3D Shapes

Figure 4 for MaPa: Text-driven Photorealistic Material Painting for 3D Shapes

Abstract:This paper aims to generate materials for 3D meshes from text descriptions. Unlike existing methods that synthesize texture maps, we propose to generate segment-wise procedural material graphs as the appearance representation, which supports high-quality rendering and provides substantial flexibility in editing. Instead of relying on extensive paired data, i.e., 3D meshes with material graphs and corresponding text descriptions, to train a material graph generative model, we propose to leverage the pre-trained 2D diffusion model as a bridge to connect the text and material graphs. Specifically, our approach decomposes a shape into a set of segments and designs a segment-controlled diffusion model to synthesize 2D images that are aligned with mesh parts. Based on generated images, we initialize parameters of material graphs and fine-tune them through the differentiable rendering module to produce materials in accordance with the textual description. Extensive experiments demonstrate the superior performance of our framework in photorealism, resolution, and editability over existing methods. Project page: https://zhanghe3z.github.io/MaPa/

* SIGGRAPH 2024. Project page: https://zhanghe3z.github.io/MaPa/

Via

Access Paper or Ask Questions

Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

Apr 15, 2024

Shuaiying Hou, Hongyu Tao, Junheng Fang, Changqing Zou, Hujun Bao, Weiwei Xu

Figure 1 for Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

Figure 2 for Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

Figure 3 for Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

Figure 4 for Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment

Abstract:Learning 3D human motion from 2D inputs is a fundamental task in the realms of computer vision and computer graphics. Many previous methods grapple with this inherently ambiguous task by introducing motion priors into the learning process. However, these approaches face difficulties in defining the complete configurations of such priors or training a robust model. In this paper, we present the Video-to-Motion Generator (VTM), which leverages motion priors through cross-modal latent feature space alignment between 3D human motion and 2D inputs, namely videos and 2D keypoints. To reduce the complexity of modeling motion priors, we model the motion data separately for the upper and lower body parts. Additionally, we align the motion data with a scale-invariant virtual skeleton to mitigate the interference of human skeleton variations to the motion priors. Evaluated on AIST++, the VTM showcases state-of-the-art performance in reconstructing 3D human motion from monocular videos. Notably, our VTM exhibits the capabilities for generalization to unseen view angles and in-the-wild videos.

Via

Access Paper or Ask Questions

GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image

Apr 02, 2024

Chong Bao, Yinda Zhang, Yuan Li, Xiyu Zhang, Bangbang Yang, Hujun Bao, Marc Pollefeys, Guofeng Zhang, Zhaopeng Cui

Abstract:Recently, we have witnessed the explosive growth of various volumetric representations in modeling animatable head avatars. However, due to the diversity of frameworks, there is no practical method to support high-level applications like 3D head avatar editing across different representations. In this paper, we propose a generic avatar editing approach that can be universally applied to various 3DMM driving volumetric head avatars. To achieve this goal, we design a novel expression-aware modification generative model, which enables lift 2D editing from a single image to a consistent 3D modification field. To ensure the effectiveness of the generative modification process, we develop several techniques, including an expression-dependent modification distillation scheme to draw knowledge from the large-scale head avatar model and 2D facial texture editing tools, implicit latent space guidance to enhance model convergence, and a segmentation-based loss reweight strategy for fine-grained texture inversion. Extensive experiments demonstrate that our method delivers high-quality and consistent results across multiple expression and viewpoints. Project page: https://zju3dv.github.io/geneavatar/

* Accepted to CVPR 2024. Project page: https://zju3dv.github.io/geneavatar/

Via

Access Paper or Ask Questions

CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field

Mar 24, 2024

Jiarui Hu, Xianhao Chen, Boyin Feng, Guanglin Li, Liangjing Yang, Hujun Bao, Guofeng Zhang, Zhaopeng Cui

Figure 1 for CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field

Figure 2 for CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field

Figure 3 for CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field

Figure 4 for CG-SLAM: Efficient Dense RGB-D SLAM in a Consistent Uncertainty-aware 3D Gaussian Field

Abstract:Recently neural radiance fields (NeRF) have been widely exploited as 3D representations for dense simultaneous localization and mapping (SLAM). Despite their notable successes in surface modeling and novel view synthesis, existing NeRF-based methods are hindered by their computationally intensive and time-consuming volume rendering pipeline. This paper presents an efficient dense RGB-D SLAM system, i.e., CG-SLAM, based on a novel uncertainty-aware 3D Gaussian field with high consistency and geometric stability. Through an in-depth analysis of Gaussian Splatting, we propose several techniques to construct a consistent and stable 3D Gaussian field suitable for tracking and mapping. Additionally, a novel depth uncertainty model is proposed to ensure the selection of valuable Gaussian primitives during optimization, thereby improving tracking efficiency and accuracy. Experiments on various datasets demonstrate that CG-SLAM achieves superior tracking and mapping performance with a notable tracking speed of up to 15 Hz. We will make our source code publicly available. Project page: https://zju3dv.github.io/cg-slam.

* Project Page: https://zju3dv.github.io/cg-slam

Via

Access Paper or Ask Questions

Boosting Image Restoration via Priors from Pre-trained Models

Mar 19, 2024

Xiaogang Xu, Shu Kong, Tao Hu, Zhe Liu, Hujun Bao

Figure 1 for Boosting Image Restoration via Priors from Pre-trained Models

Figure 2 for Boosting Image Restoration via Priors from Pre-trained Models

Figure 3 for Boosting Image Restoration via Priors from Pre-trained Models

Figure 4 for Boosting Image Restoration via Priors from Pre-trained Models

Abstract:Pre-trained models with large-scale training data, such as CLIP and Stable Diffusion, have demonstrated remarkable performance in various high-level computer vision tasks such as image understanding and generation from language descriptions. Yet, their potential for low-level tasks such as image restoration remains relatively unexplored. In this paper, we explore such models to enhance image restoration. As off-the-shelf features (OSF) from pre-trained models do not directly serve image restoration, we propose to learn an additional lightweight module called Pre-Train-Guided Refinement Module (PTG-RM) to refine restoration results of a target restoration network with OSF. PTG-RM consists of two components, Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE), and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations, while PTG-CSA enhances spatial-channel attention for restoration-related learning. Extensive experiments demonstrate that PTG-RM, with its compact size ($<$1M parameters), effectively enhances restoration performance of various models across different tasks, including low-light enhancement, deraining, deblurring, and denoising.

* CVPR2024

Via

Access Paper or Ask Questions

Vox-Fusion++: Voxel-based Neural Implicit Dense Tracking and Mapping with Multi-maps

Mar 19, 2024

Hongjia Zhai, Hai Li, Xingrui Yang, Gan Huang, Yuhang Ming, Hujun Bao, Guofeng Zhang

Abstract:In this paper, we introduce Vox-Fusion++, a multi-maps-based robust dense tracking and mapping system that seamlessly fuses neural implicit representations with traditional volumetric fusion techniques. Building upon the concept of implicit mapping and positioning systems, our approach extends its applicability to real-world scenarios. Our system employs a voxel-based neural implicit surface representation, enabling efficient encoding and optimization of the scene within each voxel. To handle diverse environments without prior knowledge, we incorporate an octree-based structure for scene division and dynamic expansion. To achieve real-time performance, we propose a high-performance multi-process framework. This ensures the system's suitability for applications with stringent time constraints. Additionally, we adopt the idea of multi-maps to handle large-scale scenes, and leverage loop detection and hierarchical pose optimization strategies to reduce long-term pose drift and remove duplicate geometry. Through comprehensive evaluations, we demonstrate that our method outperforms previous methods in terms of reconstruction quality and accuracy across various scenarios. We also show that our Vox-Fusion++ can be used in augmented reality and collaborative mapping applications. Our source code will be publicly available at \url{https://github.com/zju3dv/Vox-Fusion_Plus_Plus}

* 14 pages. arXiv admin note: text overlap with arXiv:2210.15858

Via

Access Paper or Ask Questions

3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation

Mar 14, 2024

Frank Zhang, Yibo Zhang, Quan Zheng, Rui Ma, Wei Hua, Hujun Bao, Weiwei Xu, Changqing Zou

Abstract:Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However, these methods heavily rely on the outputs of existing models, leading to error accumulation in geometry and appearance that prevent the models from being used in various scenarios (e.g., outdoor and unreal scenarios). To address this limitation, we generatively refine the newly generated local views by querying and aggregating global 3D information, and then progressively generate the 3D scene. Specifically, we employ a tri-plane features-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency, and propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior from 2D diffusion model as well as the global 3D information of the current scene. Our extensive experiments demonstrate that, in comparison to previous methods, our approach supports wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.

* 11 pages, 7 figures

Via

Access Paper or Ask Questions

Multi-View Neural 3D Reconstruction of Micro-/Nanostructures with Atomic Force Microscopy

Jan 21, 2024

Shuo Chen, Mao Peng, Yijin Li, Bing-Feng Ju, Hujun Bao, Yuan-Liu Chen, Guofeng Zhang

Abstract:Atomic Force Microscopy (AFM) is a widely employed tool for micro-/nanoscale topographic imaging. However, conventional AFM scanning struggles to reconstruct complex 3D micro-/nanostructures precisely due to limitations such as incomplete sample topography capturing and tip-sample convolution artifacts. Here, we propose a multi-view neural-network-based framework with AFM (MVN-AFM), which accurately reconstructs surface models of intricate micro-/nanostructures. Unlike previous works, MVN-AFM does not depend on any specially shaped probes or costly modifications to the AFM system. To achieve this, MVN-AFM uniquely employs an iterative method to align multi-view data and eliminate AFM artifacts simultaneously. Furthermore, we pioneer the application of neural implicit surface reconstruction in nanotechnology and achieve markedly improved results. Extensive experiments show that MVN-AFM effectively eliminates artifacts present in raw AFM images and reconstructs various micro-/nanostructures including complex geometrical microstructures printed via Two-photon Lithography and nanoparticles such as PMMA nanospheres and ZIF-67 nanocrystals. This work presents a cost-effective tool for micro-/nanoscale 3D analysis.

Via

Access Paper or Ask Questions

PNeRFLoc: Visual Localization with Point-based Neural Radiance Fields

Dec 17, 2023

Boming Zhao, Luwei Yang, Mao Mao, Hujun Bao, Zhaopeng Cui

Figure 1 for PNeRFLoc: Visual Localization with Point-based Neural Radiance Fields

Figure 2 for PNeRFLoc: Visual Localization with Point-based Neural Radiance Fields

Figure 3 for PNeRFLoc: Visual Localization with Point-based Neural Radiance Fields

Figure 4 for PNeRFLoc: Visual Localization with Point-based Neural Radiance Fields

Abstract:Due to the ability to synthesize high-quality novel views, Neural Radiance Fields (NeRF) have been recently exploited to improve visual localization in a known environment. However, the existing methods mostly utilize NeRFs for data augmentation to improve the regression model training, and the performance on novel viewpoints and appearances is still limited due to the lack of geometric constraints. In this paper, we propose a novel visual localization framework, \ie, PNeRFLoc, based on a unified point-based representation. On the one hand, PNeRFLoc supports the initial pose estimation by matching 2D and 3D feature points as traditional structure-based methods; on the other hand, it also enables pose refinement with novel view synthesis using rendering-based optimization. Specifically, we propose a novel feature adaption module to close the gaps between the features for visual localization and neural rendering. To improve the efficacy and efficiency of neural rendering-based optimization, we also develop an efficient rendering-based framework with a warping loss function. Furthermore, several robustness techniques are developed to handle illumination changes and dynamic objects for outdoor scenarios. Experiments demonstrate that PNeRFLoc performs the best on synthetic data when the NeRF model can be well learned and performs on par with the SOTA method on the visual localization benchmark datasets.

* Accepted to AAAI 2024

Via

Access Paper or Ask Questions