Jiemin Fang

GaussianDreamer: Fast Generation from Text to 3D Gaussian Splatting with Point Cloud Priors

Oct 12, 2023
Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, Xinggang Wang

In recent times, the generation of 3D assets from text prompts has shown impressive results. Both 2D and 3D diffusion models can generate decent 3D objects based on prompts. 3D diffusion models have good 3D consistency, but their quality and generalization are limited because trainable 3D data is expensive and hard to obtain. 2D diffusion models enjoy strong generalization and fine generation abilities, but 3D consistency is hard to guarantee. This paper attempts to bridge the power of the two types of diffusion models via the recent explicit and efficient 3D Gaussian splatting representation. A fast 3D generation framework, named GaussianDreamer, is proposed, where the 3D diffusion model provides point cloud priors for initialization and the 2D diffusion model enriches the geometry and appearance. Operations of noisy point growing and color perturbation are introduced to enhance the initialized Gaussians. Our GaussianDreamer can generate a high-quality 3D instance within 25 minutes on one GPU, much faster than previous methods, while the generated instances can be directly rendered in real time. Demos and code are available at https://taoranyi.com/gaussiandreamer/.
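
As a rough illustration of the two enhancement operations, the sketch below grows extra points around the point cloud prior and jitters their colors before the points are turned into Gaussians. It is a minimal NumPy sketch; the function name, growth radius, and noise scales are hypothetical choices, not values from the paper.

```python
import numpy as np

def grow_and_perturb(points, colors, n_new=1000, radius=0.02, color_noise=0.05, seed=0):
    """Densify an initial point cloud (noisy point growing) and jitter its
    colors (color perturbation) before using it to initialize 3D Gaussians.

    points: (N, 3) xyz positions from the 3D diffusion prior.
    colors: (N, 3) RGB colors in [0, 1].
    """
    rng = np.random.default_rng(seed)

    # Noisy point growing: pick parent points and offset them inside a small ball.
    parents = rng.integers(0, len(points), size=n_new)
    offsets = rng.normal(scale=radius, size=(n_new, 3))
    new_points = points[parents] + offsets

    # Color perturbation: copy the parent color and add small Gaussian noise.
    new_colors = np.clip(colors[parents] + rng.normal(scale=color_noise, size=(n_new, 3)), 0.0, 1.0)

    grown_points = np.concatenate([points, new_points], axis=0)
    grown_colors = np.concatenate([colors, new_colors], axis=0)
    return grown_points, grown_colors

# Toy usage with a random stand-in for the point cloud prior.
pts, cols = np.random.rand(500, 3), np.random.rand(500, 3)
pts2, cols2 = grow_and_perturb(pts, cols)
print(pts2.shape, cols2.shape)  # (1500, 3) (1500, 3)
```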

* Work in progress. Project page: https://taoranyi.com/gaussiandreamer/ 

4D Gaussian Splatting for Real-Time Dynamic Scene Rendering

Oct 12, 2023
Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, Xinggang Wang

Representing and rendering dynamic scenes has been an important but challenging task. In particular, it is usually hard to maintain high efficiency while accurately modeling complex motions. We introduce 4D Gaussian Splatting (4D-GS) to achieve real-time dynamic scene rendering while also enjoying high training and storage efficiency. An efficient deformation field is constructed to model both Gaussian motions and shape deformations. Adjacent Gaussians are connected via a HexPlane to produce more accurate position and shape deformations. Our 4D-GS method achieves real-time rendering at high resolutions, 70 FPS at an 800$\times$800 resolution on an RTX 3090 GPU, while maintaining comparable or higher quality than previous state-of-the-art methods. More demos and code are available at https://guanjunwu.github.io/4dgs/.
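
The core idea of deforming one canonical set of Gaussians over time can be pictured with a toy deformation field. The sketch below is a hypothetical stand-in (a plain MLP over position and time rather than the paper's HexPlane-based module); all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class ToyDeformationField(nn.Module):
    """Toy deformation field: (canonical Gaussian center, time) -> offsets."""
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),  # 3 for position offset, 3 for scale offset
        )

    def forward(self, centers, t):
        # centers: (N, 3) canonical Gaussian centers; t: scalar time in [0, 1].
        time = torch.full((centers.shape[0], 1), float(t))
        out = self.mlp(torch.cat([centers, time], dim=-1))
        d_pos, d_scale = out[:, :3], out[:, 3:]
        return centers + d_pos, d_scale

field = ToyDeformationField()
canonical = torch.randn(1024, 3)            # canonical 3D Gaussian centers
deformed, scale_offsets = field(canonical, t=0.5)
print(deformed.shape, scale_offsets.shape)  # torch.Size([1024, 3]) torch.Size([1024, 3])
```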

* Work in progress. Project page: https://guanjunwu.github.io/4dgs/ 

TiAVox: Time-aware Attenuation Voxels for Sparse-view 4D DSA Reconstruction

Sep 05, 2023
Zhenghong Zhou, Huangxuan Zhao, Jiemin Fang, Dongqiao Xiang, Lei Chen, Lingxia Wu, Feihong Wu, Wenyu Liu, Chuansheng Zheng, Xinggang Wang

Four-dimensional Digital Subtraction Angiography (4D DSA) plays a critical role in the diagnosis of many medical diseases, such as Arteriovenous Malformations (AVM) and Arteriovenous Fistulas (AVF). Despite its significant application value, the reconstruction of 4D DSA demands numerous views to effectively model the intricate vessels and radiocontrast flow, which implies a significant radiation dose. To address this high-radiation issue, we propose a Time-aware Attenuation Voxel (TiAVox) approach for sparse-view 4D DSA reconstruction, which paves the way for high-quality 4D imaging. Additionally, 2D and 3D DSA imaging results can be generated from the reconstructed 4D DSA images. TiAVox introduces 4D attenuation voxel grids, which reflect attenuation properties from both spatial and temporal dimensions. It is optimized by minimizing discrepancies between the rendered images and sparse 2D DSA images. Because no neural network is involved, TiAVox enjoys clear physical interpretability: the parameters of each learnable voxel represent the attenuation coefficients. We validated the TiAVox approach on both clinical and simulated datasets, achieving a Peak Signal-to-Noise Ratio (PSNR) of 31.23 for novel view synthesis using only 30 views on the clinically sourced dataset, whereas traditional Feldkamp-Davis-Kress methods require 133 views. Similarly, with merely 10 views from the synthetic dataset, TiAVox yielded a PSNR of 34.32 for novel view synthesis and 41.40 for 3D reconstruction. We also conducted ablation studies to corroborate the essential components of TiAVox. The code will be publicly available.
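
A minimal sketch of the time-aware attenuation voxel idea, assuming a learnable grid indexed by (time, z, y, x) whose values are read out by linear interpolation in time and trilinear interpolation in space; the resolutions and class name are hypothetical, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

class ToyTiAVoxGrid(torch.nn.Module):
    """Toy 4D attenuation grid: learnable voxels over (time, z, y, x).

    Each voxel value plays the role of an attenuation coefficient, so the
    parameters stay physically interpretable (no neural network in the path).
    """
    def __init__(self, t_res=8, s_res=32):
        super().__init__()
        self.grid = torch.nn.Parameter(torch.zeros(t_res, 1, s_res, s_res, s_res))
        self.t_res = t_res

    def forward(self, xyz, t):
        # xyz: (N, 3) points in [-1, 1]; t: scalar in [0, 1].
        # Linear interpolation along time between the two nearest time slices.
        ft = t * (self.t_res - 1)
        t0, t1 = int(ft), min(int(ft) + 1, self.t_res - 1)
        w1 = ft - t0
        grid_t = (1 - w1) * self.grid[t0] + w1 * self.grid[t1]   # (1, D, H, W)

        # Trilinear interpolation in space via grid_sample.
        g = xyz.view(1, -1, 1, 1, 3)                              # coords ordered (x, y, z)
        vals = F.grid_sample(grid_t.unsqueeze(0), g, mode='bilinear', align_corners=True)
        return vals.view(-1)                                      # (N,) attenuation values

atten = ToyTiAVoxGrid()
pts = torch.rand(2048, 3) * 2 - 1
print(atten(pts, t=0.3).shape)  # torch.Size([2048])
```

Optimization would then compare images rendered from these attenuation values against the sparse 2D DSA views and backpropagate directly into the voxel parameters.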

* 10 pages, 8 figures 

WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation

Apr 27, 2023
Lianghui Zhu, Yingyue Li, Jiemin Fang, Yan Liu, Hao Xin, Wenyu Liu, Xinggang Wang

This paper explores the properties of the plain Vision Transformer (ViT) for Weakly-supervised Semantic Segmentation (WSSS). The class activation map (CAM) is of critical importance for understanding a classification network and launching WSSS. We observe that different attention heads of ViT focus on different image areas. Thus, a novel weight-based method is proposed to estimate the importance of attention heads end-to-end, while the self-attention maps are adaptively fused to produce high-quality CAM results that tend to cover objects more completely. Besides, we propose a ViT-based gradient clipping decoder for online retraining with the CAM results to complete the WSSS task. We name this plain Transformer-based Weakly-supervised learning framework WeakTr. It achieves state-of-the-art WSSS performance on standard benchmarks, i.e., 78.4% mIoU on the val set of PASCAL VOC 2012 and 50.3% mIoU on the val set of COCO 2014. Code is available at https://github.com/hustvl/WeakTr.
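
The weight-based fusion of attention heads can be sketched as a learnable softmax weighting over stacked self-attention maps, which is then used to refine a coarse CAM. This is a simplified, hypothetical reading of the mechanism; the tensor shapes and the propagation step are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ToyAttentionFusion(nn.Module):
    """Toy weight-based fusion of ViT self-attention heads for CAM refinement.

    attn: (L*H, N, N) self-attention maps stacked over L layers and H heads;
    a learnable weight per head decides how much each map contributes.
    """
    def __init__(self, num_maps):
        super().__init__()
        self.head_weights = nn.Parameter(torch.zeros(num_maps))

    def forward(self, attn, coarse_cam):
        # coarse_cam: (C, N) class activation over N patch tokens.
        w = torch.softmax(self.head_weights, dim=0)          # importance per head
        fused_attn = (w[:, None, None] * attn).sum(dim=0)    # (N, N) fused affinity
        # Refine the coarse CAM by propagating it with the fused affinity.
        return coarse_cam @ fused_attn

L, H, N, C = 12, 6, 196, 20
fusion = ToyAttentionFusion(num_maps=L * H)
attn_maps = torch.rand(L * H, N, N).softmax(dim=-1)
cam = torch.rand(C, N)
print(fusion(attn_maps, cam).shape)  # torch.Size([20, 196])
```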

* 20 pages, 11 figures 

Segment Anything in 3D with NeRFs

Apr 26, 2023
Jiazhong Cen, Zanwei Zhou, Jiemin Fang, Wei Shen, Lingxi Xie, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian

The Segment Anything Model (SAM) has demonstrated its effectiveness in segmenting any object/part in various 2D images, yet its ability in 3D has not been fully explored. The real world is composed of numerous 3D scenes and objects. Due to the scarcity of accessible 3D data and the high cost of its acquisition and annotation, lifting SAM to 3D is a challenging but valuable research avenue. With this in mind, we propose a novel framework to Segment Anything in 3D, named SA3D. Given a neural radiance field (NeRF) model, SA3D allows users to obtain the 3D segmentation result of any target object via only one-shot manual prompting in a single rendered view. With the input prompts, SAM cuts out the target object from the corresponding view. The obtained 2D segmentation mask is projected onto 3D mask grids via density-guided inverse rendering. 2D masks from other views are then rendered, which are mostly incomplete but serve as cross-view self-prompts fed into SAM again. Complete masks can thereby be obtained and projected onto the mask grids. This procedure is executed iteratively until accurate 3D masks are finally learned. SA3D can adapt to various radiance fields effectively without any additional redesign. The entire segmentation process can be completed in approximately two minutes without any engineering optimization. Our experiments demonstrate the effectiveness of SA3D in different scenes, highlighting the potential of SAM in 3D scene perception. The project page is at https://jumpat.github.io/SA3D/.
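
The density-guided projection of a 2D mask onto 3D mask grids can be sketched as accumulating each masked pixel's confidence into the voxels its ray passes through, weighted by density. The sketch below is a toy NumPy version under that assumption; the function and grid layout are hypothetical, not the released implementation.

```python
import numpy as np

def project_mask_to_grid(mask_grid, density_grid, rays_o, rays_d, mask_vals,
                         n_samples=64, near=0.0, far=2.0):
    """Toy density-guided projection of a 2D mask onto a 3D mask grid.

    mask_grid, density_grid: (R, R, R) voxel grids over the unit cube.
    rays_o, rays_d: (P, 3) origins/directions for the masked pixels.
    mask_vals: (P,) per-pixel mask confidence from SAM.
    """
    R = mask_grid.shape[0]
    ts = np.linspace(near, far, n_samples)
    for o, d, m in zip(rays_o, rays_d, mask_vals):
        pts = o[None, :] + ts[:, None] * d[None, :]            # (S, 3) samples on the ray
        idx = np.clip((pts * R).astype(int), 0, R - 1)         # nearest voxel indices
        sigma = density_grid[idx[:, 0], idx[:, 1], idx[:, 2]]  # density along the ray
        w = sigma / (sigma.sum() + 1e-8)                       # density-derived weights
        # Spread the pixel's mask confidence onto the voxels it likely hits.
        np.add.at(mask_grid, (idx[:, 0], idx[:, 1], idx[:, 2]), m * w)
    return mask_grid

# Toy usage with random stand-ins for the NeRF density and one view's mask.
grid, dens = np.zeros((64, 64, 64)), np.random.rand(64, 64, 64)
origins = np.tile([[0.5, 0.5, 0.0]], (100, 1))
dirs = np.random.rand(100, 3)
project_mask_to_grid(grid, dens, origins, dirs, np.ones(100))
```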

* Work in progress. Project page: https://jumpat.github.io/SA3D/ 

TinyDet: Accurate Small Object Detection in Lightweight Generic Detectors

Apr 07, 2023
Shaoyu Chen, Tianheng Cheng, Jiemin Fang, Qian Zhang, Yuan Li, Wenyu Liu, Xinggang Wang

Small object detection requires the detection head to scan a large number of positions on image feature maps, which is extremely hard for computation- and energy-efficient lightweight generic detectors. To accurately detect small objects with limited computation, we propose a two-stage lightweight detection framework with extremely low computation complexity, termed TinyDet. It enables high-resolution feature maps for dense anchoring to better cover small objects, proposes a sparsely-connected convolution to reduce computation, enhances the early-stage features in the backbone, and addresses the feature misalignment problem for accurate small object detection. On the COCO benchmark, our TinyDet-M achieves 30.3 AP and 13.5 AP^s with only 991 MFLOPs, making it the first detector with an AP over 30 at less than 1 GFLOP; besides, TinyDet-S and TinyDet-L achieve promising performance under different computation budgets.
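
To see why sparse channel connectivity matters for high-resolution detection heads, the sketch below compares the multiply-accumulate count of a dense 3x3 convolution with a grouped one on a large feature map. The grouped convolution is only a stand-in for the paper's sparsely-connected convolution, and all channel counts and map sizes are made up for illustration.

```python
import torch
import torch.nn as nn

def conv_macs(conv, h, w):
    """Multiply-accumulate count of a conv layer applied to an h x w feature map."""
    k = conv.kernel_size[0] * conv.kernel_size[1]
    return h * w * k * (conv.in_channels // conv.groups) * conv.out_channels

# Dense vs. grouped 3x3 convolution; the grouped conv only approximates the
# spirit of sparse channel connectivity, not TinyDet's exact operator.
dense = nn.Conv2d(32, 32, kernel_size=3, padding=1)
sparse = nn.Conv2d(32, 32, kernel_size=3, padding=1, groups=8)

h, w = 100, 160  # hypothetical high-resolution map for dense anchoring on small objects
print(f"dense : {conv_macs(dense, h, w) / 1e6:.1f} M multiply-adds")
print(f"sparse: {conv_macs(sparse, h, w) / 1e6:.1f} M multiply-adds")  # ~8x fewer

x = torch.randn(1, 32, h, w)
assert dense(x).shape == sparse(x).shape  # same output resolution, far less computation
```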

Generalizable Neural Voxels for Fast Human Radiance Fields

Mar 27, 2023
Taoran Yi, Jiemin Fang, Xinggang Wang, Wenyu Liu

Rendering moving human bodies at free viewpoints from only a monocular video is quite a challenging problem. The information is too sparse to model complicated human body structures and motions across both the view and pose dimensions. Neural radiance fields (NeRF) have shown great power in novel view synthesis and have been applied to human body rendering. However, most current NeRF-based methods bear huge costs for both training and rendering, which impedes their wide application in real-life scenarios. In this paper, we propose a rendering framework that can learn moving human body structures extremely quickly from a monocular video. The framework is built by integrating both neural fields and neural voxels. In particular, a set of generalizable neural voxels is constructed. Pretrained on various human bodies, these general voxels represent a basic skeleton and can provide strong geometric priors. For the fine-tuning process, individual voxels are constructed for learning textures that differ across individuals, complementary to the general voxels. Thus, learning a novel body can be further accelerated, taking only a few minutes. Our method shows significantly higher training efficiency than previous methods, while maintaining similar rendering quality. The project page is at https://taoranyi.com/gneuvox .
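
A minimal sketch of the general/individual voxel split, assuming shared pretrained voxels carry the geometric prior while per-subject voxels are optimized during fine-tuning for texture; the feature sizes, fusion MLP, and class name are hypothetical.

```python
import torch
import torch.nn.functional as F

class ToyGNeuVox(torch.nn.Module):
    """Toy split into shared 'general' voxels (pretrained geometry prior) and
    per-subject 'individual' voxels (fine texture), fused by a small MLP."""
    def __init__(self, res=32, c_gen=8, c_ind=8):
        super().__init__()
        self.general = torch.nn.Parameter(torch.randn(1, c_gen, res, res, res) * 0.01)
        self.individual = torch.nn.Parameter(torch.zeros(1, c_ind, res, res, res))
        self.head = torch.nn.Sequential(
            torch.nn.Linear(c_gen + c_ind, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 4),  # (density, r, g, b)
        )

    def forward(self, xyz):
        # xyz: (N, 3) query points in [-1, 1].
        g = xyz.view(1, -1, 1, 1, 3)
        f_gen = F.grid_sample(self.general, g, align_corners=True).view(self.general.shape[1], -1).t()
        f_ind = F.grid_sample(self.individual, g, align_corners=True).view(self.individual.shape[1], -1).t()
        return self.head(torch.cat([f_gen, f_ind], dim=-1))  # (N, 4)

model = ToyGNeuVox()
print(model(torch.rand(4096, 3) * 2 - 1).shape)  # torch.Size([4096, 4])
```

Under this assumption, fine-tuning on a new person would mostly update the individual voxels while reusing the pretrained general voxels, which is what makes per-subject learning fast.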

* Project page: http://taoranyi.com/gneuvox 

Fast Dynamic Radiance Fields with Time-Aware Neural Voxels

May 30, 2022
Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, Qi Tian

Neural radiance fields (NeRF) have shown great success in modeling 3D scenes and synthesizing novel-view images. However, most previous NeRF methods take a long time to optimize a single scene. Explicit data structures, e.g., voxel features, show great potential for accelerating the training process. However, voxel features face two big challenges when applied to dynamic scenes, i.e., modeling temporal information and capturing different scales of point motions. We propose a radiance field framework, named TiNeuVox, that represents scenes with time-aware voxel features. A tiny coordinate deformation network is introduced to model coarse motion trajectories, and temporal information is further enhanced in the radiance network. A multi-distance interpolation method is proposed and applied to voxel features to model both small and large motions. Our framework significantly accelerates the optimization of dynamic radiance fields while maintaining high rendering quality. Empirical evaluation is performed on both synthetic and real scenes. Our TiNeuVox completes training in only 8 minutes with an 8-MB storage cost, while showing similar or even better rendering performance than previous dynamic NeRF methods.
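
One plausible way to picture the multi-distance interpolation is to sample the voxel features at several pooled resolutions and concatenate them, so small motions are captured at fine scales and large motions at coarse ones. The sketch below follows that reading; it is an assumption for illustration, not the exact operator in the paper.

```python
import torch
import torch.nn.functional as F

def multi_distance_features(voxels, xyz, scales=(1, 2, 4)):
    """Toy multi-distance interpolation: sample voxel features at several
    pooled resolutions and concatenate them per query point.

    voxels: (1, C, D, H, W) feature grid; xyz: (N, 3) coords in [-1, 1].
    """
    g = xyz.view(1, -1, 1, 1, 3)
    feats = []
    for s in scales:
        grid_s = voxels if s == 1 else F.avg_pool3d(voxels, kernel_size=s)
        f = F.grid_sample(grid_s, g, mode='bilinear', align_corners=True)
        feats.append(f.view(voxels.shape[1], -1).t())   # (N, C) per scale
    return torch.cat(feats, dim=-1)                      # (N, C * len(scales))

voxels = torch.randn(1, 16, 64, 64, 64)
pts = torch.rand(1024, 3) * 2 - 1
print(multi_distance_features(voxels, pts).shape)  # torch.Size([1024, 48])
```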

* Project page: https://jaminfong.cn/tineuvox 

Temporally Efficient Vision Transformer for Video Instance Segmentation

Apr 18, 2022
Shusheng Yang, Xinggang Wang, Yu Li, Yuxin Fang, Jiemin Fang, Wenyu Liu, Xun Zhao, Ying Shan

Recently, vision transformers have achieved tremendous success on image-level visual recognition tasks. To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS). Different from previous transformer-based VIS methods, TeViT is nearly convolution-free: it contains a transformer backbone and a query-based video instance segmentation head. In the backbone stage, we propose a nearly parameter-free messenger shift mechanism for early temporal context fusion. In the head stages, we propose a parameter-shared spatiotemporal query interaction mechanism to build the one-to-one correspondence between video instances and queries. Thus, TeViT fully utilizes both frame-level and instance-level temporal context information and obtains strong temporal modeling capacity with negligible extra computational cost. On three widely adopted VIS benchmarks, i.e., YouTube-VIS-2019, YouTube-VIS-2021, and OVIS, TeViT obtains state-of-the-art results and maintains high inference speed, e.g., 46.6 AP with 68.9 FPS on YouTube-VIS-2019. Code is available at https://github.com/hustvl/TeViT.
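
The messenger shift mechanism can be loosely pictured as rolling a few dedicated per-frame tokens along the time axis so neighboring frames exchange context at almost no parameter cost. The sketch below is in the spirit of temporal token shifting and is a hypothetical simplification, not the exact TeViT operation.

```python
import torch

def messenger_shift(tokens, n_msg=8):
    """Toy messenger shift: roll a handful of per-frame 'messenger' tokens
    along the time axis so neighboring frames exchange context without
    extra parameters.

    tokens: (T, N, C) per-frame token sequences; the last n_msg tokens of each
    frame act as messengers.
    """
    patch, msg = tokens[:, :-n_msg], tokens[:, -n_msg:]
    half = n_msg // 2
    # Shift half of the messengers forward in time and half backward.
    msg_fwd = torch.roll(msg[:, :half], shifts=1, dims=0)
    msg_bwd = torch.roll(msg[:, half:], shifts=-1, dims=0)
    return torch.cat([patch, msg_fwd, msg_bwd], dim=1)

x = torch.randn(36, 196 + 8, 384)  # 36 frames, 196 patch tokens + 8 messengers
print(messenger_shift(x).shape)    # torch.Size([36, 204, 384])
```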

* To appear in CVPR 2022 