Xiaoshuai Zhang

TensoIR: Tensorial Inverse Rendering

Apr 24, 2023
Haian Jin, Isabella Liu, Peijia Xu, Xiaoshuai Zhang, Songfang Han, Sai Bi, Xiaowei Zhou, Zexiang Xu, Hao Su

We propose TensoIR, a novel inverse rendering approach based on tensor factorization and neural fields. Unlike previous works that use purely MLP-based neural fields and thus suffer from low capacity and high computation costs, we extend TensoRF, a state-of-the-art approach for radiance field modeling, to estimate scene geometry, surface reflectance, and environment illumination from multi-view images captured under unknown lighting conditions. Our approach jointly achieves radiance field reconstruction and physically-based model estimation, leading to photo-realistic novel view synthesis and relighting results. Benefiting from the efficiency and extensibility of the TensoRF-based representation, our method can accurately model secondary shading effects (such as shadows and indirect lighting) and supports input images captured under a single or multiple unknown lighting conditions. The low-rank tensor representation allows us not only to achieve fast and compact reconstruction but also to better exploit information shared across an arbitrary number of capture lighting conditions. We demonstrate the superiority of our method over baseline methods, qualitatively and quantitatively, on various challenging synthetic and real-world scenes.

* Project page: https://haian-jin.github.io/TensoIR 
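
To make the tensor-factorization idea concrete, here is a minimal sketch, under my own assumptions rather than the released code, of a scalar field stored as a low-rank CP-style decomposition: the field value at a point is a sum over rank components, each the product of three learned 1D factors sampled at the point's coordinates. TensoRF/TensoIR build richer vector-matrix factorizations and appearance features on top of this idea; the point here is only that a handful of 1D factors make the field compact to store and cheap to query.

```python
# Minimal CP-style low-rank field sketch (illustrative, not the TensoIR code).
import torch
import torch.nn.functional as F

class CPDensityField(torch.nn.Module):
    def __init__(self, resolution=128, rank=16):
        super().__init__()
        # one (rank, resolution) learned line per axis; names are illustrative
        self.lines = torch.nn.ParameterList(
            [torch.nn.Parameter(0.1 * torch.randn(rank, resolution)) for _ in range(3)]
        )

    def forward(self, xyz):                        # xyz in [-1, 1], shape (N, 3)
        feats = []
        for axis in range(3):
            coords = xyz[:, axis]                  # (N,)
            # sample the 1D line with grid_sample by storing it as a (1, R, res, 1) image
            grid = torch.stack([torch.zeros_like(coords), coords], dim=-1)   # (N, 2)
            line = self.lines[axis][None, :, :, None]                        # (1, R, res, 1)
            sampled = F.grid_sample(line, grid[None, :, None, :],
                                    align_corners=True)                      # (1, R, N, 1)
            feats.append(sampled[0, :, :, 0])      # (R, N)
        # per-rank product across the three axes, then sum over ranks
        return (feats[0] * feats[1] * feats[2]).sum(dim=0)                   # (N,)

sigma = CPDensityField()(torch.rand(1024, 3) * 2 - 1)
```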

Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation from 2D Supervision

Mar 10, 2023
Xiaoshuai Zhang, Abhijit Kundu, Thomas Funkhouser, Leonidas Guibas, Hao Su, Kyle Genova

We address efficient and structure-aware 3D scene representation from images. Nerflets are our key contribution -- a set of local neural radiance fields that together represent a scene. Each nerflet maintains its own spatial position, orientation, and extent, within which it contributes to panoptic, density, and radiance reconstructions. By leveraging only photometric and inferred panoptic image supervision, we can directly and jointly optimize the parameters of a set of nerflets so as to form a decomposed representation of the scene, where each object instance is represented by a group of nerflets. During experiments with indoor and outdoor environments, we find that nerflets: (1) fit and approximate the scene more efficiently than traditional global NeRFs, (2) allow the extraction of panoptic and photometric renderings from arbitrary views, and (3) enable tasks rare for NeRFs, such as 3D panoptic segmentation and interactive editing.

* accepted by CVPR 2023 
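
As a rough illustration of the decomposition, the following sketch (my own simplification, not the paper's implementation) blends a set of tiny local fields, each with a learned center and extent, using soft influence weights; the actual nerflets additionally carry orientations and per-nerflet semantic logits for panoptic reasoning.

```python
# Toy mixture of local radiance fields (illustrative only).
import torch

class TinyField(torch.nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(3, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 4))                 # (density, rgb)

    def forward(self, x):
        return self.net(x)

class NerfletMixture(torch.nn.Module):
    def __init__(self, num_nerflets=8):
        super().__init__()
        self.fields = torch.nn.ModuleList([TinyField() for _ in range(num_nerflets)])
        self.centers = torch.nn.Parameter(torch.rand(num_nerflets, 3) * 2 - 1)
        self.log_extent = torch.nn.Parameter(torch.zeros(num_nerflets))

    def forward(self, x):                                # x: (N, 3)
        d2 = ((x[:, None, :] - self.centers[None]) ** 2).sum(-1)        # (N, K)
        w = torch.softmax(-d2 / self.log_extent.exp()[None], dim=-1)    # influence weights
        outs = torch.stack([f(x) for f in self.fields], dim=1)          # (N, K, 4)
        return (w[..., None] * outs).sum(dim=1)                         # blended density + rgb

out = NerfletMixture()(torch.rand(256, 3) * 2 - 1)
```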

MovingParts: Motion-based 3D Part Discovery in Dynamic Radiance Field

Mar 10, 2023
Kaizhi Yang, Xiaoshuai Zhang, Zhiao Huang, Xuejin Chen, Zexiang Xu, Hao Su

We present MovingParts, a NeRF-based method for dynamic scene reconstruction and part discovery. We consider motion as an important cue for identifying parts, based on the observation that all particles on the same part share a common motion pattern. From the perspective of fluid simulation, existing deformation-based methods for dynamic NeRF can be seen as parameterizing the scene motion under the Eulerian view, i.e., focusing on specific locations in space through which the fluid flows as time passes. However, it is intractable to extract the motion of constituent objects or parts using the Eulerian-view representation. In this work, we introduce the dual Lagrangian view and enforce representations under the Eulerian and Lagrangian views to be cycle-consistent. Under the Lagrangian view, we parameterize the scene motion by tracking the trajectories of particles on objects. The Lagrangian view makes it convenient to discover parts by factorizing the scene motion as a composition of part-level rigid motions. Experimentally, our method achieves fast, high-quality dynamic scene reconstruction from even a single moving camera, and the induced part-based representation enables direct applications such as part tracking, animation, and 3D scene editing.

* 10 pages 
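
The Eulerian/Lagrangian cycle consistency can be illustrated with a small sketch (assumed structure, not the authors' code): a backward map sends a point observed at time t to a canonical position, a forward map sends canonical points back to time t, and a cycle loss encourages the two maps to invert each other; in the paper the forward (Lagrangian) motion is further factored into part-level rigid transforms.

```python
# Eulerian/Lagrangian cycle-consistency sketch (illustrative only).
import torch

def mlp(din, dout, hidden=64):
    return torch.nn.Sequential(torch.nn.Linear(din, hidden), torch.nn.ReLU(),
                               torch.nn.Linear(hidden, dout))

backward_map = mlp(4, 3)   # (x_t, t) -> x_canonical   (Eulerian view)
forward_map = mlp(4, 3)    # (x_canonical, t) -> x_t   (Lagrangian view)

x_t = torch.rand(512, 3)   # points observed at time t
t = torch.rand(512, 1)

x_canon = backward_map(torch.cat([x_t, t], dim=-1))
x_back = forward_map(torch.cat([x_canon, t], dim=-1))
cycle_loss = (x_back - x_t).pow(2).mean()   # encourage the two maps to be inverses
cycle_loss.backward()
```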

NeRFusion: Fusing Radiance Fields for Large-Scale Scene Reconstruction

Mar 21, 2022
Xiaoshuai Zhang, Sai Bi, Kalyan Sunkavalli, Hao Su, Zexiang Xu

While NeRF has shown great success for neural reconstruction and rendering, its limited MLP capacity and long per-scene optimization times make it challenging to model large-scale indoor scenes. In contrast, classical 3D reconstruction methods can handle large-scale scenes but do not produce realistic renderings. We propose NeRFusion, a method that combines the advantages of NeRF and TSDF-based fusion techniques to achieve efficient large-scale reconstruction and photo-realistic rendering. We process the input image sequence to predict per-frame local radiance fields via direct network inference. These are then fused using a novel recurrent neural network that incrementally reconstructs a global, sparse scene representation in real-time at 22 fps. This global volume can be further fine-tuned to boost rendering quality. We demonstrate that NeRFusion achieves state-of-the-art quality on both large-scale indoor and small-scale object scenes, with substantially faster reconstruction than NeRF and other recent methods.

* CVPR 2022 
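
A schematic of the recurrent fusion step, under my own assumptions about the data layout (the released method operates on a sparse 3D feature volume with learned convolutional updates): each incoming frame's local feature volume updates a running global volume through a shared recurrent unit.

```python
# Schematic per-voxel recurrent fusion (illustrative, not the NeRFusion code).
import torch

class VoxelGRUFusion(torch.nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.gru = torch.nn.GRUCell(channels, channels)   # shared across voxels

    def forward(self, global_feat, frame_feat):
        # both: (num_voxels, channels); in practice only the voxels observed
        # by the current frame would be updated
        return self.gru(frame_feat, global_feat)

fusion = VoxelGRUFusion()
global_volume = torch.zeros(4096, 16)                         # flattened voxel features
for frame_feat in [torch.randn(4096, 16) for _ in range(5)]:  # per-frame local predictions
    global_volume = fusion(global_volume, frame_feat)
```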

Close the Visual Domain Gap by Physics-Grounded Active Stereovision Depth Sensor Simulation

Feb 07, 2022
Xiaoshuai Zhang, Rui Chen, Fanbo Xiang, Yuzhe Qin, Jiayuan Gu, Zhan Ling, Minghua Liu, Peiyu Zeng, Songfang Han, Zhiao Huang, Tongzhou Mu, Jing Xu, Hao Su

In this paper, we focus on the simulation of active stereovision depth sensors, which are popular in both the academic and industrial communities. Inspired by the underlying mechanism of these sensors, we design a fully physics-grounded simulation pipeline that includes material acquisition, ray-tracing-based infrared (IR) image rendering, IR noise simulation, and depth estimation. The pipeline is able to generate depth maps with material-dependent error patterns similar to those of a real depth sensor. We conduct extensive experiments showing that perception algorithms and reinforcement learning policies trained on our simulation platform transfer well to real-world test cases without any fine-tuning. Furthermore, due to the high degree of realism of this simulation, our depth sensor simulator can be used as a convenient testbed to evaluate algorithm performance in the real world, greatly reducing the human effort required to develop robotic algorithms. The entire pipeline has been integrated into the SAPIEN simulator and is open-sourced to support research in the vision and robotics communities.

* 20 pages, 15 figures, 10 tables 
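
The last two pipeline stages can be sketched in a toy form (illustrative only; the SAPIEN implementation is far more detailed): perturb a rendered IR stereo pair with a simple noise model, then recover disparity by block matching, from which depth follows as depth = focal_length * baseline / disparity.

```python
# Toy IR noise + block-matching disparity (illustrative parameters and names).
import numpy as np

def add_ir_noise(img, gaussian_std=0.02, speckle_std=0.05, rng=np.random):
    speckle = 1.0 + speckle_std * rng.standard_normal(img.shape)
    return np.clip(img * speckle + gaussian_std * rng.standard_normal(img.shape), 0, 1)

def block_match_disparity(left, right, max_disp=32, patch=5):
    h, w = left.shape
    half = patch // 2
    disp = np.zeros((h, w))
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            ref = left[y - half:y + half + 1, x - half:x + half + 1]
            costs = [np.abs(ref - right[y - half:y + half + 1,
                                        x - d - half:x - d + half + 1]).sum()
                     for d in range(max_disp)]
            disp[y, x] = int(np.argmin(costs))      # sum-of-absolute-differences winner
    return disp

left = add_ir_noise(np.random.rand(48, 96))
right = add_ir_noise(np.random.rand(48, 96))
disparity = block_match_disparity(left, right)      # depth = f * baseline / disparity
```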

ActiveZero: Mixed Domain Learning for Active Stereovision with Zero Annotation

Dec 06, 2021
Isabella Liu, Edward Yang, Jianyu Tao, Rui Chen, Xiaoshuai Zhang, Qing Ran, Zhu Liu, Hao Su

Traditional depth sensors generate accurate real-world depth estimates that surpass even the most advanced learning approaches trained only on simulation domains. Since ground-truth depth is readily available in the simulation domain but quite difficult to obtain in the real domain, we propose a method that leverages the best of both worlds. In this paper we present a new framework, ActiveZero, a mixed-domain learning solution for active stereovision systems that requires no real-world depth annotation. First, we demonstrate the transferability of our method to out-of-distribution real data by using a mixed-domain learning strategy. In the simulation domain, we use a combination of a supervised disparity loss and self-supervised losses on a shape-primitives dataset. By contrast, in the real domain, we use only self-supervised losses on a dataset that is out-of-distribution with respect to both the simulation training data and the real test data. Second, our method introduces a novel self-supervised loss called temporal IR reprojection to increase the robustness and accuracy of our reprojections in hard-to-perceive regions. Finally, we show how the method can be trained end-to-end and that each module is important for attaining the end result. Extensive qualitative and quantitative evaluations on real data demonstrate state-of-the-art results that can even beat a commercial depth sensor.
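
As a rough sketch of a reprojection-style self-supervised loss, the snippet below warps the right IR image into the left view using the predicted disparity and penalizes the photometric difference; this is a spatial left-right variant written under my own assumptions, whereas the paper's temporal IR reprojection compares captures across time.

```python
# Disparity-based photometric reprojection loss sketch (illustrative only).
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disparity):
    # right: (B, 1, H, W) IR image, disparity: (B, 1, H, W) in pixels
    b, _, h, w = right.shape
    xs = torch.linspace(-1, 1, w, device=right.device).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h, device=right.device).view(1, h, 1).expand(b, h, w)
    # shift x sampling coordinates left by the disparity (normalized to [-1, 1])
    xs = xs - 2.0 * disparity[:, 0] / (w - 1)
    grid = torch.stack([xs, ys], dim=-1)                     # (B, H, W, 2)
    return F.grid_sample(right, grid, align_corners=True)

def reprojection_loss(left, right, disparity):
    return (warp_right_to_left(right, disparity) - left).abs().mean()

left = torch.rand(2, 1, 64, 128)
right = torch.rand(2, 1, 64, 128)
pred_disp = torch.rand(2, 1, 64, 128, requires_grad=True) * 10
loss = reprojection_loss(left, right, pred_disp)
loss.backward()
```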

Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models

Oct 06, 2021
Jen-Hao Rick Chang, Ashish Shrivastava, Hema Swetha Koppula, Xiaoshuai Zhang, Oncel Tuzel

Controllable generative sequence models with the capability to extract and replicate the style of specific examples enable many applications, including narrating audiobooks in different voices, auto-completing and auto-correcting written handwriting, and generating missing training samples for downstream recognition tasks. However, typical training algorithms for these controllable sequence generative models suffer from the training-inference mismatch, where the same sample is used as content and style input during training but different samples are given during inference. In this paper, we tackle the training-inference mismatch encountered during unsupervised learning of controllable generative sequence models. By introducing a style transformation module that we call style equalization, we enable training using different content and style samples and thereby mitigate the training-inference mismatch. To demonstrate its generality, we applied style equalization to text-to-speech and text-to-handwriting synthesis on three datasets. Our models achieve state-of-the-art style replication with a similar mean style opinion score as the real data. Moreover, the proposed method enables style interpolation between sequences and generates novel styles.
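
A highly simplified schematic of training with mismatched content and style samples (my own toy construction; the equalization module here is a hypothetical stand-in, not the paper's architecture): the model reconstructs a target sequence from its content while the style conditioning is computed from both the target's own style embedding and that of a different reference sample.

```python
# Toy mismatched content/style training step (illustrative schematic only).
import torch

class ToySequenceModel(torch.nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.style_enc = torch.nn.GRU(dim, dim, batch_first=True)
        self.equalize = torch.nn.Linear(2 * dim, dim)     # hypothetical stand-in module
        self.decoder = torch.nn.GRU(2 * dim, dim, batch_first=True)
        self.out = torch.nn.Linear(dim, dim)

    def forward(self, content, target_seq, style_ref):
        _, s_tgt = self.style_enc(target_seq)             # style of the content sample
        _, s_ref = self.style_enc(style_ref)              # style of a different sample
        cond = self.equalize(torch.cat([s_tgt[-1], s_ref[-1]], dim=-1))
        dec_in = torch.cat([content, cond[:, None].expand_as(content)], dim=-1)
        h, _ = self.decoder(dec_in)
        return self.out(h)

model = ToySequenceModel()
content = torch.randn(4, 20, 32)        # content conditioning sequence
target = torch.randn(4, 20, 32)         # sample to reconstruct (provides its own style)
style_ref = torch.randn(4, 30, 32)      # style reference from a different sample
loss = (model(content, target, style_ref) - target).pow(2).mean()
loss.backward()
```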

MVSNeRF: Fast Generalizable Radiance Field Reconstruction from Multi-View Stereo

Mar 29, 2021
Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, Hao Su

We present MVSNeRF, a novel neural rendering approach that can efficiently reconstruct neural radiance fields for view synthesis. Unlike prior works on neural radiance fields that consider per-scene optimization on densely captured images, we propose a generic deep neural network that can reconstruct radiance fields from only three nearby input views via fast network inference. Our approach leverages plane-swept cost volumes (widely used in multi-view stereo) for geometry-aware scene reasoning, and combines this with physically based volume rendering for neural radiance field reconstruction. We train our network on real objects in the DTU dataset, and test it on three different datasets to evaluate its effectiveness and generalizability. Our approach can generalize across scenes (even indoor scenes, completely different from our training scenes of objects) and generate realistic view synthesis results using only three input images, significantly outperforming concurrent works on generalizable radiance field reconstruction. Moreover, if dense images are captured, our estimated radiance field representation can be easily fine-tuned; this leads to fast per-scene reconstruction with higher rendering quality and substantially less optimization time than NeRF.
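
The plane-swept cost volume at the core of the approach can be sketched as follows (a simplified version under my own conventions, not the MVSNeRF code): source-view feature maps are homography-warped to the reference view at a set of fronto-parallel depth planes, and the per-depth feature variance across views serves as the matching cost.

```python
# Simplified plane-sweep cost volume sketch (illustrative conventions only).
import torch
import torch.nn.functional as F

def homography(K_src, R, t, K_ref_inv, depth):
    # R, t map reference-frame points to the source frame (X_src = R X_ref + t);
    # the plane is fronto-parallel in the reference frame at the given depth
    n = torch.tensor([0.0, 0.0, 1.0])
    return K_src @ (R + torch.outer(t, n) / depth) @ K_ref_inv

def warp_to_reference(src_feat, H, h, w):
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)           # (h, w, 3)
    mapped = pix @ H.T
    uv = mapped[..., :2] / mapped[..., 2:3].clamp(min=1e-6)
    grid = torch.stack([2 * uv[..., 0] / (w - 1) - 1,
                        2 * uv[..., 1] / (h - 1) - 1], dim=-1)[None]    # (1, h, w, 2)
    return F.grid_sample(src_feat[None], grid, align_corners=True)[0]

def cost_volume(ref_feat, src_feats, cams, K_ref_inv, depths):
    c, h, w = ref_feat.shape
    costs = []
    for d in depths:
        warped = [ref_feat] + [warp_to_reference(f, homography(K, R, t, K_ref_inv, d), h, w)
                               for f, (K, R, t) in zip(src_feats, cams)]
        costs.append(torch.stack(warped).var(dim=0))   # variance across views at this depth
    return torch.stack(costs)                          # (num_depths, c, h, w)
```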

Meshing Point Clouds with Predicted Intrinsic-Extrinsic Ratio Guidance

Jul 17, 2020
Minghua Liu, Xiaoshuai Zhang, Hao Su

We are interested in reconstructing the mesh representation of object surfaces from point clouds. Surface reconstruction is a prerequisite for downstream applications such as rendering, collision avoidance for planning, and animation. However, the task is challenging if the input point cloud has a low resolution, which is common in real-world scenarios (e.g., from LiDAR or Kinect sensors). Existing learning-based mesh generative methods mostly predict the surface by first building a shape embedding at the whole-object level, a design that causes issues in generating fine-grained details and generalizing to unseen categories. Instead, we propose to leverage the input point cloud as much as possible, by only adding connectivity information to existing points. In particular, we predict which triplets of points should form faces. Our key innovation is a surrogate of local connectivity, calculated by comparing the intrinsic and extrinsic metrics. We learn to predict this surrogate using a deep point cloud network and then feed it to an efficient post-processing module for high-quality mesh generation. Experiments on synthetic and real data demonstrate that our method not only preserves details and handles ambiguous structures but also generalizes well to unseen categories.

* ECCV 2020 
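
The intrinsic/extrinsic comparison itself can be illustrated without any learning (a toy sketch under my own assumptions; the paper predicts a local surrogate with a point cloud network): approximate the intrinsic (geodesic) distance by shortest paths on a k-NN graph and divide by the extrinsic (Euclidean) distance, so that candidate connections cutting across empty space get a large ratio while on-surface connections stay near 1.

```python
# Intrinsic/extrinsic ratio toy example (illustrative, not the paper's pipeline).
import numpy as np
from scipy.sparse.csgraph import dijkstra
from scipy.spatial import cKDTree

def intrinsic_extrinsic_ratio(points, k=8):
    n = len(points)
    tree = cKDTree(points)
    dist, idx = tree.query(points, k=k + 1)            # first neighbor is the point itself
    graph = np.zeros((n, n))
    for i in range(n):
        graph[i, idx[i, 1:]] = dist[i, 1:]              # k-NN edge weights
    geodesic = dijkstra(graph, directed=False)          # intrinsic distance proxy
    euclidean = np.linalg.norm(points[:, None] - points[None], axis=-1)
    return geodesic / np.maximum(euclidean, 1e-9)

# two parallel sheets: cross-sheet pairs get a much larger ratio
# (infinite if the k-NN graph does not connect the sheets at all)
sheet = np.random.rand(200, 3) * [1, 1, 0.01]
points = np.concatenate([sheet, sheet + [0, 0, 0.2]])
ratio = intrinsic_extrinsic_ratio(points)
```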

Disentangled Image Matting

Sep 10, 2019
Shaofan Cai, Xiaoshuai Zhang, Haoqiang Fan, Haibin Huang, Jiangyu Liu, Jiaming Liu, Jiaying Liu, Jue Wang, Jian Sun

Most previous image matting methods require a roughly specified trimap as input and estimate fractional alpha values for all pixels in the unknown region of the trimap. In this paper, we argue that directly estimating the alpha matte from a coarse trimap is a major limitation of previous methods, as this practice tries to address two difficult and inherently different problems at the same time: identifying true blending pixels inside the trimap region, and estimating accurate alpha values for them. We propose AdaMatting, a new end-to-end matting framework that disentangles this problem into two sub-tasks: trimap adaptation and alpha estimation. Trimap adaptation is a pixel-wise classification problem that infers the global structure of the input image by identifying definite foreground, background, and semi-transparent image regions. Alpha estimation is a regression problem that calculates the opacity value of each blended pixel. Our method handles these two sub-tasks separately within a single deep convolutional neural network (CNN). Extensive experiments show that AdaMatting has additional structure awareness and trimap fault-tolerance. Our method achieves state-of-the-art performance on the Adobe Composition-1k dataset both qualitatively and quantitatively. It is also the current best-performing method on the alphamatting.com online evaluation for all commonly used metrics.
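
A minimal sketch of the two-sub-task decomposition (assumed structure, not the AdaMatting architecture): a shared encoder feeds a classification head that refines the trimap and a regression head that predicts alpha, so the two losses can be trained jointly.

```python
# Two-headed matting network sketch (illustrative only).
import torch

class TwoHeadMatting(torch.nn.Module):
    def __init__(self, feat=32):
        super().__init__()
        self.encoder = torch.nn.Sequential(
            torch.nn.Conv2d(3 + 3, feat, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(feat, feat, 3, padding=1), torch.nn.ReLU())
        self.trimap_head = torch.nn.Conv2d(feat, 3, 1)   # FG / BG / unknown logits
        self.alpha_head = torch.nn.Conv2d(feat, 1, 1)    # alpha regression

    def forward(self, image, trimap_onehot):
        feat = self.encoder(torch.cat([image, trimap_onehot], dim=1))
        return self.trimap_head(feat), torch.sigmoid(self.alpha_head(feat))

model = TwoHeadMatting()
image = torch.rand(1, 3, 64, 64)
trimap = torch.rand(1, 3, 64, 64)
trimap_logits, alpha = model(image, trimap)
# joint training: cross-entropy on trimap_logits + L1 on alpha in the unknown region
```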
