Score distillation sampling~(SDS) has been widely adopted to overcome the absence of unseen views in reconstructing 3D objects from a \textbf{single} image. It leverages pre-trained 2D diffusion models as teacher to guide the reconstruction of student 3D models. Despite their remarkable success, SDS-based methods often encounter geometric artifacts and texture saturation. We find out the crux is the overlooked indiscriminate treatment of diffusion time-steps during optimization: it unreasonably treats the student-teacher knowledge distillation to be equal at all time-steps and thus entangles coarse-grained and fine-grained modeling. Therefore, we propose the Diffusion Time-step Curriculum one-image-to-3D pipeline (DTC123), which involves both the teacher and student models collaborating with the time-step curriculum in a coarse-to-fine manner. Extensive experiments on NeRF4, RealFusion15, GSO and Level50 benchmark demonstrate that DTC123 can produce multi-view consistent, high-quality, and diverse 3D assets. Codes and more generation demos will be released in https://github.com/yxymessi/DTC123.
Text-to-3D generation has shown great promise in generating novel 3D content based on given text prompts. However, existing generative methods mostly focus on geometric or visual plausibility while ignoring precise physics perception for the generated 3D shapes. This greatly hinders the practicality of generated 3D shapes in real-world applications. In this work, we propose Phy3DGen, a precise-physics-driven text-to-3D generation method. By analyzing the solid mechanics of generated 3D shapes, we reveal that the 3D shapes generated by existing text-to-3D generation methods are impractical for real-world applications as the generated 3D shapes do not conform to the laws of physics. To this end, we leverage 3D diffusion models to provide 3D shape priors and design a data-driven differentiable physics layer to optimize 3D shape priors with solid mechanics. This allows us to optimize geometry efficiently and learn precise physics information about 3D shapes at the same time. Experimental results demonstrate that our method can consider both geometric plausibility and precise physics perception, further bridging 3D virtual modeling and precise physical worlds.
Coordinate-based neural implicit representation or implicit fields have been widely studied for 3D geometry representation or novel view synthesis. Recently, a series of efforts have been devoted to accelerating the speed and improving the quality of the coordinate-based implicit field learning. Instead of learning heavy MLPs to predict the neural implicit values for the query coordinates, neural voxels or grids combined with shallow MLPs have been proposed to achieve high-quality implicit field learning with reduced optimization time. On the other hand, lightweight field representations such as linear grid have been proposed to further improve the learning speed. In this paper, we aim for both fast and high-quality implicit field learning, and propose TaylorGrid, a novel implicit field representation which can be efficiently computed via direct Taylor expansion optimization on 2D or 3D grids. As a general representation, TaylorGrid can be adapted to different implicit fields learning tasks such as SDF learning or NeRF. From extensive quantitative and qualitative comparisons, TaylorGrid achieves a balance between the linear grid and neural voxels, showing its superiority in fast and high-quality implicit field learning.
Surface reconstruction has traditionally relied on the Multi-View Stereo (MVS)-based pipeline, which often suffers from noisy and incomplete geometry. This is due to that although MVS has been proven to be an effective way to recover the geometry of the scenes, especially for locally detailed areas with rich textures, it struggles to deal with areas with low texture and large variations of illumination where the photometric consistency is unreliable. Recently, Neural Implicit Surface Reconstruction (NISR) combines surface rendering and volume rendering techniques and bypasses the MVS as an intermediate step, which has emerged as a promising alternative to overcome the limitations of traditional pipelines. While NISR has shown impressive results on simple scenes, it remains challenging to recover delicate geometry from uncontrolled real-world scenes which is caused by its underconstrained optimization. To this end, the framework PSDF is proposed which resorts to external geometric priors from a pretrained MVS network and internal geometric priors inherent in the NISR model to facilitate high-quality neural implicit surface learning. Specifically, the visibility-aware feature consistency loss and depth prior-assisted sampling based on external geometric priors are introduced. These proposals provide powerfully geometric consistency constraints and aid in locating surface intersection points, thereby significantly improving the accuracy and delicate reconstruction of NISR. Meanwhile, the internal prior-guided importance rendering is presented to enhance the fidelity of the reconstructed surface mesh by mitigating the biased rendering issue in NISR. Extensive experiments on the Tanks and Temples dataset show that PSDF achieves state-of-the-art performance on complex uncontrolled scenes.
Neural surfaces learning has shown impressive performance in multi-view surface reconstruction. However, most existing methods use large multilayer perceptrons (MLPs) to train their models from scratch, resulting in hours of training for a single scene. Recently, how to accelerate the neural surfaces learning has received a lot of attention and remains an open problem. In this work, we propose a prior-based residual learning paradigm for fast multi-view neural surface reconstruction. This paradigm consists of two optimization stages. In the first stage, we propose to leverage generalization models to generate a basis signed distance function (SDF) field. This initial field can be quickly obtained by fusing multiple local SDF fields produced by generalization models. This provides a coarse global geometry prior. Based on this prior, in the second stage, a fast residual learning strategy based on hash-encoding networks is proposed to encode an offset SDF field for the basis SDF field. Moreover, we introduce a prior-guided sampling scheme to help the residual learning stage converge better, and thus recover finer structures. With our designed paradigm, experimental results show that our method only takes about 3 minutes to reconstruct the surface of a single scene, while achieving competitive surface quality. Our code will be released upon publication.
Point cloud registration, a fundamental task in 3D computer vision, has remained largely unexplored in cross-source point clouds and unstructured scenes. The primary challenges arise from noise, outliers, and variations in scale and density. However, neglected geometric natures of point clouds restricts the performance of current methods. In this paper, we propose a novel method termed SPEAL to leverage skeletal representations for effective learning of intrinsic topologies of point clouds, facilitating robust capture of geometric intricacy. Specifically, we design the Skeleton Extraction Module to extract skeleton points and skeletal features in an unsupervised manner, which is inherently robust to noise and density variances. Then, we propose the Skeleton-Aware GeoTransformer to encode high-level skeleton-aware features. It explicitly captures the topological natures and inter-point-cloud skeletal correlations with the noise-robust and density-invariant skeletal representations. Next, we introduce the Correspondence Dual-Sampler to facilitate correspondences by augmenting the correspondence set with skeletal correspondences. Furthermore, we construct a challenging novel large-scale cross-source point cloud dataset named KITTI CrossSource for benchmarking cross-source point cloud registration methods. Extensive quantitative and qualitative experiments are conducted to demonstrate our approach's superiority and robustness on both cross-source and same-source datasets. To the best of our knowledge, our approach is the first to facilitate point cloud registration with skeletal geometric priors.
As a fundamental problem in computer vision, multi-view stereo (MVS) aims at recovering the 3D geometry of a target from a set of 2D images. Recent advances in MVS have shown that it is important to perceive non-local structured information for recovering geometry in low-textured areas. In this work, we propose a Hierarchical Prior Mining for Non-local Multi-View Stereo (HPM-MVS). The key characteristics are the following techniques that exploit non-local information to assist MVS: 1) A Non-local Extensible Sampling Pattern (NESP), which is able to adaptively change the size of sampled areas without becoming snared in locally optimal solutions. 2) A new approach to leverage non-local reliable points and construct a planar prior model based on K-Nearest Neighbor (KNN), to obtain potential hypotheses for the regions where prior construction is challenging. 3) A Hierarchical Prior Mining (HPM) framework, which is used to mine extensive non-local prior information at different scales to assist 3D model recovery, this strategy can achieve a considerable balance between the reconstruction of details and low-textured areas. Experimental results on the ETH3D and Tanks \& Temples have verified the superior performance and strong generalization capability of our method. Our code will be released.
Recently, neural implicit surfaces learning by volume rendering has become popular for multi-view reconstruction. However, one key challenge remains: existing approaches lack explicit multi-view geometry constraints, hence usually fail to generate geometry consistent surface reconstruction. To address this challenge, we propose geometry-consistent neural implicit surfaces learning for multi-view reconstruction. We theoretically analyze that there exists a gap between the volume rendering integral and point-based signed distance function (SDF) modeling. To bridge this gap, we directly locate the zero-level set of SDF networks and explicitly perform multi-view geometry optimization by leveraging the sparse geometry from structure from motion (SFM) and photometric consistency in multi-view stereo. This makes our SDF optimization unbiased and allows the multi-view geometry constraints to focus on the true surface optimization. Extensive experiments show that our proposed method achieves high-quality surface reconstruction in both complex thin structures and large smooth regions, thus outperforming the state-of-the-arts by a large margin.
In deep multi-view stereo networks, cost regularization is crucial to achieve accurate depth estimation. Since 3D cost volume filtering is usually memory-consuming, recurrent 2D cost map regularization has recently become popular and has shown great potential in reconstructing 3D models of different scales. However, existing recurrent methods only model the local dependencies in the depth domain, which greatly limits the capability of capturing the global scene context along the depth dimension. To tackle this limitation, we propose a novel non-local recurrent regularization network for multi-view stereo, named NR2-Net. Specifically, we design a depth attention module to capture non-local depth interactions within a sliding depth block. Then, the global scene context between different blocks is modeled in a gated recurrent manner. This way, the long-range dependencies along the depth dimension are captured to facilitate the cost regularization. Moreover, we design a dynamic depth map fusion strategy to improve the algorithm robustness. Our method achieves state-of-the-art reconstruction results on both DTU and Tanks and Temples datasets.
Recently, learning-based multi-view stereo methods have achieved promising results. However, they all overlook the visibility difference among different views, which leads to an indiscriminate multi-view similarity definition and greatly limits their performance on datasets with strong viewpoint variations. In this paper, a Pixelwise Visibility-aware multi-view Stereo Network (PVSNet) is proposed for robust dense 3D reconstruction. We present a pixelwise visibility network to learn the visibility information for different neighboring images before computing the multi-view similarity, and then construct an adaptive weighted cost volume with the visibility information. Moreover, we present an anti-noise training strategy that introduces disturbing views during model training to make the pixelwise visibility network more distinguishable to unrelated views, which is different with the existing learning methods that only use two best neighboring views for training. To the best of our knowledge, PVSNet is the first deep learning framework that is able to capture the visibility information of different neighboring views. In this way, our method can be generalized well to different types of datasets, especially the ETH3D high-res benchmark with strong viewpoint variations. Extensive experiments show that PVSNet achieves the state-of-the-art performance on different datasets.