Multi-instance registration is a challenging problem in computer vision and robotics, where multiple instances of an object need to be registered in a standard coordinate system. In this work, we propose the first iterative framework called instance-by-instance (IBI) for multi-instance 3D registration (MI-3DReg). It successively registers all instances in a given scenario, starting from the easiest and progressing to more challenging ones. Throughout the iterative process, outliers are eliminated continuously, leading to an increasing inlier rate for the remaining and more challenging instances. Under the IBI framework, we further propose a sparse-to-dense-correspondence-based multi-instance registration method (IBI-S2DC) to achieve robust MI-3DReg. Experiments on the synthetic and real datasets have demonstrated the effectiveness of IBI and suggested the new state-of-the-art performance of IBI-S2DC, e.g., our MHF1 is 12.02%/12.35% higher than the existing state-of-the-art method ECC on the synthetic/real datasets.
One-shot 3D talking portrait generation aims to reconstruct a 3D avatar from an unseen image, and then animate it with a reference video or audio to generate a talking portrait video. The existing methods fail to simultaneously achieve the goals of accurate 3D avatar reconstruction and stable talking face animation. Besides, while the existing works mainly focus on synthesizing the head part, it is also vital to generate natural torso and background segments to obtain a realistic talking portrait video. To address these limitations, we present Real3D-Potrait, a framework that (1) improves the one-shot 3D reconstruction power with a large image-to-plane model that distills 3D prior knowledge from a 3D face generative model; (2) facilitates accurate motion-conditioned animation with an efficient motion adapter; (3) synthesizes realistic video with natural torso movement and switchable background using a head-torso-background super-resolution model; and (4) supports one-shot audio-driven talking face generation with a generalizable audio-to-motion model. Extensive experiments show that Real3D-Portrait generalizes well to unseen identities and generates more realistic talking portrait videos compared to previous methods. Video samples and source code are available at https://real3dportrait.github.io .
Multi-perspective cameras with potentially non-overlapping fields of view have become an important exteroceptive sensing modality in a number of applications such as intelligent vehicles, drones, and mixed reality headsets. In this work, we challenge one of the basic assumptions made in these scenarios, which is that the multi-camera rig is rigid. More specifically, we are considering the problem of estimating the relative pose between a static non-rigid rig in different spatial orientations while taking into account the effect of gravity onto the system. The deformable physical connections between each camera and the body center are approximated by a simple cantilever model, and inserted into the generalized epipolar constraint. Our results lead us to the important insight that the latent parameters of the deformation model, meaning the gravity vector in both views, become observable. We present a concise analysis of the observability of all variables based on noise, outliers, and rig rigidity for two different algorithms. The first one is a vision-only alternative, while the second one makes use of additional gravity measurements. To conclude, we demonstrate the ability to sense gravity in a real-world example, and discuss practical implications.
Data collected by different modalities can provide a wealth of complementary information, such as hyperspectral image (HSI) to offer rich spectral-spatial properties, synthetic aperture radar (SAR) to provide structural information about the Earth's surface, and light detection and ranging (LiDAR) to cover altitude information about ground elevation. Therefore, a natural idea is to combine multimodal images for refined and accurate land-cover interpretation. Although many efforts have been attempted to achieve multi-source remote sensing image classification, there are still three issues as follows: 1) indiscriminate feature representation without sufficiently considering modal heterogeneity, 2) abundant features and complex computations associated with modeling long-range dependencies, and 3) overfitting phenomenon caused by sparsely labeled samples. To overcome the above barriers, a transformer-based heterogeneously salient graph representation (THSGR) approach is proposed in this paper. First, a multimodal heterogeneous graph encoder is presented to encode distinctively non-Euclidean structural features from heterogeneous data. Then, a self-attention-free multi-convolutional modulator is designed for effective and efficient long-term dependency modeling. Finally, a mean forward is put forward in order to avoid overfitting. Based on the above structures, the proposed model is able to break through modal gaps to obtain differentiated graph representation with competitive time cost, even for a small fraction of training samples. Experiments and analyses on three benchmark datasets with various state-of-the-art (SOTA) methods show the performance of the proposed approach.
We propose a novel framework for RGB-based category-level 6D object pose and size estimation. Our approach relies on the prediction of normalized object coordinate space (NOCS), which serves as an efficient and effective object canonical representation that can be extracted from RGB images. Unlike previous approaches that heavily relied on additional depth readings as input, our novelty lies in leveraging multi-view information, which is commonly available in practical scenarios where a moving camera continuously observes the environment. By introducing multi-view constraints, we can obtain accurate camera pose and depth estimation from a monocular dense SLAM framework. Additionally, by incorporating constraints on the camera relative pose, we can apply trimming strategies and robust pose averaging on the multi-view object poses, resulting in more accurate and robust estimations of category-level object poses even in the absence of direct depth readings. Furthermore, we introduce a novel NOCS prediction network that significantly improves performance. Our experimental results demonstrate the strong performance of our proposed method, even comparable to state-of-the-art RGB-D methods across public dataset sequences. Additionally, we showcase the generalization ability of our method by evaluating it on self-collected datasets.
Dynamic vision sensors or event cameras provide rich complementary information for video frame interpolation. Existing state-of-the-art methods follow the paradigm of combining both synthesis-based and warping networks. However, few of those methods fully respect the intrinsic characteristics of events streams. Given that event cameras only encode intensity changes and polarity rather than color intensities, estimating optical flow from events is arguably more difficult than from RGB information. We therefore propose to incorporate RGB information in an event-guided optical flow refinement strategy. Moreover, in light of the quasi-continuous nature of the time signals provided by event cameras, we propose a divide-and-conquer strategy in which event-based intermediate frame synthesis happens incrementally in multiple simplified stages rather than in a single, long stage. Extensive experiments on both synthetic and real-world datasets show that these modifications lead to more reliable and realistic intermediate frame results than previous video frame interpolation methods. Our findings underline that a careful consideration of event characteristics such as high temporal density and elevated noise benefits interpolation accuracy.
As a fundamental problem in computer vision, 3D point cloud registration (PCR) aims to seek the optimal pose to align a point cloud pair. In this paper, we present a 3D registration method with maximal cliques (MAC). The key insight is to loosen the previous maximum clique constraint, and mine more local consensus information in a graph for accurate pose hypotheses generation: 1) A compatibility graph is constructed to render the affinity relationship between initial correspondences. 2) We search for maximal cliques in the graph, each of which represents a consensus set. We perform node-guided clique selection then, where each node corresponds to the maximal clique with the greatest graph weight. 3) Transformation hypotheses are computed for the selected cliques by the SVD algorithm and the best hypothesis is used to perform registration. Extensive experiments on U3M, 3DMatch, 3DLoMatch and KITTI demonstrate that MAC effectively increases registration accuracy, outperforms various state-of-the-art methods and boosts the performance of deep-learned methods. MAC combined with deep-learned methods achieves state-of-the-art registration recall of 95.7% / 78.9% on 3DMatch / 3DLoMatch.
As a fundamental problem in computer vision, multi-view stereo (MVS) aims at recovering the 3D geometry of a target from a set of 2D images. Recent advances in MVS have shown that it is important to perceive non-local structured information for recovering geometry in low-textured areas. In this work, we propose a Hierarchical Prior Mining for Non-local Multi-View Stereo (HPM-MVS). The key characteristics are the following techniques that exploit non-local information to assist MVS: 1) A Non-local Extensible Sampling Pattern (NESP), which is able to adaptively change the size of sampled areas without becoming snared in locally optimal solutions. 2) A new approach to leverage non-local reliable points and construct a planar prior model based on K-Nearest Neighbor (KNN), to obtain potential hypotheses for the regions where prior construction is challenging. 3) A Hierarchical Prior Mining (HPM) framework, which is used to mine extensive non-local prior information at different scales to assist 3D model recovery, this strategy can achieve a considerable balance between the reconstruction of details and low-textured areas. Experimental results on the ETH3D and Tanks \& Temples have verified the superior performance and strong generalization capability of our method. Our code will be released.
3D single object tracking (SOT) is an indispensable part of automated driving. Existing approaches rely heavily on large, densely labeled datasets. However, annotating point clouds is both costly and time-consuming. Inspired by the great success of cycle tracking in unsupervised 2D SOT, we introduce the first semi-supervised approach to 3D SOT. Specifically, we introduce two cycle-consistency strategies for supervision: 1) Self tracking cycles, which leverage labels to help the model converge better in the early stages of training; 2) forward-backward cycles, which strengthen the tracker's robustness to motion variations and the template noise caused by the template update strategy. Furthermore, we propose a data augmentation strategy named SOTMixup to improve the tracker's robustness to point cloud diversity. SOTMixup generates training samples by sampling points in two point clouds with a mixing rate and assigns a reasonable loss weight for training according to the mixing rate. The resulting MixCycle approach generalizes to appearance matching-based trackers. On the KITTI benchmark, based on the P2B tracker, MixCycle trained with $\textbf{10%}$ labels outperforms P2B trained with $\textbf{100%}$ labels, and achieves a $\textbf{28.4%}$ precision improvement when using $\textbf{1%}$ labels. Our code will be publicly released.
As the gold standard for phase retrieval, phase-shifting algorithm (PS) has been widely used in optical interferometry, fringe projection profilometry, etc. However, capturing multiple fringe patterns in PS limits the algorithm to only a narrow range of application. To this end, a deep learning (DL) model based digital PS algorithm from only a single fringe image is proposed. By training on a simulated dataset of PS fringe patterns, the learnt model, denoted PSNet, can predict fringe patterns with other PS steps when given a pattern with the first PS step. Simulation and experiment results demonstrate the PSNet's promising performance on accurate prediction of digital PS patterns, and robustness to complex scenarios such as surfaces with varying curvature and reflectance.