Long Quan

Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency

Jul 24, 2020

Jiaxiang Shang, Tianwei Shen, Shiwei Li, Lei Zhou, Mingmin Zhen, Tian Fang, Long Quan

Abstract: Recent learning-based approaches, in which models are trained on single-view images, have shown promising results for monocular 3D face reconstruction, but they suffer from ill-posed face pose and depth ambiguity. In contrast to previous works that only enforce 2D feature constraints, we propose a self-supervised training architecture that leverages multi-view geometry consistency, which provides reliable constraints on face pose and depth estimation. We first propose an occlusion-aware view synthesis method to apply multi-view geometry consistency to self-supervised learning. Then we design three novel loss functions for multi-view consistency: the pixel consistency loss, the depth consistency loss, and the facial landmark-based epipolar loss. Our method is accurate and robust, especially under large variations of expressions, poses, and illumination conditions. Comprehensive experiments on face alignment and 3D face reconstruction benchmarks demonstrate superiority over state-of-the-art methods. Our code and model are released at https://github.com/jiaxiangshang/MGCNet.

* Accepted to ECCV 2020, supplementary materials included
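
As a rough illustration of what a landmark-based epipolar constraint measures, the sketch below computes the algebraic residual |x2^T E x1| for matched landmarks in two views. It is a minimal sketch under stated assumptions, not the loss from the paper: the function names are hypothetical, landmarks are assumed to be in normalized camera coordinates, and the essential matrix E is assumed to be given (here, a pure sideways translation with no rotation).

```python
def epipolar_residual(E, x1, x2):
    """Algebraic epipolar residual |x2^T E x1| for one correspondence."""
    Ex1 = [sum(E[r][c] * x1[c] for c in range(3)) for r in range(3)]
    return abs(sum(x2[r] * Ex1[r] for r in range(3)))


def landmark_epipolar_loss(E, landmarks_a, landmarks_b):
    """Mean residual over matched (x, y) landmarks (hypothetical helper).

    Landmarks are assumed to be in normalized camera coordinates.
    """
    total = 0.0
    for (xa, ya), (xb, yb) in zip(landmarks_a, landmarks_b):
        total += epipolar_residual(E, (xa, ya, 1.0), (xb, yb, 1.0))
    return total / len(landmarks_a)


# Assumed example: translation t = (1, 0, 0), rotation R = I, so E = [t]_x.
E = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

# Landmarks that only shift horizontally satisfy the constraint.
consistent = landmark_epipolar_loss(
    E, [(0.1, 0.2), (-0.3, 0.4)], [(0.3, 0.2), (-0.1, 0.4)])

# A vertical offset violates it and yields a positive residual.
inconsistent = landmark_epipolar_loss(E, [(0.1, 0.2)], [(0.3, 0.3)])
```

Under this camera motion, matched points must share the same vertical coordinate, so the first pair of landmark sets gives a zero residual and the second does not.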

ASLFeat: Learning Local Features of Accurate Shape and Localization

Apr 19, 2020

Zixin Luo, Lei Zhou, Xuyang Bai, Hongkai Chen, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, Long Quan

Abstract: This work focuses on mitigating two limitations in the joint learning of local feature detectors and descriptors. First, the ability to estimate the local shape (scale, orientation, etc.) of feature points is often neglected during dense feature extraction, although shape awareness is crucial for acquiring stronger geometric invariance. Second, the localization accuracy of detected keypoints is not sufficient to reliably recover camera geometry, which has become the bottleneck in tasks such as 3D reconstruction. In this paper, we present ASLFeat, with three lightweight yet effective modifications to mitigate the above issues. First, we resort to deformable convolutional networks to densely estimate and apply local transformations. Second, we take advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization. Finally, we use a peakiness measurement to relate feature responses and derive more indicative detection scores. The effect of each modification is thoroughly studied, and the evaluation is conducted extensively across a variety of practical scenarios. State-of-the-art results are reported that demonstrate the superiority of our methods.

* Accepted to CVPR 2020, supplementary materials included, code available

KFNet: Learning Temporal Camera Relocalization using Kalman Filtering

Mar 24, 2020

Lei Zhou, Zixin Luo, Tianwei Shen, Jiahui Zhang, Mingmin Zhen, Yao Yao, Tian Fang, Long Quan

Abstract: Temporal camera relocalization estimates the pose with respect to each video frame in sequence, as opposed to one-shot relocalization, which focuses on a still image. Even though the time dependency has been taken into account, current temporal relocalization methods still generally underperform the state-of-the-art one-shot approaches in terms of accuracy. In this work, we improve the temporal relocalization method by using a network architecture that incorporates Kalman filtering (KFNet) for online camera relocalization. In particular, KFNet extends the scene coordinate regression problem to the time domain in order to recursively establish 2D and 3D correspondences for pose determination. The network architecture design and the loss formulation are based on Kalman filtering in the context of Bayesian learning. Extensive experiments on multiple relocalization benchmarks demonstrate that KFNet achieves the highest accuracy among both one-shot and temporal relocalization approaches. Our code is released at https://github.com/zlthinker/KFNet.

* An oral paper of CVPR 2020
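
For readers unfamiliar with Kalman filtering, the recursive idea can be sketched with the textbook scalar update below: a prior carried over from the previous frame is fused with a noisy measurement from the current frame, weighted by their variances. This is only an illustration of the classical filter, not KFNet itself, which the abstract describes as a learned network; all function and parameter names here are hypothetical, and the motion model is assumed to be the identity.

```python
def kalman_update(prior_mean, prior_var, meas_mean, meas_var):
    """Fuse a Gaussian prior with a Gaussian measurement (scalar case)."""
    gain = prior_var / (prior_var + meas_var)  # Kalman gain in [0, 1]
    post_mean = prior_mean + gain * (meas_mean - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var


def filter_sequence(measurements, meas_var, process_var, init_mean, init_var):
    """Run predict/update over a sequence, assuming an identity motion model."""
    mean, var = init_mean, init_var
    estimates = []
    for z in measurements:
        var = var + process_var  # predict: uncertainty grows between frames
        mean, var = kalman_update(mean, var, z, meas_var)
        estimates.append((mean, var))
    return estimates


# Equal prior and measurement variance: the posterior lands halfway between.
posterior = kalman_update(0.0, 1.0, 2.0, 1.0)

# Noisy observations of a quantity near 1.0, filtered frame by frame.
track = filter_sequence([1.2, 0.9, 1.1, 1.0], meas_var=0.5,
                        process_var=0.01, init_mean=0.0, init_var=10.0)
```

The posterior variance is never larger than the predicted prior variance, which is why accumulating evidence over frames can, in principle, yield more stable estimates than treating each frame alone.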

D3Feat: Joint Learning of Dense Detection and Description of 3D Local Features

Mar 06, 2020

Xuyang Bai, Zixin Luo, Lei Zhou, Hongbo Fu, Long Quan, Chiew-Lan Tai

Abstract: Successful point cloud registration often relies on the robust establishment of sparse matches through discriminative 3D local features. Despite the fast evolution of learning-based 3D feature descriptors, little attention has been paid to the learning of 3D feature detectors, and even less to the joint learning of the two tasks. In this paper, we leverage a 3D fully convolutional network for 3D point clouds, and propose a novel and practical learning mechanism that densely predicts both a detection score and a description feature for each 3D point. In particular, we propose a keypoint selection strategy that overcomes the inherent density variations of 3D point clouds, and further propose a self-supervised detector loss guided by the on-the-fly feature matching results during training. Finally, our method achieves state-of-the-art results in both indoor and outdoor scenarios, evaluated on the 3DMatch and KITTI datasets, and shows strong generalization ability on the ETH dataset. Towards practical use, we show that by adopting a reliable feature detector, sampling a smaller number of features is sufficient to achieve accurate and fast point cloud alignment. Code release: https://github.com/XuyangBai/D3Feat

* Accepted to CVPR 2020, supplementary materials included

BlendedMVS: A Large-scale Dataset for Generalized Multi-view Stereo Networks

Nov 22, 2019

Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, Long Quan

Abstract: While deep learning has recently achieved great success on multi-view stereo (MVS), limited training data makes it hard for trained models to generalize to unseen scenarios. Compared with other computer vision tasks, it is rather difficult to collect a large-scale MVS dataset, as it requires expensive active scanners and a labor-intensive process to obtain ground truth 3D structures. In this paper, we introduce BlendedMVS, a novel large-scale dataset, to provide sufficient training ground truth for learning-based MVS. To create the dataset, we apply a 3D reconstruction pipeline to recover high-quality textured meshes from images of well-selected scenes. Then, we render these mesh models to color images and depth maps. The rendered color images are further blended with the input images to generate photo-realistic blended images as the training input. Our dataset contains over 17k high-resolution images covering a variety of scenes, including cities, architecture, sculptures and small objects. Extensive experiments demonstrate that BlendedMVS endows the trained model with significantly better generalization ability compared with other MVS datasets. The entire dataset with pretrained models will be made publicly available at https://github.com/YoYo000/BlendedMVS.

Self-Supervised Learning of Depth and Motion Under Photometric Inconsistency

Sep 19, 2019

Tianwei Shen, Lei Zhou, Zixin Luo, Yao Yao, Shiwei Li, Jiahui Zhang, Tian Fang, Long Quan

Abstract: Self-supervised learning of depth and pose from monocular sequences is an attractive solution: by using the photometric consistency of nearby frames, it depends much less on ground-truth data. In this paper, we address the case where the assumptions of previous self-supervised approaches are violated due to the dynamic nature of real-world scenes. Rather than handling the noise as uncertainty, our key idea is to incorporate more robust geometric quantities and enforce internal consistency in the temporal image sequence. As demonstrated on commonly used benchmark datasets, the proposed method substantially improves on state-of-the-art methods in both depth and relative pose estimation for monocular image sequences, without adding inference overhead.

* International Conference on Computer Vision (ICCV) Workshop 2019

Learning Two-View Correspondences and Geometry Using Order-Aware Network

Aug 14, 2019

Jiahui Zhang, Dawei Sun, Zixin Luo, Anbang Yao, Lei Zhou, Tianwei Shen, Yurong Chen, Long Quan, Hongen Liao

Abstract: Establishing correspondences between two images requires both local and global spatial context. Given putative correspondences of feature points in two views, we propose the Order-Aware Network, which infers the probabilities of correspondences being inliers and regresses the relative pose encoded by the essential matrix. Specifically, the proposed network is built hierarchically and comprises three novel operations. First, to capture the local context of sparse correspondences, the network clusters unordered input correspondences by learning a soft assignment matrix. These clusters are in a canonical order and invariant to input permutations. Next, the clusters are spatially correlated to form the global context of correspondences. After that, the context-encoded clusters are recovered back to the original size through a proposed upsampling operator. We experiment extensively on both outdoor and indoor datasets. The accuracy of the two-view geometry and correspondences is significantly improved over the state of the art. Code will be available at https://github.com/zjhthu/OANet.git.

* Accepted to ICCV 2019, and winning solution in both tracks of the CVPR IMW 2019 Challenge. Code will be available soon at https://github.com/zjhthu/OANet.git
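
The claim that soft-assignment clusters are "in a canonical order and invariant to input permutations" can be checked with a small sketch. The code below is a simplified illustration, not the released OANet implementation: the scoring weights stand in for learned parameters, the function name is hypothetical, and each point is scored independently by a plain dot product.

```python
import math


def soft_assign_pool(features, weights):
    """Pool N unordered feature vectors into M clusters via a soft assignment.

    features: list of N vectors of length C.
    weights:  list of M vectors of length C (hypothetical learned parameters).
    """
    n, channels = len(features), len(features[0])
    # Each point is scored on its own, so reordering points reorders rows only.
    scores = [[sum(w * f for w, f in zip(wvec, fvec)) for wvec in weights]
              for fvec in features]
    clusters = []
    for j in range(len(weights)):
        column = [scores[i][j] for i in range(n)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in column]
        norm = sum(exps)
        # Softmax over the points, then a weighted sum of point features.
        clusters.append([sum(exps[i] / norm * features[i][c] for i in range(n))
                         for c in range(channels)])
    return clusters


points = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

pooled = soft_assign_pool(points, weights)
pooled_shuffled = soft_assign_pool([points[2], points[0], points[3], points[1]],
                                   weights)
```

Cluster j is always the one produced by weight vector j, regardless of how the input points are ordered, so `pooled` and `pooled_shuffled` agree up to floating-point rounding.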

Learning Fully Dense Neural Networks for Image Semantic Segmentation

May 22, 2019

Mingmin Zhen, Jinglu Wang, Lei Zhou, Tian Fang, Long Quan

Abstract: Semantic segmentation is pixel-wise classification that retains critical spatial information. "Feature map reuse" has been commonly adopted in CNN-based approaches to take advantage of feature maps in the early layers for the later spatial reconstruction. Along this direction, we go a step further by proposing a fully dense neural network with an encoder-decoder structure, which we abbreviate as FDNet. For each stage in the decoder module, feature maps of all the previous blocks are adaptively aggregated and fed forward as input. On the one hand, this reconstructs the spatial boundaries accurately. On the other hand, the network learns more efficiently thanks to more efficient gradient backpropagation. In addition, we propose a boundary-aware loss function that focuses more attention on the pixels near the boundary, which improves the labeling of "hard examples". FDNet achieves the best performance over previous works on two benchmark datasets, PASCAL VOC 2012 and NYUDv2, when training on other datasets is not considered.

* AAAI 2019

ContextDesc: Local Descriptor Augmentation with Cross-Modality Context

Apr 08, 2019

Zixin Luo, Tianwei Shen, Lei Zhou, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, Long Quan

Abstract: Most existing studies on learning local features focus on patch-based descriptions of individual keypoints, while neglecting the spatial relations established by their keypoint locations. In this paper, we go beyond the local detail representation by introducing context awareness to augment off-the-shelf local feature descriptors. Specifically, we propose a unified learning framework that leverages and aggregates cross-modality contextual information, including (i) visual context from high-level image representation, and (ii) geometric context from 2D keypoint distribution. Moreover, we propose an effective N-pair loss that eschews the empirical hyper-parameter search and improves convergence. The proposed augmentation scheme is lightweight compared with the raw local feature description, yet improves results remarkably on several large-scale benchmarks with diversified scenes, which demonstrates both strong practicality and generalization ability in geometric matching applications.

* Accepted to CVPR 2019 (oral), supplementary materials included. (https://github.com/lzx551402/contextdesc)

Recurrent MVSNet for High-resolution Multi-view Stereo Depth Inference

Feb 27, 2019

Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, Long Quan

Abstract: Deep learning has recently demonstrated excellent performance for multi-view stereo (MVS). However, one major limitation of current learned MVS approaches is scalability: the memory-consuming cost volume regularization makes learned MVS hard to apply to high-resolution scenes. In this paper, we introduce a scalable multi-view stereo framework based on the recurrent neural network. Instead of regularizing the entire 3D cost volume in one go, the proposed Recurrent Multi-view Stereo Network (R-MVSNet) sequentially regularizes the 2D cost maps along the depth direction via the gated recurrent unit (GRU). This dramatically reduces memory consumption and makes high-resolution reconstruction feasible. We first show the state-of-the-art performance achieved by the proposed R-MVSNet on recent MVS benchmarks. Then, we further demonstrate the scalability of the proposed method on several large-scale scenarios, where previous learned approaches often fail due to the memory constraint. Code is available at https://github.com/YoYo000/MVSNet.

* Accepted to CVPR 2019
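
The memory argument in the abstract can be illustrated with a toy sketch that processes one 2D cost map per depth hypothesis and keeps only a per-pixel recurrent state, rather than holding a full H x W x D volume. This is a sketch under loud assumptions, not R-MVSNet: a fixed blending gate stands in for the learned GRU, the per-pixel running arg-max is an assumed way to read out depth, and all names are hypothetical.

```python
def sequential_depth_inference(cost_maps, depths, gate=0.5):
    """Stream over depth hypotheses, keeping only O(H*W) state.

    cost_maps: iterable of H x W matching-score maps (higher is better),
               one per entry in depths; it may be a lazy generator.
    gate:      fixed blend factor standing in for a learned GRU gate.
    """
    hidden = best_score = best_depth = None
    for depth, cost in zip(depths, cost_maps):
        if hidden is None:
            hidden = [row[:] for row in cost]
            best_score = [[float("-inf")] * len(row) for row in cost]
            best_depth = [[None] * len(row) for row in cost]
        else:
            # Recurrent update: mix the previous state with the new cost map.
            hidden = [[gate * h + (1.0 - gate) * c for h, c in zip(hrow, crow)]
                      for hrow, crow in zip(hidden, cost)]
        # Running per-pixel arg-max over the regularized scores seen so far.
        for y, row in enumerate(hidden):
            for x, score in enumerate(row):
                if score > best_score[y][x]:
                    best_score[y][x] = score
                    best_depth[y][x] = depth
    return best_depth


def toy_cost_maps():
    """Yield one 1 x 2 score map per depth, so no full volume is ever stored."""
    yield [[0.9, 0.1]]
    yield [[0.1, 0.2]]
    yield [[0.0, 0.9]]


depth_map = sequential_depth_inference(toy_cost_maps(), [1.0, 2.0, 3.0])
```

Because only `hidden`, `best_score`, and `best_depth` persist between iterations, memory use in this sketch is independent of the number of depth hypotheses.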
