Jia-Wang Bian

PoRF: Pose Residual Field for Accurate Neural Surface Reconstruction

Oct 12, 2023
Jia-Wang Bian, Wenjing Bian, Victor Adrian Prisacariu, Philip Torr

Neural surface reconstruction is sensitive to camera pose noise, even when state-of-the-art pose estimators such as COLMAP or ARKit are used. Moreover, existing pose-NeRF joint optimisation methods have struggled to improve pose accuracy in challenging real-world scenarios. To overcome these challenges, we introduce the pose residual field (PoRF), a novel implicit representation that uses an MLP to regress pose updates. This is more robust than conventional pose parameter optimisation because parameter sharing leverages global information over the entire sequence. Furthermore, we propose an epipolar geometry loss that enhances supervision by leveraging the correspondences exported from COLMAP, without extra computational overhead. Our method yields promising results. On the DTU dataset, we reduce the rotation error of COLMAP poses by 78%, lowering the reconstruction Chamfer distance from 3.48 mm to 0.85 mm. On the MobileBrick dataset, which contains casually captured unbounded 360-degree videos, our method refines ARKit poses and improves the reconstruction F1 score from 69.18 to 75.67, outperforming the result obtained with the dataset-provided ground-truth poses (75.14). These achievements demonstrate the efficacy of our approach in refining camera poses and improving the accuracy of neural surface reconstruction in real-world scenarios.
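
To make the parameter-sharing idea concrete, here is a minimal sketch in PyTorch of a pose residual field: one MLP, shared across all frames, maps a per-frame embedding to a 6-DoF pose update that is composed with the initial COLMAP/ARKit pose. The embedding size, hidden widths, and the axis-angle output parameterisation are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PoseResidualField(nn.Module):
    """Shared MLP that maps a per-frame embedding to a 6-DoF pose update."""
    def __init__(self, num_frames: int, embed_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(num_frames, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),  # 3 axis-angle rotation + 3 translation components
        )
        # Zero-initialise the output layer so optimisation starts at the initial poses.
        nn.init.zeros_(self.mlp[-1].weight)
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, frame_ids: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.embed(frame_ids))  # (B, 6) residuals to compose with poses

porf = PoseResidualField(num_frames=100)
delta = porf(torch.arange(5))  # pose updates for frames 0..4
print(delta.shape)             # torch.Size([5, 6])
```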

* Under review 

MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices

Mar 09, 2023
Kejie Li, Jia-Wang Bian, Robert Castle, Philip H. S. Torr, Victor Adrian Prisacariu

High-quality 3D ground-truth shapes are critical for evaluating 3D object reconstruction. However, it is difficult to create an exact replica of a real object, and even 3D reconstructions produced by 3D scanners contain artefacts that bias evaluation. To address this issue, we introduce a novel multi-view RGBD dataset captured using a mobile device, which includes highly precise 3D ground-truth annotations for 153 object models featuring a diverse set of 3D structures. We obtain precise 3D ground-truth shapes without relying on high-end 3D scanners by using LEGO models with known geometry as the objects for image capture. The distinct data modality offered by high-resolution RGB images and low-resolution depth maps captured on a mobile device, combined with precise 3D geometry annotations, presents a unique opportunity for future research on high-fidelity 3D reconstruction. Furthermore, we evaluate a range of 3D reconstruction algorithms on the proposed dataset. Project page: http://code.active.vision/MobileBrick/
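
For context on how reconstructions are typically scored against such ground truth, the sketch below computes the standard reconstruction F1 metric: precision and recall of sampled surface points at a distance threshold. It is a generic implementation of the metric, not MobileBrick's official evaluation script, whose sampling strategy and thresholds may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def reconstruction_f1(pred_pts, gt_pts, tau=0.0025):
    """pred_pts: (N, 3), gt_pts: (M, 3) surface samples; tau: distance threshold (metres)."""
    d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]  # nearest-GT distance per predicted point
    d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]  # nearest-prediction distance per GT point
    precision = float((d_pred_to_gt < tau).mean())
    recall = float((d_gt_to_pred < tau).mean())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example with random point clouds standing in for sampled meshes.
print(reconstruction_f1(np.random.rand(2000, 3), np.random.rand(2000, 3), tau=0.05))
```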

* To appear at CVPR 2023

NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior

Dec 14, 2022
Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, Victor Adrian Prisacariu

Training a Neural Radiance Field (NeRF) without pre-computed camera poses is challenging. Recent advances in this direction demonstrate the possibility of jointly optimising a NeRF and camera poses in forward-facing scenes. However, these methods still struggle under dramatic camera movement. We tackle this challenging problem by incorporating undistorted monocular depth priors. These priors are generated by correcting their scale and shift parameters during training, which then allows us to constrain the relative poses between consecutive frames. This constraint is enforced by our proposed novel loss functions. Experiments on real-world indoor and outdoor scenes show that our method can handle challenging camera trajectories and outperforms existing methods in terms of novel view rendering quality and pose estimation accuracy.
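
The scale-and-shift correction of the monocular depth priors can be sketched as a pair of per-frame learnable parameters optimised jointly with the NeRF. The sketch below is illustrative only: the parameter names and the L1 depth term are assumptions, not the paper's exact loss formulation.

```python
import torch
import torch.nn as nn

class DepthUndistortion(nn.Module):
    """Per-frame learnable scale/shift that 'undistorts' monocular depth priors."""
    def __init__(self, num_frames: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_frames))
        self.shift = nn.Parameter(torch.zeros(num_frames))

    def forward(self, frame_id: int, mono_depth: torch.Tensor) -> torch.Tensor:
        return self.scale[frame_id] * mono_depth + self.shift[frame_id]

undistort = DepthUndistortion(num_frames=60)
mono_depth = torch.rand(480, 640)      # prior from a pretrained monocular depth network
rendered_depth = torch.rand(480, 640)  # depth rendered by the NeRF being optimised
loss = (undistort(0, mono_depth) - rendered_depth).abs().mean()  # illustrative L1 term
loss.backward()
```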

Deep Negative Correlation Classification

Dec 14, 2022
Le Zhang, Qibin Hou, Yun Liu, Jia-Wang Bian, Xun Xu, Joey Tianyi Zhou, Ce Zhu

Ensemble learning is a straightforward way to improve the performance of almost any machine learning algorithm. Existing deep ensemble methods usually naively train many different models and then aggregate their predictions. In our view, this is suboptimal in two respects: i) naively training multiple models adds considerable computational burden, especially in the deep learning era; ii) optimising each base model on its own, without considering interactions, limits the diversity of the ensemble and the performance gains. We tackle these issues by proposing deep negative correlation classification (DNCC), in which the accuracy-diversity trade-off is systematically controlled by seamlessly decomposing the loss function into individual accuracy terms and the correlation between individual models and the ensemble. DNCC yields a deep classification ensemble in which each individual estimator is both accurate and negatively correlated with the others. Thanks to the optimised diversity, DNCC works well even with a shared network backbone, which significantly improves its efficiency compared with most existing ensemble systems. Extensive experiments on multiple benchmark datasets and network structures demonstrate the superiority of the proposed method.
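
For intuition, the sketch below shows one common way to split an ensemble classification loss into a per-head accuracy term and a diversity term measured against the ensemble mean, in the spirit of negative correlation learning. The exact DNCC decomposition and weighting are not reproduced here; this is a generic stand-in.

```python
import torch
import torch.nn.functional as F

def ncl_classification_loss(logits_list, target, lam: float = 0.5):
    """logits_list: list of per-head logits [(B, C), ...]; target: (B,) class indices."""
    probs = torch.stack([F.softmax(l, dim=-1) for l in logits_list])  # (M, B, C)
    ensemble = probs.mean(dim=0)                                      # (B, C)
    # Accuracy term: average cross-entropy of the individual heads.
    acc_term = sum(F.cross_entropy(l, target) for l in logits_list) / len(logits_list)
    # Diversity term: reward each head's deviation from the ensemble mean
    # (the classic negative-correlation-learning penalty, sign-flipped).
    div_term = ((probs - ensemble.unsqueeze(0)) ** 2).sum(dim=-1).mean()
    return acc_term - lam * div_term

# Three heads sharing a backbone; random logits stand in for head outputs.
heads = [torch.randn(8, 10, requires_grad=True) for _ in range(3)]
loss = ncl_classification_loss(heads, torch.randint(0, 10, (8,)))
loss.backward()
```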

SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes

Nov 07, 2022
Libo Sun, Jia-Wang Bian, Huangying Zhan, Wei Yin, Ian Reid, Chunhua Shen

Self-supervised monocular depth estimation has shown impressive results in static scenes. It relies on the multi-view consistency assumption to train networks; however, this assumption is violated in dynamic object regions and occlusions. Consequently, existing methods show poor accuracy in dynamic scenes, and the estimated depth maps are blurred at object boundaries because these regions are usually occluded in other training views. In this paper, we propose SC-DepthV3 to address these challenges. Specifically, we introduce an external pretrained monocular depth estimation model to generate a single-image depth prior, termed pseudo-depth, based on which we propose novel losses to boost self-supervised training. As a result, our model can predict sharp and accurate depth maps, even when trained on monocular videos of highly dynamic scenes. We demonstrate the significantly superior performance of our method over previous methods on six challenging datasets, and we provide detailed ablation studies for the proposed terms. Source code and data will be released at https://github.com/JiawangBian/sc_depth_pl
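
As an illustration of how a pseudo-depth prior can regularise self-supervised training, the sketch below uses a generic pairwise ranking loss that keeps the predicted depth consistent with the ordering given by the pseudo-depth. It is a stand-in under that assumption, not necessarily one of the losses proposed in the paper.

```python
import torch

def pseudo_depth_ranking_loss(pred, pseudo, num_pairs: int = 4096, margin: float = 0.0):
    """pred, pseudo: (H, W) depth maps; sample pixel pairs and encourage the
    predicted depth to preserve the ordering given by the pseudo-depth."""
    h, w = pred.shape
    idx_a = torch.randint(0, h * w, (num_pairs,))
    idx_b = torch.randint(0, h * w, (num_pairs,))
    pa, pb = pred.reshape(-1)[idx_a], pred.reshape(-1)[idx_b]
    qa, qb = pseudo.reshape(-1)[idx_a], pseudo.reshape(-1)[idx_b]
    sign = torch.sign(qa - qb)  # ordering given by the pseudo-depth prior
    return torch.relu(margin - sign * (pa - pb)).mean()

pred = torch.rand(192, 640, requires_grad=True)
pseudo = torch.rand(192, 640)  # from an off-the-shelf monocular depth model
loss = pseudo_depth_ranking_loss(pred, pseudo)
loss.backward()
```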

* Under review; the code will be available at https://github.com/JiawangBian/sc_depth_pl

Unsupervised Scale-consistent Depth Learning from Video

May 25, 2021
Jia-Wang Bian, Huangying Zhan, Naiyan Wang, Zhichao Li, Le Zhang, Chunhua Shen, Ming-Ming Cheng, Ian Reid

We propose SC-Depth, a monocular depth estimator that requires only unlabelled videos for training and enables scale-consistent prediction at inference time. Our contributions include: (i) we propose a geometry consistency loss, which penalizes the inconsistency of predicted depths between adjacent views; (ii) we propose a self-discovered mask to automatically localize moving objects, which violate the underlying static-scene assumption and cause noisy signals during training; (iii) we demonstrate the efficacy of each component with a detailed ablation study and show high-quality depth estimation results on both the KITTI and NYUv2 datasets. Moreover, thanks to the capability of scale-consistent prediction, our monocular-trained deep networks can be readily integrated into the ORB-SLAM2 system for more robust and accurate tracking. The proposed hybrid Pseudo-RGBD SLAM shows compelling results on KITTI and generalizes well to the KAIST dataset without additional training. Finally, we provide several demos for qualitative evaluation.
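
The geometry consistency term and the self-discovered mask can be sketched as below, assuming the cross-view warping has already produced (i) the depth of view A's pixels projected into view B and (ii) view B's predicted depth sampled at the corresponding locations. The warping step and the exact loss weighting are omitted.

```python
import torch

def geometry_consistency(projected_depth, sampled_depth, eps: float = 1e-7):
    """projected_depth: depth of view A's pixels projected into view B;
    sampled_depth: view B's predicted depth sampled at the corresponding pixels.
    Both: (B, 1, H, W). Returns the consistency loss and a per-pixel weight mask."""
    diff = (projected_depth - sampled_depth).abs() / (projected_depth + sampled_depth + eps)
    mask = (1.0 - diff).detach()  # self-discovered mask: down-weights moving objects/occlusions
    return diff.mean(), mask

proj = torch.rand(2, 1, 128, 416) + 0.1  # stand-ins for the two cross-view depth maps
samp = torch.rand(2, 1, 128, 416) + 0.1
loss_gc, weight_mask = geometry_consistency(proj, samp)
```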

* Accepted to IJCV. The source code is available at https://github.com/JiawangBian/SC-SfMLearner-Release

DF-VO: What Should Be Learnt for Visual Odometry?

Mar 01, 2021
Huangying Zhan, Chamara Saroj Weerasekera, Jia-Wang Bian, Ravi Garg, Ian Reid

Multi-view geometry-based methods have dominated monocular visual odometry for decades owing to their superior performance, but they are vulnerable to dynamic and low-texture scenes. More importantly, monocular methods suffer from the scale-drift issue, i.e., errors accumulate over time. Recent studies show that deep neural networks can learn scene depth and relative camera poses in a self-supervised manner without ground-truth labels. More surprisingly, well-trained networks enable scale-consistent predictions over long videos, although their accuracy is still inferior to that of traditional methods because geometric information is ignored. Building on this recent progress in computer vision, we design a simple yet robust VO system, DF-VO, by integrating multi-view geometry with deep learning on Depth and optical Flow. In this work, a) we propose a method to carefully sample high-quality correspondences from deep flows and recover accurate camera poses with a geometric module; b) we address the scale-drift issue by aligning geometrically triangulated depths to the scale-consistent deep depths, taking dynamic scenes into account. Comprehensive ablation studies show the effectiveness of the proposed method, and extensive evaluation results show the state-of-the-art performance of our system, e.g., ours (1.652%) vs. ORB-SLAM (3.247%) in terms of translation error on the KITTI odometry benchmark. Source code is publicly available at https://github.com/Huangying-Zhan/DF-VO
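
The geometric module described above, recovering relative pose from flow-derived correspondences, can be sketched with OpenCV's essential-matrix solver. The correspondence filtering and the synthetic example below are simplified stand-ins for the paper's sampling heuristics.

```python
import cv2
import numpy as np

def pose_from_flow(pts1, pts2, K):
    """pts1, pts2: (N, 2) matched pixels from a dense flow network; K: (3, 3) intrinsics."""
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=0.5)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t  # t is up to scale; DF-VO resolves the scale with deep depths

# Toy correspondences: a uniform shift plus noise stands in for real flow output.
K = np.array([[718.856, 0.0, 607.19], [0.0, 718.856, 185.21], [0.0, 0.0, 1.0]])
pts1 = np.random.rand(200, 2) * np.array([1241.0, 376.0])
pts2 = pts1 + np.array([5.0, 0.0]) + 0.2 * np.random.randn(200, 2)
R, t = pose_from_flow(pts1, pts2, K)
print(R.shape, t.ravel())
```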

* Extended version of the ICRA 2020 paper (Visual Odometry Revisited: What Should Be Learnt?)

MobileSal: Extremely Efficient RGB-D Salient Object Detection

Dec 24, 2020
Yu-Huan Wu, Yun Liu, Jun Xu, Jia-Wang Bian, Yuchao Gu, Ming-Ming Cheng

The high computational cost of neural networks has prevented recent successes in RGB-D salient object detection (SOD) from benefiting real-world applications. Hence, this paper introduces a novel network, MobileSal, which focuses on efficient RGB-D SOD by using mobile networks for deep feature extraction. The problem is that mobile networks are less powerful at feature representation than cumbersome networks. To this end, we observe that the depth information accompanying color images can strengthen the feature representation for SOD if leveraged properly. Therefore, we propose an implicit depth restoration (IDR) technique to strengthen the feature representation capability of mobile networks for RGB-D SOD. IDR is adopted only in the training phase and is omitted during testing, so it is computationally free at inference. Besides, we propose compact pyramid refinement (CPR) for efficient multi-level feature aggregation, so that we can derive salient objects with clear boundaries. With IDR and CPR incorporated, MobileSal performs favorably against state-of-the-art methods on seven challenging RGB-D SOD datasets with much faster speed (450 fps) and fewer parameters (6.5M). The code will be released.
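
The IDR idea, an auxiliary depth head supervised during training and simply skipped at test time, can be sketched as below. The backbone and layer sizes are placeholder assumptions for illustration, not MobileSal's actual architecture.

```python
import torch
import torch.nn as nn

class SaliencyWithIDR(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU())
        self.sal_head = nn.Conv2d(feat_dim, 1, 1)  # saliency prediction (always used)
        self.idr_head = nn.Conv2d(feat_dim, 1, 1)  # depth restoration (training only)

    def forward(self, rgb, train_mode: bool = True):
        feat = self.backbone(rgb)
        sal = self.sal_head(feat)
        depth = self.idr_head(feat) if train_mode else None  # free at test time
        return sal, depth

model = SaliencyWithIDR()
rgb = torch.rand(1, 3, 320, 320)
sal, depth = model(rgb, train_mode=True)   # depth supervised by the sensor depth map
sal, _ = model(rgb, train_mode=False)      # inference: IDR head skipped
```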

Diverse Knowledge Distillation for End-to-End Person Search

Dec 21, 2020
Xinyu Zhang, Xinlong Wang, Jia-Wang Bian, Chunhua Shen, Mingyu You

Person search aims to localize and identify a specific person in a gallery of images. Recent methods can be categorized into two groups, i.e., two-step and end-to-end approaches. The former treats person search as two independent tasks and achieves dominant results using separately trained person detection and re-identification (Re-ID) models. The latter performs person search in an end-to-end fashion. Although end-to-end approaches offer higher inference efficiency, they still lag far behind their two-step counterparts in terms of accuracy. In this paper, we argue that this gap is mainly caused by the Re-ID sub-networks of end-to-end methods. To this end, we propose a simple yet strong end-to-end network with diverse knowledge distillation to break the bottleneck. We also design a spatial-invariant augmentation to help the model be invariant to inaccurate detection results. Experimental results on the CUHK-SYSU and PRW datasets demonstrate the superiority of our method over existing approaches: it achieves accuracy on par with state-of-the-art two-step methods while maintaining high efficiency thanks to the single joint model. Code is available at: https://git.io/DKD-PersonSearch.
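
As a rough illustration of distilling Re-ID knowledge into the end-to-end model, the sketch below uses a generic feature-level (cosine) distillation term between student and teacher embeddings. The paper's diverse distillation scheme is more involved and is not reproduced here; names and dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def reid_distillation_loss(student_emb, teacher_emb):
    """student_emb, teacher_emb: (N, D) embeddings of the same detected persons."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb.detach(), dim=-1)  # teacher is frozen
    return (1.0 - (s * t).sum(dim=-1)).mean()      # cosine-distance distillation

student = torch.randn(16, 256, requires_grad=True)  # end-to-end model's Re-ID features
teacher = torch.randn(16, 256)                      # separately trained Re-ID model
loss = reid_distillation_loss(student, teacher)
loss.backward()
```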

* Accepted to AAAI 2021. Code is available at: https://git.io/DKD-PersonSearch