Pan Ji

RGB-based Category-level Object Pose Estimation via Decoupled Metric Scale Recovery

Sep 19, 2023
Jiaxin Wei, Xibin Song, Weizhe Liu, Laurent Kneip, Hongdong Li, Pan Ji

While showing promising results, recent RGB-D camera-based category-level object pose estimation methods have restricted applications due to their heavy reliance on depth sensors. RGB-only methods provide an alternative to this problem yet suffer from the inherent scale ambiguity of monocular observations. In this paper, we propose a novel pipeline that decouples the 6D pose and size estimation to mitigate the influence of imperfect scales on rigid transformations. Specifically, we leverage a pre-trained monocular estimator to extract local geometric information, mainly facilitating the search for inlier 2D-3D correspondences. Meanwhile, a separate branch is designed to directly recover the metric scale of the object based on category-level statistics. Finally, we advocate using the RANSAC-P$n$P algorithm to robustly solve for the 6D object pose. Extensive experiments have been conducted on both synthetic and real datasets, demonstrating the superior performance of our method over previous state-of-the-art RGB-based approaches, especially in terms of rotation accuracy.
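
A minimal sketch of the final pose-solving stage, assuming the 2D keypoints, scale-normalized 3D object coordinates, and the metric scale from the separate branch have already been predicted (the function name and inputs are illustrative, not the paper's interface):

import numpy as np
import cv2

def solve_pose_decoupled(pts2d, pts3d_nocs, metric_scale, K):
    """Recover a 6D pose from 2D-3D correspondences via RANSAC-PnP.

    pts2d:        (N, 2) pixel coordinates of detected keypoints.
    pts3d_nocs:   (N, 3) scale-normalized object coordinates.
    metric_scale: scalar from the separate scale branch; applied before
                  PnP so the translation is recovered in metric units.
    K:            (3, 3) camera intrinsics.
    """
    pts3d_metric = pts3d_nocs * metric_scale  # decoupled scale recovery
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d_metric.astype(np.float64), pts2d.astype(np.float64),
        K, distCoeffs=None, reprojectionError=3.0,
        iterationsCount=200, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return R, tvec.ravel(), inliers

Scaling the 3D coordinates before PnP confines scale errors to the translation: the rotation estimate is unaffected by a wrong metric scale.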

Dynamic Voxel Grid Optimization for High-Fidelity RGB-D Supervised Surface Reconstruction

Apr 12, 2023
Xiangyu Xu, Lichang Chen, Changjiang Cai, Huangying Zhan, Qingan Yan, Pan Ji, Junsong Yuan, Heng Huang, Yi Xu

Direct optimization of interpolated features on multi-resolution voxel grids has emerged as a more efficient alternative to MLP-like modules. However, this approach is constrained by higher memory expenses and limited representation capabilities. In this paper, we introduce a novel dynamic grid optimization method for high-fidelity 3D surface reconstruction that incorporates both RGB and depth observations. Rather than treating each voxel equally, we optimize the process by dynamically modifying the grid and assigning finer-scale voxels to regions of higher complexity, allowing us to capture more intricate details. Furthermore, we develop a scheme to quantify the dynamic subdivision of the voxel grid during optimization without requiring any priors. The proposed approach is able to generate high-quality 3D reconstructions with fine details on both synthetic and real-world data, while maintaining computational efficiency; it is substantially faster than the baseline method, NeuralRGBD.

* For the project, see https://yanqingan.github.io/ 
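
The complexity-driven subdivision can be pictured with a short octree-style sketch; the voxel representation and the complexity score below are hypothetical stand-ins for the paper's actual criterion:

import numpy as np

def subdivide_grid(voxels, complexity, threshold, max_level=8):
    """Octree-style refinement: split only voxels whose local complexity
    (e.g. color or SDF gradient magnitude) exceeds a threshold.

    voxels:     list of (center (3,), size, level) tuples.
    complexity: callable mapping a voxel (center, size) to a scalar score.
    """
    refined = []
    for center, size, level in voxels:
        if level < max_level and complexity(center, size) > threshold:
            child_size = size / 2.0
            # eight children centered at the parent's octants
            for dx in (-0.25, 0.25):
                for dy in (-0.25, 0.25):
                    for dz in (-0.25, 0.25):
                        offset = np.array([dx, dy, dz]) * size
                        refined.append((center + offset, child_size, level + 1))
        else:
            refined.append((center, size, level))
    return refined

Calling this between optimization epochs concentrates the parameter budget in detailed regions while flat regions keep coarse voxels, which is the memory/fidelity trade-off the method targets.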

CLIP-FLow: Contrastive Learning by semi-supervised Iterative Pseudo labeling for Optical Flow Estimation

Nov 01, 2022
Zhiqi Zhang, Pan Ji, Nitin Bansal, Changjiang Cai, Qingan Yan, Xiangyu Xu, Yi Xu

Synthetic datasets are often used to pretrain end-to-end optical flow networks, due to the lack of a large amount of labeled, real-scene data. But major drops in accuracy occur when moving from synthetic to real scenes. How do we better transfer the knowledge learned from synthetic to real domains? To this end, we propose CLIP-FLow, a semi-supervised iterative pseudo-labeling framework to transfer the pretraining knowledge to the target real domain. We leverage large-scale, unlabeled real data to facilitate transfer learning with the supervision of iteratively updated pseudo-ground-truth labels, bridging the domain gap between the synthetic and the real. In addition, we propose a contrastive flow loss on reference features and the features warped by pseudo-ground-truth flows, to further encourage accurate matching and suppress mismatches caused by motion, occlusion, or noisy pseudo labels. We adopt RAFT as the backbone and obtain an F1-all error of 4.11%, i.e., a 19% error reduction from RAFT (5.10%), ranking 2$^{nd}$ at the time of submission on the KITTI 2015 benchmark. Our framework can also be extended to other models, e.g., CRAFT, reducing the F1-all error from 4.79% to 4.66% on the KITTI 2015 benchmark.
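
One plausible form of the contrastive flow loss, written in PyTorch: reference features are pulled toward the target features warped back by the pseudo-ground-truth flow and pushed away from all other locations. The InfoNCE formulation and function names are assumptions, not the paper's exact loss:

import torch
import torch.nn.functional as F

def warp_by_flow(feat, flow):
    # Backward-warp target-frame features (B, C, H, W) to the reference
    # frame using a pseudo-ground-truth flow field (B, 2, H, W).
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), 0).float().to(feat.device)    # (2, H, W)
    coords = grid.unsqueeze(0) + flow                          # sample positions
    cx = 2.0 * coords[:, 0] / (w - 1) - 1.0                    # normalize to [-1, 1]
    cy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(feat, torch.stack((cx, cy), -1), align_corners=True)

def contrastive_flow_loss(ref_feat, tgt_feat, pseudo_flow, temperature=0.07):
    # Each reference feature should match the feature warped back by the
    # pseudo label (positive) rather than any other location (negatives).
    warped = warp_by_flow(tgt_feat, pseudo_flow)
    b, c, h, w = ref_feat.shape
    q = F.normalize(ref_feat.flatten(2), dim=1)                # (B, C, HW)
    k = F.normalize(warped.flatten(2), dim=1)
    logits = torch.einsum("bcn,bcm->bnm", q, k) / temperature  # (B, HW, HW)
    labels = torch.arange(h * w, device=q.device).expand(b, -1)
    return F.cross_entropy(logits.transpose(1, 2), labels)

The HW x HW similarity matrix is only tractable on downsampled feature maps, which is where such losses are typically applied.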

MonoIndoor++: Towards Better Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

Jul 18, 2022
Runze Li, Pan Ji, Yi Xu, Bir Bhanu

Self-supervised monocular depth estimation has seen significant progress in recent years, especially in outdoor environments. However, depth prediction results are not satisfactory in indoor scenes, where most of the existing data are captured with hand-held devices. Compared to outdoor environments, self-supervised depth estimation from monocular indoor videos poses two additional challenges: (i) the depth range of indoor video sequences varies considerably across frames, making it difficult for the depth network to induce consistent depth cues for training; (ii) indoor sequences recorded with handheld devices often contain much more rotational motion, which makes it difficult for the pose network to predict accurate relative camera poses. In this work, we propose a novel framework, MonoIndoor++, that gives special consideration to these challenges and consolidates a set of good practices for improving the performance of self-supervised monocular depth estimation in indoor environments. First, a depth factorization module with a transformer-based scale regression network is proposed to explicitly estimate a global depth scale factor; the predicted scale factor indicates the maximum depth value. Second, rather than using a single-stage pose estimation strategy as in previous methods, we utilize a residual pose estimation module to iteratively estimate relative camera poses across consecutive frames. Third, to give our residual pose estimation module extensive coordinate guidance, we perform coordinate convolutional encoding directly over the inputs to the pose networks. The proposed method is validated on a variety of benchmark indoor datasets, i.e., EuRoC MAV, NYUv2, ScanNet, and 7-Scenes, demonstrating state-of-the-art performance.

* Journal version of "MonoIndoor: Towards Good Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments" (ICCV 2021). arXiv admin note: substantial text overlap with arXiv:2107.12429
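
A toy PyTorch sketch of the depth factorization idea: metric depth is modeled as a global scale factor times a normalized relative-depth map. The convolutional and pooling heads below are illustrative stand-ins for the paper's transformer-based scale regression network:

import torch.nn as nn

class DepthFactorization(nn.Module):
    """Factorize depth as (global scale) x (relative depth in (0, 1)),
    so the network only has to predict scale-consistent relative depth."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.rel_depth_head = nn.Sequential(
            nn.Conv2d(feat_ch, 1, 3, padding=1), nn.Sigmoid())
        self.scale_head = nn.Sequential(           # one positive scalar per image
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_ch, 1), nn.Softplus())

    def forward(self, feats):                      # feats: (B, feat_ch, H, W)
        rel = self.rel_depth_head(feats)           # (B, 1, H, W), in (0, 1)
        scale = self.scale_head(feats).view(-1, 1, 1, 1)
        return scale * rel                         # factorized metric depth

Because the relative depth is bounded in (0, 1), the predicted scale factor directly corresponds to the maximum depth of the scene, matching the abstract's description.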

Semantics-Depth-Symbiosis: Deeply Coupled Semi-Supervised Learning of Semantics and Depth

Jun 21, 2022
Nitin Bansal, Pan Ji, Junsong Yuan, Yi Xu

The multi-task learning (MTL) paradigm focuses on jointly learning two or more tasks, aiming for significant improvements in a model's generalizability, performance, and training/inference memory footprint. These benefits become all the more indispensable in the case of joint training for vision-related dense prediction tasks. In this work, we tackle the MTL problem of two dense tasks, i.e., semantic segmentation and depth estimation, and present a novel attention module called the Cross-Channel Attention Module (CCAM), which facilitates effective feature sharing along each channel between the two tasks, leading to mutual performance gains with a negligible increase in trainable parameters. In a true symbiotic spirit, we then formulate a novel data augmentation for the semantic segmentation task using predicted depth, called AffineMix, and a simple depth augmentation using predicted semantics, called ColorAug. Finally, we validate the performance gain of the proposed method on the Cityscapes dataset, achieving state-of-the-art results for a semi-supervised joint model based on depth and semantic segmentation.
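
A hypothetical sketch of what a cross-channel attention block could look like in PyTorch: each task computes SE-style channel gates that re-weight the other task's features, so sharing happens per channel. The exact design of CCAM may differ:

import torch.nn as nn

class CrossChannelAttention(nn.Module):
    """Each branch gates the other's channels: segmentation features decide
    which depth channels to amplify, and vice versa."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        def gate():
            return nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.seg_to_depth = gate()   # gates computed from segmentation features
        self.depth_to_seg = gate()   # gates computed from depth features

    def forward(self, f_seg, f_depth):             # both (B, C, H, W)
        b, c = f_seg.shape[:2]
        g_sd = self.seg_to_depth(f_seg).view(b, c, 1, 1)
        g_ds = self.depth_to_seg(f_depth).view(b, c, 1, 1)
        # residual gating keeps each task's own signal intact
        return f_seg + g_ds * f_seg, f_depth + g_sd * f_depth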

Strengthening Skeletal Action Recognizers via Leveraging Temporal Patterns

Jun 04, 2022
Zhenyue Qin, Pan Ji, Dongwoo Kim, Yang Liu, Saeed Anwar, Tom Gedeon

Skeleton sequences are compact and lightweight, and numerous skeleton-based action recognizers have been proposed to classify human behaviors. In this work, we aim to incorporate components that are compatible with existing models and further improve their accuracy. To this end, we design two temporal accessories: discrete cosine encoding (DCE) and chronological loss (CRL). DCE facilitates models to analyze motion patterns from the frequency domain and meanwhile alleviates the influence of signal noise. CRL guides networks to explicitly capture the sequence's chronological order. These two components consistently endow many recently proposed action recognizers with accuracy boosts, achieving new state-of-the-art (SOTA) accuracy on two large benchmark datasets (NTU60 and NTU120).
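
The discrete cosine encoding idea can be illustrated in a few lines: project each joint trajectory onto DCT bases along the time axis and drop high-frequency coefficients, which suppresses jitter-like noise. A minimal sketch, not the authors' implementation:

import numpy as np
from scipy.fft import dct, idct

def discrete_cosine_encode(skeleton, keep=16):
    """skeleton: (T, J, C) array of T frames, J joints, C coordinates.
    Returns the kept low-frequency coefficients and a denoised sequence."""
    coeffs = dct(skeleton, axis=0, norm="ortho")  # transform along time
    coeffs[keep:] = 0.0                           # low-pass: drop noisy bands
    denoised = idct(coeffs, axis=0, norm="ortho")
    return coeffs[:keep], denoised

Feeding the frequency coefficients alongside (or instead of) raw coordinates gives a recognizer an explicit view of periodic motion patterns.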

RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo

May 28, 2022
Changjiang Cai, Pan Ji, Yi Xu

In this paper, we present a learning-based approach for multi-view stereo (MVS), i.e., estimating the depth map of a reference frame from posed multi-view images. Our core idea lies in leveraging a "learning-to-optimize" paradigm to iteratively index a plane-sweeping cost volume and regress the depth map via a convolutional Gated Recurrent Unit (GRU). Since the cost volume plays a paramount role in encoding the multi-view geometry, we aim to improve its construction at both the pixel and frame levels. At the pixel level, we propose to break the symmetry of the Siamese network (which is typically used in MVS to extract image features) by introducing a transformer block for the reference image (but not for the source images). Such an asymmetric volume allows the network to extract global features from the reference image to predict its depth map. To account for inaccurate poses between the reference and source images, we incorporate a residual pose network that corrects the relative poses, which essentially rectifies the cost volume at the frame level. We conduct extensive experiments on real-world MVS datasets and show that our method achieves state-of-the-art performance in terms of both within-dataset evaluation and cross-dataset generalization.
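
The recurrent-indexing step can be sketched as a RAFT-style lookup: sample the plane-sweep cost volume in a window around the current per-pixel depth hypothesis and hand the slice to a conv-GRU that regresses an index update. The lookup below is an assumption about the mechanism, not the paper's code:

import torch

def lookup_cost(cost_volume, depth_index, radius=4):
    """cost_volume: (B, D, H, W) matching costs over D depth planes.
    depth_index:  (B, 1, H, W) current continuous plane index per pixel.
    Returns (B, 2*radius+1, H, W): costs at integer offsets around the
    hypothesis, the input a conv-GRU would use to predict an update."""
    b, d, h, w = cost_volume.shape
    offsets = torch.arange(-radius, radius + 1, device=cost_volume.device)
    idx = depth_index.round().long() + offsets.view(1, -1, 1, 1)
    idx = idx.clamp(0, d - 1)                     # stay inside the volume
    return torch.gather(cost_volume, 1, idx)

Iterating "lookup -> GRU -> index update" lets the network behave like an optimizer over the cost volume rather than a one-shot regressor.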

ANUBIS: Skeleton Action Recognition Dataset, Review, and Benchmark

May 08, 2022
Zhenyue Qin, Yang Liu, Madhawa Perera, Tom Gedeon, Pan Ji, Dongwoo Kim, Saeed Anwar

Skeleton-based action recognition, as a subarea of action recognition, is rapidly gaining attention and popularity. The task is to recognize human actions from the motion of body articulation points. Compared with other data modalities, 3D human skeleton representations have many unique and desirable characteristics, including succinctness, robustness, and racial impartiality. We aim to provide a roadmap for new and existing researchers on the landscape of skeleton-based action recognition. To this end, we present a review in the form of a taxonomy of existing works on skeleton-based action recognition, partitioning them into four major categories: (1) datasets; (2) extracting spatial features; (3) capturing temporal patterns; (4) improving signal quality. For each method, we provide a concise yet informative description. To promote fairer and more comprehensive evaluation of existing approaches to skeleton-based action recognition, we collect ANUBIS, a large-scale human skeleton dataset. Compared with previously collected datasets, ANUBIS is advantageous in four aspects: (1) it employs more recently released sensors; (2) it contains a novel back view; (3) it encourages high subject enthusiasm; (4) it includes actions of the COVID pandemic era. Using ANUBIS, we comparatively benchmark the performance of current skeleton-based action recognizers. At the end of this paper, we look ahead to the future development of skeleton-based action recognition by listing several open technical problems, which we believe are valuable to solve in order to commercialize skeleton-based action recognition in the near future. The ANUBIS dataset is available at: http://hcc-workshop.anu.edu.au/webs/anu101/home.

CNN-Augmented Visual-Inertial SLAM with Planar Constraints

May 05, 2022
Pan Ji, Yuan Tian, Qingan Yan, Yuxin Ma, Yi Xu

We present a robust visual-inertial SLAM system that combines the benefits of Convolutional Neural Networks (CNNs) and planar constraints. Our system leverages a CNN to predict the depth map and the corresponding uncertainty map for each image. The CNN depth effectively bootstraps the back-end optimization of SLAM, while the CNN uncertainty adaptively weights the contribution of each feature point to the back-end optimization. Given the gravity direction from the inertial sensor, we further present a fast plane detection method that detects horizontal planes via one-point RANSAC and vertical planes via two-point RANSAC. These stably detected planes are in turn used to regularize the back-end optimization of SLAM. We evaluate our system on a public dataset, i.e., EuRoC, and demonstrate improved results over a state-of-the-art SLAM system, i.e., ORB-SLAM3.
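
The gravity-aided plane detection is simple enough to sketch directly: for horizontal planes the normal is fixed by gravity, so a single sampled point fully determines a plane hypothesis. A minimal NumPy version with hypothetical threshold values:

import numpy as np

def detect_horizontal_plane(points, gravity, inlier_thresh=0.02, iters=100):
    """One-point RANSAC: points (N, 3), gravity (3,) from the IMU."""
    rng = np.random.default_rng()
    n = gravity / np.linalg.norm(gravity)         # plane normal is known
    best = np.zeros(len(points), dtype=bool)
    for _ in range(iters):
        p = points[rng.integers(len(points))]     # one point per hypothesis
        d = -n @ p                                # plane: n.x + d = 0
        inliers = np.abs(points @ n + d) < inlier_thresh
        if inliers.sum() > best.sum():
            best = inliers
    d = -np.mean(points[best] @ n)                # refit offset on inliers
    return n, d, best

Vertical planes leave one rotational degree of freedom for the normal plus the offset, which is why they need two sampled points instead of one.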

FisheyeDistill: Self-Supervised Monocular Depth Estimation with Ordinal Distillation for Fisheye Cameras

May 05, 2022
Qingan Yan, Pan Ji, Nitin Bansal, Yuxin Ma, Yuan Tian, Yi Xu

In this paper, we deal with the problem of monocular depth estimation for fisheye cameras in a self-supervised manner. A known issue of self-supervised depth estimation is that it suffers in low-light/over-exposure conditions and in large homogeneous regions. To tackle this issue, we propose a novel ordinal distillation loss that distills ordinal information from a large teacher model. Having been trained on a large amount of diverse data, such a teacher model captures depth ordering well but does not preserve accurate scene geometry. We show that, combined with self-supervised losses, our model can not only generate reasonable depth maps in challenging environments but also better recover the scene geometry. We further leverage the fisheye cameras of an AR-Glasses device to collect an indoor dataset to facilitate evaluation.
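
One plausible form of an ordinal distillation loss, in PyTorch: the student is penalized whenever it reverses the teacher's depth ordering on randomly sampled pixel pairs, without ever copying the teacher's metric values. The sampling scheme and hinge form are assumptions:

import torch
import torch.nn.functional as F

def ordinal_distillation_loss(student, teacher, num_pairs=4096, margin=0.0):
    """student, teacher: (B, 1, H, W) depth maps (teacher has no gradient)."""
    s, t = student.flatten(1), teacher.flatten(1)      # (B, HW)
    b, n = s.shape
    i = torch.randint(0, n, (b, num_pairs), device=s.device)
    j = torch.randint(0, n, (b, num_pairs), device=s.device)
    # ordinal label from the teacher: which pixel of the pair is farther
    sign = torch.sign(t.gather(1, i) - t.gather(1, j))
    diff = s.gather(1, i) - s.gather(1, j)
    # hinge: penalize pairs whose student ordering disagrees with the teacher
    return F.relu(margin - sign * diff)[sign != 0].mean()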
