Chi-Wing Fu

Guided Point Contrastive Learning for Semi-supervised Point Cloud Semantic Segmentation

Oct 15, 2021
Li Jiang, Shaoshuai Shi, Zhuotao Tian, Xin Lai, Shu Liu, Chi-Wing Fu, Jiaya Jia

Rapid progress in 3D semantic segmentation is inseparable from advances in deep network models, which rely heavily on large-scale annotated data for training. To address the high cost and challenges of 3D point-level labeling, we present a method for semi-supervised point cloud semantic segmentation that uses unlabeled point clouds in training to boost model performance. Inspired by the recent contrastive loss in self-supervised tasks, we propose the guided point contrastive loss to enhance the feature representation and model generalization ability in the semi-supervised setting. Semantic predictions on unlabeled point clouds serve as pseudo-label guidance in our loss to avoid negative pairs in the same category. We also design confidence guidance to ensure high-quality feature learning. In addition, a category-balanced sampling strategy is proposed to collect positive and negative samples to mitigate the class imbalance problem. Extensive experiments on three datasets (ScanNet V2, S3DIS, and SemanticKITTI) show that our semi-supervised method improves prediction quality with unlabeled data.

* ICCV 2021
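
The pseudo-label guidance described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: it assumes an InfoNCE-style loss for a single anchor point, and the function name `guided_contrastive_loss`, the `conf_threshold` value, and the choice to apply confidence guidance by dropping low-confidence negatives are all hypothetical.

```python
import math


def guided_contrastive_loss(anchor, positive, negatives, anchor_label,
                            negative_labels, negative_conf,
                            conf_threshold=0.75, temperature=0.1):
    """InfoNCE-style loss for one anchor with pseudo-label guidance.

    Assumption (not from the paper): a negative is kept only if its
    pseudo-label differs from the anchor's and its confidence is high.
    Returns (loss, number_of_negatives_kept).
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(v):
        norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
        return [x / norm for x in v]

    a = normalize(anchor)
    pos_logit = dot(a, normalize(positive)) / temperature

    # Pseudo-label guidance: never treat a same-category point as a negative.
    kept = [
        neg
        for neg, label, conf in zip(negatives, negative_labels, negative_conf)
        if label != anchor_label and conf >= conf_threshold
    ]
    neg_logits = [dot(a, normalize(n)) / temperature for n in kept]

    # Numerically stable log-sum-exp over the positive and kept negatives.
    logits = [pos_logit] + neg_logits
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos_logit, len(kept)


if __name__ == "__main__":
    loss, kept = guided_contrastive_loss(
        anchor=[1.0, 0.0], positive=[1.0, 0.0],
        negatives=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
        anchor_label=0, negative_labels=[0, 1, 2],
        negative_conf=[0.9, 0.9, 0.5],
    )
    print(f"loss={loss:.6f}, negatives kept={kept}")
```

In this toy call, the first candidate shares the anchor's pseudo-label and the third falls below the confidence threshold, so only one negative contributes to the loss. Without the guidance, the same-category candidate would be pushed away from the anchor even though it belongs to the same class.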

Towards Accurate Alignment in Real-time 3D Hand-Mesh Reconstruction

Sep 03, 2021
Xiao Tang, Tianyu Wang, Chi-Wing Fu

3D hand-mesh reconstruction from RGB images facilitates many applications, including augmented reality (AR). However, this requires not only real-time speed and accurate hand pose and shape but also plausible mesh-image alignment. While existing works already achieve promising results, meeting all three requirements is very challenging. This paper presents a novel pipeline by decoupling the hand-mesh reconstruction task into three stages: a joint stage to predict hand joints and segmentation; a mesh stage to predict a rough hand mesh; and a refine stage to fine-tune it with an offset mesh for mesh-image alignment. With careful design in the network structure and in the loss functions, we can promote high-quality finger-level mesh-image alignment and drive the models together to deliver real-time predictions. Extensive quantitative and qualitative results on benchmark datasets demonstrate that the quality of our results outperforms the state-of-the-art methods on hand-mesh/pose precision and hand-image alignment. In the end, we also showcase several real-time AR scenarios.

SP-GAN: Sphere-Guided 3D Shape Generation and Manipulation

Aug 10, 2021
Ruihui Li, Xianzhi Li, Ka-Hei Hui, Chi-Wing Fu

We present SP-GAN, a new unsupervised sphere-guided generative model for direct synthesis of 3D shapes in the form of point clouds. Compared with existing models, SP-GAN is able to synthesize diverse and high-quality shapes with fine details and promote controllability for part-aware shape generation and manipulation, yet it is trainable without any part annotations. In SP-GAN, we incorporate a global prior (uniform points on a sphere) to spatially guide the generative process and attach a local prior (a random latent code) to each sphere point to provide local details. The key insight in our design is to disentangle the complex 3D shape generation task into global shape modeling and local structure adjustment, to ease the learning process and enhance the shape generation quality. Also, our model forms an implicit dense correspondence between the sphere points and points in every generated shape, enabling various forms of structure-aware shape manipulation beyond the existing generative models, such as part editing, part-wise shape interpolation, and multi-shape part composition. Experimental results, which include both visual and quantitative evaluations, demonstrate that our model is able to synthesize diverse point clouds with fine details and less noise, as compared with the state-of-the-art models.

* ACM Trans. Graph., Vol. 40, No. 4, Article 151. Publication date: August 2021
* SIGGRAPH 2021, website https://liruihui.github.io/publication/SP-GAN/
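
The pairing of a global prior (points on a sphere) with a local prior (a latent code) in the SP-GAN abstract above can be sketched as a generator input matrix. This is an illustration under stated assumptions, not the authors' code: the Fibonacci lattice is one common way to get near-uniform sphere points, the abstract does not say whether the latent code is shared across points or drawn per point (this sketch shares one code), and the names `fibonacci_sphere` and `build_generator_input` are hypothetical.

```python
import math
import random


def fibonacci_sphere(n):
    """Return n near-uniform points on the unit sphere (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n      # z strictly inside (-1, 1)
        r = math.sqrt(1.0 - z * z)         # radius of the circle at height z
        theta = golden_angle * i
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points


def build_generator_input(n_points, latent_dim, seed=0):
    """Attach a random latent code to every sphere point.

    Assumption: a single Gaussian code is shared by all points; each row is
    [x, y, z, code_1, ..., code_latent_dim].
    """
    rng = random.Random(seed)
    code = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return [list(p) + code for p in fibonacci_sphere(n_points)]


if __name__ == "__main__":
    rows = build_generator_input(n_points=2048, latent_dim=128)
    print(len(rows), len(rows[0]))
```

Because row i always starts from the same sphere point, outputs generated from different latent codes stay index-aligned, which is one way to picture the dense correspondence the abstract mentions.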

Accurate Grid Keypoint Learning for Efficient Video Prediction

Jul 28, 2021
Xiaojie Gao, Yueming Jin, Qi Dou, Chi-Wing Fu, Pheng-Ann Heng

Video prediction methods generally consume substantial computing resources in training and deployment. Among them, keypoint-based approaches show promising improvement in efficiency by simplifying dense image prediction to light keypoint prediction. However, keypoint locations are often modeled only as continuous coordinates, so noise from semantically insignificant deviations in videos easily disrupts learning stability, leading to inaccurate keypoint modeling. In this paper, we design a new grid keypoint learning framework, aiming at a robust and explainable intermediate keypoint representation for long-term efficient video prediction. We have two major technical contributions. First, we detect keypoints by jumping among candidate locations in our raised grid space and formulate a condensation loss to encourage meaningful keypoints with strong representative capability. Second, we introduce a 2D binary map to represent the detected grid keypoints and then suggest propagating keypoint locations with stochasticity by selecting entries in the discrete grid space, thus preserving the spatial structure of keypoints in the long-term horizon for better future frame generation. Extensive experiments verify that our method outperforms the state-of-the-art stochastic video prediction methods while saving more than 98% of computing resources. We also demonstrate our method on a robotic-assisted surgery dataset with promising results. Our code is available at https://github.com/xjgaocs/Grid-Keypoint-Learning.

* IROS 2021
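
The 2D binary map representation mentioned in the abstract above can be pictured with a short sketch. It is only an illustration of the general idea of snapping continuous keypoints to a discrete grid; the function name `keypoints_to_binary_map`, the normalized coordinate range, and the grid size are assumptions and do not come from the linked repository.

```python
def keypoints_to_binary_map(keypoints, grid_size):
    """Snap normalized (x, y) keypoints in [0, 1] onto a square binary grid.

    Returns a grid_size x grid_size list of lists whose entries are 1 where
    at least one keypoint falls into the cell and 0 elsewhere.
    """
    grid = [[0] * grid_size for _ in range(grid_size)]
    for x, y in keypoints:
        # Clamp so that coordinates on or slightly past the border stay valid.
        col = max(0, min(int(x * grid_size), grid_size - 1))
        row = max(0, min(int(y * grid_size), grid_size - 1))
        grid[row][col] = 1
    return grid


if __name__ == "__main__":
    # Two nearly identical keypoints and one distant keypoint.
    frame = keypoints_to_binary_map([(0.10, 0.10), (0.12, 0.11), (0.90, 0.30)], 4)
    for row in frame:
        print(row)
```

The first two keypoints differ only slightly and land in the same cell, so the map is unchanged by that small deviation. This is the kind of robustness to insignificant coordinate noise that a discrete grid offers over raw continuous coordinates.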

FloorLevel-Net: Recognizing Floor-Level Lines with Height-Attention-Guided Multi-task Learning

Jul 06, 2021
Mengyang Wu, Wei Zeng, Chi-Wing Fu

The ability to recognize the position and order of the floor-level lines that divide adjacent building floors can benefit many applications, for example, urban augmented reality (AR). This work tackles the problem of locating floor-level lines in street-view images, using a supervised deep learning approach. Unfortunately, very little data is available for training such a network: current street-view datasets contain either semantic annotations that lack geometric attributes, or rectified facades without perspective priors. To address this issue, we first compile a new dataset and develop a new data augmentation scheme to synthesize training samples by harnessing (i) the rich semantics of existing rectified facades and (ii) perspective priors of buildings in diverse street views. Next, we design FloorLevel-Net, a multi-task learning network that associates explicit features of building facades and implicit floor-level lines, along with a height-attention mechanism to help enforce a vertical ordering of floor-level lines. The generated segmentations are then passed to a second-stage geometry post-processing to exploit self-constrained geometric priors for plausible and consistent reconstruction of floor-level lines. Quantitative and qualitative evaluations conducted on assorted facades in existing datasets and street views from Google demonstrate the effectiveness of our approach. Also, we present context-aware image overlay results and show the potential of our approach in enriching AR-related applications.

Point Cloud Upsampling via Disentangled Refinement

Jun 09, 2021
Ruihui Li, Xianzhi Li, Pheng-Ann Heng, Chi-Wing Fu

Point clouds produced by 3D scanning are often sparse, non-uniform, and noisy. Recent upsampling approaches aim to generate a dense point set, while achieving both distribution uniformity and proximity-to-surface, and possibly amending small holes, all in a single network. After revisiting the task, we propose to disentangle the task based on its multi-objective nature and formulate two cascaded sub-networks, a dense generator and a spatial refiner. The dense generator infers a coarse but dense output that roughly describes the underlying surface, while the spatial refiner further fine-tunes the coarse output by adjusting the location of each point. Specifically, we design a pair of local and global refinement units in the spatial refiner to evolve a coarse feature map. Also, in the spatial refiner, we regress a per-point offset vector to further adjust the coarse outputs at a fine scale. Extensive qualitative and quantitative results on both synthetic and real-scanned datasets demonstrate the superiority of our method over state-of-the-art methods.

* CVPR 2021, website https://liruihui.github.io/

SE-SSD: Self-Ensembling Single-Stage Object Detector From Point Cloud

Apr 20, 2021
Wu Zheng, Weiliang Tang, Li Jiang, Chi-Wing Fu

We present Self-Ensembling Single-Stage object Detector (SE-SSD) for accurate and efficient 3D object detection in outdoor point clouds. Our key focus is on exploiting both soft and hard targets with our formulated constraints to jointly optimize the model, without introducing extra computation in the inference. Specifically, SE-SSD contains a pair of teacher and student SSDs, in which we design an effective IoU-based matching strategy to filter soft targets from the teacher and formulate a consistency loss to align student predictions with them. Also, to maximize the distilled knowledge for ensembling the teacher, we design a new augmentation scheme to produce shape-aware augmented samples to train the student, aiming to encourage it to infer complete object shapes. Lastly, to better exploit hard targets, we design an ODIoU loss to supervise the student with constraints on the predicted box centers and orientations. Our SE-SSD attains top performance compared with all prior published works. Also, it attains top precisions for car detection in the KITTI benchmark (ranked 1st and 2nd on the BEV and 3D leaderboards, respectively) with an ultra-high inference speed. The code is available at https://github.com/Vegeta2020/SE-SSD.

* Accepted by CVPR 2021
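
The IoU-based filtering of teacher soft targets described in the SE-SSD abstract above can be sketched in simplified form. This is not the code from the linked repository: it uses axis-aligned 2D boxes instead of rotated 3D boxes, and the function names `iou_2d` and `match_soft_targets` and the `iou_threshold` value are hypothetical.

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def match_soft_targets(student_boxes, teacher_boxes, iou_threshold=0.7):
    """Pair each student box with its best-overlapping teacher box.

    Assumption: only pairs whose IoU reaches the threshold are kept, so a
    consistency loss would be computed on well-aligned pairs only.
    Returns a list of (student_index, teacher_index, iou) tuples.
    """
    pairs = []
    for s_idx, s_box in enumerate(student_boxes):
        best_idx, best_iou = None, 0.0
        for t_idx, t_box in enumerate(teacher_boxes):
            value = iou_2d(s_box, t_box)
            if value > best_iou:
                best_idx, best_iou = t_idx, value
        if best_idx is not None and best_iou >= iou_threshold:
            pairs.append((s_idx, best_idx, best_iou))
    return pairs


if __name__ == "__main__":
    student = [(0.0, 0.0, 2.0, 2.0), (10.0, 10.0, 12.0, 12.0)]
    teacher = [(0.0, 0.0, 2.0, 2.2), (5.0, 5.0, 6.0, 6.0)]
    print(match_soft_targets(student, teacher))
```

In the toy call, the second student box overlaps no teacher box and is therefore excluded, so a consistency term would only be computed for the first pair.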

One Thing One Click: A Self-Training Approach for Weakly Supervised 3D Semantic Segmentation

Apr 16, 2021
Zhengzhe Liu, Xiaojuan Qi, Chi-Wing Fu

Point cloud semantic segmentation often requires large-scale annotated training data, but point-wise labels are too tedious to prepare. While some recent methods propose to train a 3D network with small percentages of point labels, we take the approach to an extreme and propose "One Thing One Click," meaning that the annotator only needs to label one point per object. To leverage these extremely sparse labels in network training, we design a novel self-training approach, in which we iteratively conduct the training and label propagation, facilitated by a graph propagation module. Also, we adopt a relation network to generate per-category prototypes and explicitly model the similarity among graph nodes to generate pseudo labels to guide the iterative training. Experimental results on both ScanNet-v2 and S3DIS show that our self-training approach, with extremely sparse annotations, outperforms all existing weakly supervised methods for 3D semantic segmentation by a large margin, and our results are also comparable to those of the fully supervised counterparts.

* Accepted by CVPR 2021

3D-to-2D Distillation for Indoor Scene Parsing

Apr 07, 2021
Zhengzhe Liu, Xiaojuan Qi, Chi-Wing Fu

Indoor scene semantic parsing from RGB images is very challenging due to occlusions, object distortion, and viewpoint variations. Going beyond prior works that leverage geometry information, typically paired depth maps, we present a new approach, a 3D-to-2D distillation framework, that enables us to leverage 3D features extracted from a large-scale 3D data repository (e.g., ScanNet-v2) to enhance 2D features extracted from RGB images. Our work has three novel contributions. First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training, so the 2D network can infer without requiring 3D data. Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration. Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data. Extensive experiments on various datasets (ScanNet-v2, S3DIS, and NYU-v2) demonstrate the superiority of our approach. Also, experimental results show that our 3D-to-2D distillation improves the model generalization.

* Accepted by CVPR 2021
