
Chenhang He


Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations

May 18, 2023
Weiwei Lin, Chenhang He, Man-Wai Mak, Youzhi Tu

Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition (ASR) and proved to be extremely useful when labeled data are scarce. However, the success of SSL models has yet to transfer to utterance-level tasks such as speaker, emotion, and language recognition, which still require supervised fine-tuning of the SSL models to obtain good performance. We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective for these tasks. Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered hidden acoustic units to align the SSL features. The underlying utterance-level representations are disentangled from the content of speech using probabilistic inference on the aligned features. Furthermore, the variational lower bound derived from the FA model provides an utterance-level objective, allowing error gradients to be backpropagated to the Transformer layers to learn highly discriminative acoustic units. When used in conjunction with HuBERT's masked prediction training, our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
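A minimal sketch of the alignment-and-inference idea the abstract describes, phrased as a classical factor-analysis posterior computed from SSL frame features aligned to k-means acoustic units. This mirrors i-vector-style inference rather than the paper's neural FA model; the identity residual covariance and the names (n_units, latent_dim) are assumptions for illustration.

```python
# Hedged sketch: posterior mean of an utterance-level factor from frame-level
# SSL features aligned to discovered acoustic units. Not the paper's exact model.
import numpy as np

def utterance_factor(feats, units, loadings, means, n_units, latent_dim):
    """feats: (num_frames, D) SSL features; units: (num_frames,) cluster ids.
    loadings: (n_units, D, latent_dim) per-unit loading matrices; means: (n_units, D)."""
    precision = np.eye(latent_dim)
    proj = np.zeros(latent_dim)
    for c in range(n_units):
        fc = feats[units == c]
        if len(fc) == 0:
            continue
        n_c = len(fc)                          # zeroth-order statistic
        f_c = (fc - means[c]).sum(axis=0)      # centered first-order statistic
        precision += n_c * loadings[c].T @ loadings[c]   # identity residual covariance assumed
        proj += loadings[c].T @ f_c
    return np.linalg.solve(precision, proj)    # posterior mean of the utterance factor
```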

* Accepted by ICML 2023

One-to-Few Label Assignment for End-to-End Dense Detection

Mar 21, 2023
Shuai Li, Minghan Li, Ruihuang Li, Chenhang He, Lei Zhang

One-to-one (o2o) label assignment plays a key role in transformer-based end-to-end detection, and it has recently been introduced into fully convolutional detectors for end-to-end dense detection. However, o2o can degrade feature learning efficiency due to the limited number of positive samples. Although recent DETR variants introduce extra positive samples to mitigate this issue, the computation of self- and cross-attention in the decoder limits their practical application to dense and fully convolutional detectors. In this work, we propose a simple yet effective one-to-few (o2f) label assignment strategy for end-to-end dense detection. Apart from defining one positive and many negative anchors for each object, we define several soft anchors, which serve as positive and negative samples simultaneously. The positive and negative weights of these soft anchors are dynamically adjusted during training so that they contribute more to "representation learning" in the early training stage and more to "duplicated prediction removal" in the later stage. A detector trained in this way can not only learn a strong feature representation but also perform end-to-end dense detection. Experiments on the COCO and CrowdHuman datasets demonstrate the effectiveness of the o2f scheme. Code is available at https://github.com/strongwolf/o2f.
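A minimal sketch of the time-varying weighting of soft anchors: early in training they act mostly as positives (representation learning), later mostly as negatives (duplicate removal). The linear schedule and the bounds w_max/w_min are illustrative assumptions, not the paper's exact weighting functions.

```python
# Hedged sketch of a soft-anchor weight schedule; forms and constants are assumed.
def soft_anchor_weights(epoch, total_epochs, w_max=0.8, w_min=0.2):
    progress = epoch / max(total_epochs - 1, 1)
    pos_w = w_max - (w_max - w_min) * progress   # positive weight decays over training
    neg_w = 1.0 - pos_w                          # complementary negative weight
    return pos_w, neg_w
```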

* Accepted by CVPR2023 

MSF: Motion-guided Sequential Fusion for Efficient 3D Object Detection from Point Cloud Sequences

Mar 15, 2023
Chenhang He, Ruihuang Li, Yabin Zhang, Shuai Li, Lei Zhang

Point cloud sequences are commonly used to accurately detect 3D objects in applications such as autonomous driving. Current top-performing multi-frame detectors mostly follow a Detect-and-Fuse framework, which extracts features from each frame of the sequence and fuses them to detect the objects in the current frame. However, this inevitably leads to redundant computation since adjacent frames are highly correlated. In this paper, we propose an efficient Motion-guided Sequential Fusion (MSF) method, which exploits the continuity of object motion to mine useful sequential contexts for object detection in the current frame. We first generate 3D proposals on the current frame and propagate them to preceding frames based on the estimated velocities. The points-of-interest are then pooled from the sequence and encoded as proposal features. A novel Bidirectional Feature Aggregation (BiFA) module is further proposed to facilitate the interactions of proposal features across frames. Besides, we optimize the point cloud pooling by a voxel-based sampling technique so that millions of points can be processed in several milliseconds. The proposed MSF method achieves not only better efficiency than other multi-frame detectors but also leading accuracy, with 83.12% and 78.30% mAP on the LEVEL1 and LEVEL2 test sets of Waymo Open Dataset, respectively. Codes can be found at https://github.com/skyhehe123/MSF.
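A minimal sketch of the proposal-propagation step: boxes detected on the current frame are shifted backwards in time with their estimated velocities, so points-of-interest can be pooled from earlier frames. The box layout (x, y, z, l, w, h, yaw), the planar velocity, and the constant-velocity motion model are assumptions, not the released implementation.

```python
# Hedged sketch of MSF-style proposal propagation to preceding frames.
import torch

def propagate_proposals(boxes, velocities, frame_offsets, dt=0.1):
    """boxes: (N, 7) proposals on the current frame; velocities: (N, 2) in the x-y plane;
    frame_offsets: positive ints, how many frames back each preceding frame lies."""
    propagated = []
    for k in frame_offsets:
        shifted = boxes.clone()
        shifted[:, :2] -= velocities * (k * dt)   # move box centers back in time
        propagated.append(shifted)
    return propagated                             # one proposal set per preceding frame
```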

* Accepted by CVPR2023 

DynaMask: Dynamic Mask Selection for Instance Segmentation

Mar 14, 2023
Ruihuang Li, Chenhang He, Shuai Li, Yabin Zhang, Lei Zhang

Representative instance segmentation methods mostly segment different object instances with a mask of fixed resolution, e.g., a 28×28 grid. However, a low-resolution mask loses rich details, while a high-resolution mask incurs quadratic computation overhead. Predicting the optimal binary mask for each instance is therefore a challenging task. In this paper, we propose to dynamically select suitable masks for different object proposals. First, a dual-level Feature Pyramid Network (FPN) with adaptive feature aggregation is developed to gradually increase the mask grid resolution, ensuring high-quality segmentation of objects. Specifically, an efficient region-level top-down path (r-FPN) is introduced to incorporate complementary contextual and detailed information from different stages of the image-level FPN (i-FPN). Then, to alleviate the increase in computation and memory costs caused by using large masks, we develop a Mask Switch Module (MSM) with negligible computational cost to select the most suitable mask resolution for each instance, achieving high efficiency while maintaining high segmentation accuracy. Without bells and whistles, the proposed method, namely DynaMask, brings consistent and noticeable performance improvements over other state-of-the-art methods at a moderate computation overhead. Source code: https://github.com/lslrh/DynaMask.
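A minimal sketch of per-instance resolution selection in the spirit of the Mask Switch Module: a tiny classifier over candidate mask resolutions, applied to each proposal's RoI features. The candidate resolution set, the pooling, and the layer sizes are assumptions for illustration, not the paper's module.

```python
# Hedged sketch of a resolution-selection head; sizes and candidates are assumed.
import torch
import torch.nn as nn

class ResolutionSelector(nn.Module):
    def __init__(self, in_dim=256, resolutions=(14, 28, 56, 112)):
        super().__init__()
        self.resolutions = resolutions
        self.head = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, len(resolutions)),
        )

    def forward(self, roi_feats):                 # roi_feats: (N, C, H, W)
        pooled = roi_feats.mean(dim=(2, 3))       # global average pool per proposal
        logits = self.head(pooled)                # (N, num_resolutions)
        choice = logits.argmax(dim=1)             # hard pick at inference time
        return [self.resolutions[i] for i in choice.tolist()]
```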

* Accepted by CVPR2023 

SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation

Mar 14, 2023
Ruihuang Li, Chenhang He, Yabin Zhang, Shuai Li, Liyi Chen, Lei Zhang

Weakly supervised instance segmentation using only bounding box annotations has recently attracted much research attention. Most current efforts leverage low-level image features as extra supervision without explicitly exploiting the high-level semantic information of the objects, which becomes ineffective when foreground objects have appearances similar to the background or to other objects nearby. We propose a new box-supervised instance segmentation approach by developing a Semantic-aware Instance Mask (SIM) generation paradigm. Instead of heavily relying on local pair-wise affinities among neighboring pixels, we construct a group of category-wise feature centroids as prototypes to identify foreground objects and assign them semantic-level pseudo labels. Considering that the semantic-aware prototypes cannot distinguish different instances of the same semantics, we propose a self-correction mechanism to rectify falsely activated regions while enhancing the correct ones. Furthermore, to handle occlusions between objects, we tailor the Copy-Paste operation to the weakly supervised instance segmentation task to augment challenging training data. Extensive experimental results demonstrate the superiority of the proposed SIM approach over other state-of-the-art methods. Source code: https://github.com/lslrh/SIM.
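A minimal sketch of prototype-based pseudo labeling as described in the abstract: pixel features are compared against category-wise centroids (prototypes) and receive a semantic pseudo label when the similarity is high enough. Cosine similarity, the threshold, and the ignore index are illustrative choices, not the paper's exact formulation.

```python
# Hedged sketch of semantic pseudo labeling from category prototypes.
import torch
import torch.nn.functional as F

def semantic_pseudo_labels(pixel_feats, prototypes, threshold=0.7, ignore_index=255):
    """pixel_feats: (num_pixels, D); prototypes: (num_classes, D)."""
    feats_n = F.normalize(pixel_feats, dim=1)
    protos_n = F.normalize(prototypes, dim=1)
    sims = feats_n @ protos_n.t()                  # (num_pixels, num_classes) cosine sims
    scores, labels = sims.max(dim=1)
    labels[scores < threshold] = ignore_index      # leave uncertain pixels unlabeled
    return labels
```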

* Accepted by CVPR2023 

Masked Surfel Prediction for Self-Supervised Point Cloud Learning

Jul 07, 2022
Yabin Zhang, Jiehong Lin, Chenhang He, Yongwei Chen, Kui Jia, Lei Zhang

Masked auto-encoding is a popular and effective self-supervised learning approach for point cloud learning. However, most existing methods reconstruct only the masked points and overlook the local geometry, which is also important for understanding point cloud data. In this work, we make the first attempt, to the best of our knowledge, to explicitly incorporate local geometry information into masked auto-encoding, and propose a novel Masked Surfel Prediction (MaskSurf) method. Specifically, given an input point cloud masked at a high ratio, we learn a transformer-based encoder-decoder network to estimate the underlying masked surfels by simultaneously predicting the surfel positions (i.e., points) and per-surfel orientations (i.e., normals). The predictions of points and normals are supervised by the Chamfer Distance and a newly introduced Position-Indexed Normal Distance in a set-to-set manner. MaskSurf is validated on six downstream tasks under three fine-tuning strategies. In particular, MaskSurf outperforms its closest competitor, Point-MAE, by 1.2% on the real-world ScanObjectNN dataset under the OBJ-BG setting, justifying the advantages of masked surfel prediction over masked point cloud reconstruction. Codes will be available at https://github.com/YBZh/MaskSurf.
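A minimal sketch of the two reconstruction losses named in the abstract: the Chamfer Distance on predicted points, and a position-indexed normal distance that compares normals at the correspondences found by the point matching. The exact normalization and symmetrization of the terms are assumptions.

```python
# Hedged sketch of Chamfer Distance plus a position-indexed normal distance.
import torch

def chamfer_and_normal_loss(pred_pts, gt_pts, pred_nrm, gt_nrm):
    """pred_pts: (N, 3), gt_pts: (M, 3); normals (N, 3)/(M, 3), assumed unit-length."""
    dists = torch.cdist(pred_pts, gt_pts)           # (N, M) pairwise Euclidean distances
    d_p2g, idx_p2g = dists.min(dim=1)               # nearest gt point per prediction
    d_g2p, idx_g2p = dists.min(dim=0)               # nearest prediction per gt point
    chamfer = d_p2g.pow(2).mean() + d_g2p.pow(2).mean()
    # normal error measured at the position-matched pairs (orientation-agnostic)
    n_p2g = 1.0 - (pred_nrm * gt_nrm[idx_p2g]).sum(dim=1).abs()
    n_g2p = 1.0 - (gt_nrm * pred_nrm[idx_g2p]).sum(dim=1).abs()
    normal = n_p2g.mean() + n_g2p.mean()
    return chamfer, normal
```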

* Codes will be available at https://github.com/YBZh/MaskSurf 

Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds

Mar 19, 2022
Chenhang He, Ruihuang Li, Shuai Li, Lei Zhang

Transformers have demonstrated promising performance on many 2D vision tasks. However, computing self-attention on large-scale point cloud data is cumbersome because a point cloud is a long sequence that is unevenly distributed in 3D space. To solve this issue, existing methods usually compute self-attention locally by grouping the points into clusters of the same size, or perform convolutional self-attention on a discretized representation. However, the former results in stochastic point dropout, while the latter typically has narrow attention fields. In this paper, we propose a novel voxel-based architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from point clouds by means of set-to-set translation. VoxSeT is built upon a voxel-based set attention (VSA) module, which reduces the self-attention in each voxel to two cross-attentions and models features in a hidden space induced by a group of latent codes. With the VSA module, VoxSeT can manage voxelized point clusters of arbitrary size over a wide range, and process them in parallel with linear complexity. The proposed VoxSeT integrates the high performance of transformers with the efficiency of voxel-based models, and can serve as a good alternative to convolutional and point-based backbones. VoxSeT reports competitive results on the KITTI and Waymo detection benchmarks. The source code can be found at https://github.com/skyhehe123/VoxSeT.
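A minimal sketch of the induced set-attention pattern behind the VSA module: a small set of learned latent codes first cross-attends to the points in a voxel, then the points cross-attend back to the latents, giving cost linear in the number of points. Feature dimensions, head counts, and the number of latents are illustrative assumptions rather than the released architecture.

```python
# Hedged sketch of two cross-attentions through latent codes (induced set attention).
import torch
import torch.nn as nn

class InducedSetAttention(nn.Module):
    def __init__(self, dim=128, num_latents=8, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.to_latent = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_points = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, point_feats):                   # (B, N, dim), N points per voxel
        b = point_feats.size(0)
        lat = self.latents.unsqueeze(0).expand(b, -1, -1)
        lat, _ = self.to_latent(lat, point_feats, point_feats)  # latents gather point info
        out, _ = self.to_points(point_feats, lat, lat)          # points read latents back
        return out                                    # (B, N, dim), linear in N
```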

* 11 pages, 4 figures, CVPR2022 

Class-Balanced Pixel-Level Self-Labeling for Domain Adaptive Semantic Segmentation

Mar 18, 2022
Ruihuang Li, Shuai Li, Chenhang He, Yabin Zhang, Xu Jia, Lei Zhang

Domain adaptive semantic segmentation aims to learn a model with the supervision of source domain data and produce satisfactory dense predictions on the unlabeled target domain. One popular solution to this challenging task is self-training, which selects high-scoring predictions on target samples as pseudo labels for training. However, the produced pseudo labels often contain much noise because the model is biased toward the source domain as well as the majority categories. To address these issues, we propose to directly explore the intrinsic pixel distributions of target domain data, instead of heavily relying on the source domain. Specifically, we simultaneously cluster pixels and rectify pseudo labels with the obtained cluster assignments. This process is done in an online fashion so that the pseudo labels can co-evolve with the segmentation model without extra training rounds. To overcome the class imbalance problem on long-tailed categories, we employ a distribution alignment technique that enforces the marginal class distribution of cluster assignments to be close to that of the pseudo labels. The proposed method, namely Class-balanced Pixel-level Self-Labeling (CPSL), improves segmentation performance on the target domain over state-of-the-art methods by a large margin, especially on long-tailed categories.
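A minimal sketch of the distribution-alignment step described in the abstract: soft cluster assignments are rescaled so that their marginal class distribution matches that of the pseudo labels, counteracting the bias toward head classes. This simple reweight-and-renormalize form is an assumption, not necessarily the technique used in CPSL.

```python
# Hedged sketch of marginal distribution alignment for cluster assignments.
import torch

def align_cluster_assignments(soft_assign, pseudo_label_dist, eps=1e-6):
    """soft_assign: (num_pixels, C) probabilities; pseudo_label_dist: (C,) target marginal."""
    current_dist = soft_assign.mean(dim=0)                      # empirical class marginal
    scaled = soft_assign * (pseudo_label_dist / (current_dist + eps))
    return scaled / scaled.sum(dim=1, keepdim=True)             # renormalize per pixel
```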

* This paper has been accepted by CVPR 2022 

A Dual Weighting Label Assignment Scheme for Object Detection

Mar 18, 2022
Shuai Li, Chenhang He, Ruihuang Li, Lei Zhang

Label assignment (LA), which aims to assign each training sample a positive (pos) and a negative (neg) loss weight, plays an important role in object detection. Existing LA methods mostly focus on the design of the pos weighting function, while the neg weight is directly derived from the pos weight. Such a mechanism limits the learning capacity of detectors. In this paper, we explore a new weighting paradigm, termed dual weighting (DW), that specifies pos and neg weights separately. We first identify the key influential factors of pos/neg weights by analyzing the evaluation metrics in object detection, and then design the pos and neg weighting functions accordingly. Specifically, the pos weight of a sample is determined by the consistency between its classification and localization scores, while the neg weight is decomposed into two terms: the probability that it is a neg sample and its importance conditioned on being a neg sample. Such a weighting strategy offers greater flexibility to distinguish between important and less important samples, resulting in a more effective object detector. Equipped with the proposed DW method, a single FCOS-ResNet-50 detector reaches 41.5% mAP on COCO under the 1x schedule, outperforming other existing LA methods. It consistently improves the baselines on COCO by a large margin under various backbones without bells and whistles. Code is available at https://github.com/strongwolf/DW.
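A minimal sketch of the dual-weighting idea as stated in the abstract: the positive weight reflects how consistent a sample's classification score and localization quality are, while the negative weight is the product of how likely the sample is a negative and how important it is as one. The specific functional forms (the geometric mean, the focal-style power) are illustrative assumptions, not the paper's weighting functions.

```python
# Hedged sketch of separate positive/negative loss weights per candidate anchor.
import torch

def dual_weights(cls_score, iou, gamma=2.0):
    """cls_score, iou: (N,) tensors in [0, 1] for candidate anchors of one object."""
    pos_w = (cls_score * iou).sqrt()              # consistency of classification and localization
    prob_neg = 1.0 - iou.clamp(0.0, 1.0)          # likelihood of being a negative sample
    importance = cls_score.pow(gamma)             # confidently scored negatives matter more
    neg_w = prob_neg * importance
    return pos_w, neg_w
```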

* Accepted by CVPR2022 