Ruihuang Li

One-to-Few Label Assignment for End-to-End Dense Detection

Mar 21, 2023
Shuai Li, Minghan Li, Ruihuang Li, Chenhang He, Lei Zhang

One-to-one (o2o) label assignment plays a key role in transformer-based end-to-end detection, and it has recently been introduced into fully convolutional detectors for end-to-end dense detection. However, o2o can degrade feature learning efficiency due to the limited number of positive samples. Although recent DETR variants introduce extra positive samples to mitigate this issue, the computation of self- and cross-attention in the decoder limits their practical application to dense and fully convolutional detectors. In this work, we propose a simple yet effective one-to-few (o2f) label assignment strategy for end-to-end dense detection. Apart from defining one positive and many negative anchors for each object, we define several soft anchors, which serve as positive and negative samples simultaneously. The positive and negative weights of these soft anchors are dynamically adjusted during training so that they contribute more to "representation learning" in the early training stage and more to "duplicated prediction removal" in the later stage. A detector trained in this way can not only learn a strong feature representation but also perform end-to-end dense detection. Experiments on the COCO and CrowdHuman datasets demonstrate the effectiveness of the o2f scheme. Code is available at https://github.com/strongwolf/o2f.
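
As a rough illustration of the idea, the sketch below shows how a soft anchor could act as a weighted positive and a weighted negative sample at the same time, with the weights scheduled over training progress. The linear schedule, the t_min floor, and the loss form are hypothetical choices for illustration, not the weighting defined in the paper.

```python
import torch

def soft_anchor_weights(num_ambiguous: int, progress: float, t_min: float = 0.2):
    """Hypothetical schedule for o2f-style soft anchors.

    progress: training progress in [0, 1]. Returns per-anchor positive and
    negative weights that start mostly positive (favoring representation
    learning) and end mostly negative (favoring duplicate suppression).
    """
    pos_w = 1.0 - (1.0 - t_min) * progress        # decays linearly to t_min
    neg_w = 1.0 - pos_w
    return (torch.full((num_ambiguous,), pos_w),
            torch.full((num_ambiguous,), neg_w))

def soft_cls_loss(logits: torch.Tensor, pos_w: torch.Tensor, neg_w: torch.Tensor):
    """Binary cross-entropy in which each soft anchor is simultaneously a
    weighted positive and a weighted negative sample."""
    p = logits.sigmoid()
    return -(pos_w * torch.log(p + 1e-8) + neg_w * torch.log(1 - p + 1e-8)).mean()

# Early training: soft anchors act mostly as positives.
pos, neg = soft_anchor_weights(3, progress=0.1)
print(soft_cls_loss(torch.randn(3), pos, neg))
```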

* Accepted by CVPR2023 

MSF: Motion-guided Sequential Fusion for Efficient 3D Object Detection from Point Cloud Sequences

Mar 15, 2023
Chenhang He, Ruihuang Li, Yabin Zhang, Shuai Li, Lei Zhang

Point cloud sequences are commonly used to accurately detect 3D objects in applications such as autonomous driving. Current top-performing multi-frame detectors mostly follow a Detect-and-Fuse framework, which extracts features from each frame of the sequence and fuses them to detect objects in the current frame. However, this inevitably leads to redundant computation, since adjacent frames are highly correlated. In this paper, we propose an efficient Motion-guided Sequential Fusion (MSF) method, which exploits the continuity of object motion to mine useful sequential context for object detection in the current frame. We first generate 3D proposals on the current frame and propagate them to preceding frames based on the estimated velocities. The points of interest are then pooled from the sequence and encoded as proposal features. A novel Bidirectional Feature Aggregation (BiFA) module is further proposed to facilitate the interaction of proposal features across frames. In addition, we optimize point cloud pooling with a voxel-based sampling technique so that millions of points can be processed in several milliseconds. The proposed MSF method achieves not only better efficiency than other multi-frame detectors but also leading accuracy, with 83.12% and 78.30% mAP on the LEVEL1 and LEVEL2 test sets of the Waymo Open Dataset, respectively. Code can be found at https://github.com/skyhehe123/MSF.
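
A minimal sketch of the proposal-propagation step, assuming each proposal carries an estimated 2D velocity and a fixed inter-frame gap; box size and heading are kept unchanged, which is a simplification of the full method.

```python
import torch

def propagate_proposals(boxes: torch.Tensor, velocities: torch.Tensor,
                        num_prev_frames: int, dt: float = 0.1):
    """Shift current-frame 3D proposals back in time along their estimated
    velocities, producing one set of boxes per preceding frame.

    boxes:      (N, 7) tensor of [x, y, z, dx, dy, dz, yaw].
    velocities: (N, 2) tensor of per-box [vx, vy] in m/s.
    dt:         time gap between consecutive frames in seconds.
    """
    propagated = []
    for k in range(1, num_prev_frames + 1):
        shifted = boxes.clone()
        # Move the box center backwards by k frames; size and yaw are kept
        # fixed in this simplified sketch.
        shifted[:, 0:2] -= velocities * (k * dt)
        propagated.append(shifted)
    return propagated  # list of (N, 7) tensors, one per preceding frame
```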

* Accepted by CVPR2023 

DynaMask: Dynamic Mask Selection for Instance Segmentation

Mar 14, 2023
Ruihuang Li, Chenhang He, Shuai Li, Yabin Zhang, Lei Zhang

Representative instance segmentation methods mostly segment different object instances with a mask of fixed resolution, e.g., a 28×28 grid. However, a low-resolution mask loses rich details, while a high-resolution mask incurs quadratic computation overhead. Predicting the optimal binary mask for each instance is therefore a challenging task. In this paper, we propose to dynamically select suitable masks for different object proposals. First, a dual-level Feature Pyramid Network (FPN) with adaptive feature aggregation is developed to gradually increase the mask grid resolution, ensuring high-quality segmentation of objects. Specifically, an efficient region-level top-down path (r-FPN) is introduced to incorporate complementary contextual and detailed information from different stages of the image-level FPN (i-FPN). Then, to alleviate the increased computation and memory costs of using large masks, we develop a Mask Switch Module (MSM) with negligible computational cost to select the most suitable mask resolution for each instance, achieving high efficiency while maintaining high segmentation accuracy. Without bells and whistles, the proposed method, namely DynaMask, brings consistent and noticeable performance improvements over other state-of-the-art methods at a moderate computation overhead. The source code is available at https://github.com/lslrh/DynaMask.
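
The resolution-selection idea can be sketched as a tiny scorer over a set of candidate mask resolutions. The pooling and linear layers, the candidate set, and the hard argmax below are illustrative assumptions; the actual MSM additionally uses a budget constraint and a differentiable selection scheme.

```python
import torch
import torch.nn as nn

class MaskSwitch(nn.Module):
    """Minimal sketch of per-instance mask-resolution selection."""

    def __init__(self, in_channels: int = 256, resolutions=(14, 28, 56, 112)):
        super().__init__()
        self.resolutions = resolutions
        self.scorer = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_channels, len(resolutions)),
        )

    def forward(self, roi_feats: torch.Tensor):
        # roi_feats: (N, C, H, W) pooled features, one per proposal.
        logits = self.scorer(roi_feats)        # (N, num_resolutions)
        choice = logits.argmax(dim=1)          # hard choice at inference
        return [self.resolutions[i] for i in choice.tolist()]

# Usage: resolutions = MaskSwitch()(torch.randn(8, 256, 14, 14))
```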

* Accepted by CVPR2023 

SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation

Mar 14, 2023
Ruihuang Li, Chenhang He, Yabin Zhang, Shuai Li, Liyi Chen, Lei Zhang

Weakly supervised instance segmentation using only bounding box annotations has recently attracted much research attention. Most current efforts leverage low-level image features as extra supervision without explicitly exploiting the high-level semantic information of the objects, which becomes ineffective when the foreground objects have appearances similar to the background or to other objects nearby. We propose a new box-supervised instance segmentation approach by developing a Semantic-aware Instance Mask (SIM) generation paradigm. Instead of heavily relying on local pair-wise affinities among neighboring pixels, we construct a group of category-wise feature centroids as prototypes to identify foreground objects and assign them semantic-level pseudo labels. Considering that the semantic-aware prototypes cannot distinguish different instances of the same semantics, we propose a self-correction mechanism to rectify the falsely activated regions while enhancing the correct ones. Furthermore, to handle occlusions between objects, we tailor the Copy-Paste operation to the weakly supervised instance segmentation task to augment challenging training data. Extensive experimental results demonstrate the superiority of the proposed SIM approach over other state-of-the-art methods. The source code is available at https://github.com/lslrh/SIM.
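
A simplified sketch of the prototype-based pseudo-labeling step: each pixel embedding is compared with category-wise centroids by cosine similarity, and uncertain pixels are ignored. The threshold value and the absence of the self-correction mechanism are simplifications.

```python
import torch
import torch.nn.functional as F

def assign_semantic_pseudo_labels(feats: torch.Tensor, prototypes: torch.Tensor,
                                  threshold: float = 0.7):
    """Assign each pixel a semantic pseudo label via cosine similarity to
    category-wise feature centroids; low-confidence pixels get label -1.

    feats:      (C, H, W) pixel embeddings.
    prototypes: (K, C) one centroid per category.
    """
    c, h, w = feats.shape
    flat = F.normalize(feats.reshape(c, -1), dim=0)   # (C, HW)
    protos = F.normalize(prototypes, dim=1)           # (K, C)
    sim = protos @ flat                               # (K, HW)
    score, label = sim.max(dim=0)                     # best prototype per pixel
    label[score < threshold] = -1                     # ignore uncertain pixels
    return label.reshape(h, w)
```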

* Accepted by CVPR2023 

Adversarial Style Augmentation for Domain Generalization

Jan 30, 2023
Yabin Zhang, Bin Deng, Ruihuang Li, Kui Jia, Lei Zhang

It is well known that the performance of well-trained deep neural networks may degrade significantly when they are applied to data with even slightly shifted distributions. Recent studies have shown that introducing certain perturbations to feature statistics (e.g., mean and standard deviation) during training can enhance cross-domain generalization ability. Existing methods typically conduct such perturbation by utilizing the feature statistics within a mini-batch, limiting their representation capability. Inspired by the domain generalization objective, we introduce a novel Adversarial Style Augmentation (ASA) method, which explores broader style spaces by generating more effective statistics perturbations via adversarial training. Specifically, we first search for the most sensitive direction and intensity of the statistics perturbation by maximizing the task loss. By updating the model against the adversarial statistics perturbation during training, we allow the model to explore the worst-case domain and hence improve its generalization performance. To facilitate the application of ASA, we design a simple yet effective module, namely AdvStyle, which instantiates the ASA method in a plug-and-play manner. We justify the efficacy of AdvStyle on cross-domain classification and instance retrieval tasks, where it achieves higher mean accuracy and lower performance fluctuation. In particular, our method significantly outperforms its competitors on the PACS dataset under the single-source generalization setting, e.g., boosting the classification accuracy from 61.2% to 67.1% with a ResNet50 backbone. Our code will be available at https://github.com/YBZh/AdvStyle.
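
A one-step sketch of the adversarial statistics perturbation, assuming a classification head and a sign-based ascent step of size eps; the actual AdvStyle module may parameterize and update the perturbation differently.

```python
import torch

def adversarial_style_step(feat, labels, head, criterion, eps=0.1):
    """Perturb the channel-wise mean/std of `feat` in the direction that
    maximizes the task loss, then restyle the feature with the perturbed
    statistics. `head` maps features to logits; the single ascent step and
    step size `eps` are simplifying assumptions."""
    mu = feat.mean(dim=(2, 3), keepdim=True)             # (B, C, 1, 1)
    sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-6

    d_mu = torch.zeros_like(mu, requires_grad=True)
    d_sigma = torch.zeros_like(sigma, requires_grad=True)

    # Stylize with perturbed statistics and measure the task loss.
    stylized = (feat - mu) / sigma * (sigma + d_sigma) + (mu + d_mu)
    loss = criterion(head(stylized), labels)
    g_mu, g_sigma = torch.autograd.grad(loss, [d_mu, d_sigma])

    # Gradient-ascent step: move the statistics in the loss-increasing direction.
    with torch.no_grad():
        adv_mu = mu + eps * g_mu.sign() * mu.abs()
        adv_sigma = sigma + eps * g_sigma.sign() * sigma.abs()
    return (feat - mu) / sigma * adv_sigma + adv_mu
```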

* Initially finished in March 2022; code will be available at https://github.com/YBZh/AdvStyle 

Point-DAE: Denoising Autoencoders for Self-supervised Point Cloud Learning

Nov 13, 2022
Yabin Zhang, Jiehong Lin, Ruihuang Li, Kui Jia, Lei Zhang

Masked autoencoders have demonstrated their effectiveness in self-supervised point cloud learning. Considering that masking is a kind of corruption, in this work we explore a more general denoising autoencoder for point cloud learning (Point-DAE) by investigating more types of corruption beyond masking. Specifically, we degrade the point cloud with certain corruptions as input and learn an encoder-decoder model to reconstruct the original point cloud from its corrupted version. Three corruption families (i.e., density/masking, noise, and affine transformation) and a total of fourteen corruption types are investigated. Interestingly, the affine transformation-based Point-DAE generally outperforms the others (e.g., the popular masking corruptions), suggesting a promising direction for self-supervised point cloud learning. More importantly, we find a statistically significant linear relationship between task relatedness and model performance on downstream tasks. This finding partly demystifies the advantage of affine transformation-based Point-DAE, given that such Point-DAE variants are closely related to the downstream classification task. Additionally, we reveal that most Point-DAE variants unintentionally benefit from the manually annotated canonical poses in the pre-training dataset. To tackle this issue, we promote a new dataset setting by estimating object poses automatically. Code will be available at https://github.com/YBZh/Point-DAE.
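
As an example of the corruption-reconstruction recipe, the sketch below applies a random affine corruption (z-axis rotation plus anisotropic scaling) to a point cloud; the exact corruption families and parameter ranges studied in the paper may differ.

```python
import math
import torch

def affine_corrupt(points: torch.Tensor,
                   scale_range=(0.6, 1.4), max_angle=math.pi):
    """Example affine corruption for a Point-DAE-style objective: a random
    rotation about the z-axis followed by anisotropic scaling.

    points: (N, 3) point cloud; returns the corrupted (N, 3) cloud.
    """
    theta = (torch.rand(()).item() * 2 - 1) * max_angle
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rot = points.new_tensor([[cos_t, -sin_t, 0.0],
                             [sin_t,  cos_t, 0.0],
                             [0.0,    0.0,   1.0]])
    scale = torch.empty(3).uniform_(*scale_range)
    return (points @ rot.T) * scale

# Training objective (sketch): an encoder-decoder reconstructs the clean cloud
# from the corrupted one, e.g. loss = chamfer(decoder(encoder(affine_corrupt(pc))), pc)
```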

* Codes will be available at https://github.com/YBZh/Point-DAE 

Look Back and Forth: Video Super-Resolution with Explicit Temporal Difference Modeling

Apr 14, 2022
Takashi Isobe, Xu Jia, Xin Tao, Changlin Li, Ruihuang Li, Yongjie Shi, Jing Mu, Huchuan Lu, Yu-Wing Tai

Temporal modeling is crucial for video super-resolution. Most video super-resolution methods adopt optical flow or deformable convolution for explicit motion compensation. However, such temporal modeling techniques increase model complexity and may fail in cases of occlusion or complex motion, resulting in serious distortion and artifacts. In this paper, we propose to explore the role of explicit temporal difference modeling in both LR and HR space. Instead of directly feeding consecutive frames into a VSR model, we compute the temporal difference between frames and divide the pixels into two subsets according to the level of difference. They are processed separately with two branches of different receptive fields in order to better extract complementary information. To further enhance the super-resolution result, not only are spatial residual features extracted, but the difference between consecutive frames in the high-frequency domain is also computed. This allows the model to exploit intermediate SR results in both the future and the past to refine the current SR output. The differences at different time steps can be cached so that information from frames farther away in time can be propagated to the current frame for refinement. Experiments on several video super-resolution benchmark datasets demonstrate the effectiveness of the proposed method and its favorable performance against state-of-the-art methods.
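
A minimal sketch of the LR-space temporal-difference split: pixels are partitioned by the magnitude of the frame difference and routed to two branches. The threshold value and the simple masking are illustrative assumptions.

```python
import torch

def split_by_temporal_difference(prev: torch.Tensor, curr: torch.Tensor,
                                 thresh: float = 0.05):
    """Partition pixels of the current LR frame by how much they change
    relative to the previous frame (simplified sketch).

    prev, curr: (B, 3, H, W) frames in [0, 1].
    Returns the two masked frames plus the corresponding masks.
    """
    diff = (curr - prev).abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
    hv_mask = (diff > thresh).float()                      # fast-changing pixels
    lv_mask = 1.0 - hv_mask                                # nearly static pixels
    # Each branch then processes only "its" subset of pixels with a
    # receptive field suited to that level of motion.
    return curr * lv_mask, curr * hv_mask, lv_mask, hv_mask
```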

* CVPR 2022 

Exact Feature Distribution Matching for Arbitrary Style Transfer and Domain Generalization

Mar 25, 2022
Yabin Zhang, Minghan Li, Ruihuang Li, Kui Jia, Lei Zhang

Arbitrary style transfer (AST) and domain generalization (DG) are important yet challenging visual learning tasks, which can be cast as a feature distribution matching problem. Under the assumption of Gaussian feature distributions, conventional feature distribution matching methods usually match the mean and standard deviation of features. However, the feature distributions of real-world data are usually much more complicated than Gaussian and cannot be accurately matched using only first- and second-order statistics, while it is computationally prohibitive to use high-order statistics for distribution matching. In this work, we propose, for the first time to the best of our knowledge, to perform Exact Feature Distribution Matching (EFDM) by exactly matching the empirical Cumulative Distribution Functions (eCDFs) of image features, which can be implemented by applying Exact Histogram Matching (EHM) in the image feature space. In particular, a fast EHM algorithm, named Sort-Matching, is employed to perform EFDM in a plug-and-play manner with minimal cost. The effectiveness of the proposed EFDM method is verified on a variety of AST and DG tasks, demonstrating new state-of-the-art results. Code is available at https://github.com/YBZh/EFDM.
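
The Sort-Matching idea can be sketched in a few lines: the k-th smallest content value in each channel is replaced by the k-th smallest style value, so the empirical CDFs match exactly. The released implementation additionally uses a straight-through gradient trick that is omitted here.

```python
import torch

def sort_matching(content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
    """Exact histogram matching along the last dimension (Sort-Matching sketch).

    content, style: (B, C, N) features with spatial positions flattened into N.
    """
    _, content_rank_idx = torch.sort(content, dim=-1)   # positions ordered by value
    style_sorted, _ = torch.sort(style, dim=-1)
    matched = torch.empty_like(content)
    # Place the k-th smallest style value at the position of the
    # k-th smallest content value in every (batch, channel).
    matched.scatter_(-1, content_rank_idx, style_sorted)
    return matched

# Usage: stylized = sort_matching(content_feat.flatten(2), style_feat.flatten(2))
```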

* CVPR 2022 camera ready; code and supplementary material are available at https://github.com/YBZh/EFDM 

Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds

Mar 19, 2022
Chenhang He, Ruihuang Li, Shuai Li, Lei Zhang

The transformer has demonstrated promising performance in many 2D vision tasks. However, it is cumbersome to compute self-attention on large-scale point cloud data because a point cloud is a long sequence that is unevenly distributed in 3D space. To address this issue, existing methods usually compute self-attention locally by grouping points into clusters of the same size, or perform convolutional self-attention on a discretized representation. However, the former results in stochastic point dropout, while the latter typically has narrow attention fields. In this paper, we propose a novel voxel-based architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from point clouds by means of set-to-set translation. VoxSeT is built upon a voxel-based set attention (VSA) module, which reduces the self-attention in each voxel to two cross-attentions and models features in a hidden space induced by a group of latent codes. With the VSA module, VoxSeT can manage voxelized point clusters of arbitrary size in a wide range and process them in parallel with linear complexity. The proposed VoxSeT integrates the high performance of the transformer with the efficiency of voxel-based models, and can serve as a good alternative to convolutional and point-based backbones. VoxSeT reports competitive results on the KITTI and Waymo detection benchmarks. The source code can be found at https://github.com/skyhehe123/VoxSeT.
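
The latent-code attention pattern can be sketched with two standard cross-attentions: a small set of learned latents first aggregates the point features, and the points then read the aggregated features back, giving linear complexity in the number of points. The use of nn.MultiheadAttention and a padded per-batch point set are simplifications; the real VSA operates per voxel with scatter operations.

```python
import torch
import torch.nn as nn

class LatentSetAttention(nn.Module):
    """Sketch of attention routed through a small set of latent codes."""

    def __init__(self, dim: int = 64, num_latents: int = 8, num_heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.to_latent = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_points = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, C) point features.
        b = points.size(0)
        lat = self.latents.unsqueeze(0).expand(b, -1, -1)   # (B, K, C)
        # Cross-attention 1: latents gather information from all points.
        lat, _ = self.to_latent(lat, points, points)
        # Cross-attention 2: points read the aggregated latent features back.
        out, _ = self.to_points(points, lat, lat)
        return out
```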

* 11 pages, 4 figures, CVPR2022 

Class-Balanced Pixel-Level Self-Labeling for Domain Adaptive Semantic Segmentation

Mar 18, 2022
Ruihuang Li, Shuai Li, Chenhang He, Yabin Zhang, Xu Jia, Lei Zhang

Domain adaptive semantic segmentation aims to learn a model with the supervision of source domain data and produce satisfactory dense predictions on an unlabeled target domain. One popular solution to this challenging task is self-training, which selects high-scoring predictions on target samples as pseudo labels for training. However, the produced pseudo labels often contain much noise because the model is biased toward the source domain as well as the majority categories. To address these issues, we propose to directly explore the intrinsic pixel distributions of target domain data instead of heavily relying on the source domain. Specifically, we simultaneously cluster pixels and rectify pseudo labels with the obtained cluster assignments. This process is done in an online fashion so that the pseudo labels can co-evolve with the segmentation model without extra training rounds. To overcome the class imbalance problem on long-tailed categories, we employ a distribution alignment technique to enforce that the marginal class distribution of cluster assignments is close to that of the pseudo labels. The proposed method, namely Class-balanced Pixel-level Self-Labeling (CPSL), improves segmentation performance on the target domain over state-of-the-art methods by a large margin, especially on long-tailed categories.
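
A simplified sketch of the distribution alignment step: soft cluster-assignment probabilities are reweighted so that their batch marginal moves toward the class distribution of the current pseudo labels; the paper's online formulation may differ in detail.

```python
import torch

def align_cluster_assignments(probs: torch.Tensor, target_dist: torch.Tensor,
                              eps: float = 1e-8) -> torch.Tensor:
    """Simplified distribution alignment for per-pixel cluster assignments.

    probs:       (N, K) soft cluster-assignment probabilities for N pixels.
    target_dist: (K,) desired marginal class distribution (e.g., the class
                 frequencies of the current pseudo labels).
    """
    marginal = probs.mean(dim=0)                          # current (K,) marginal
    aligned = probs * (target_dist / (marginal + eps))    # reweight per cluster
    return aligned / aligned.sum(dim=1, keepdim=True).clamp_min(eps)
```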

* This paper has been accepted by CVPR 2022 