Qi She

PDO-s3DCNNs: Partial Differential Operator Based Steerable 3D CNNs

Aug 07, 2022
Zhengyang Shen, Tao Hong, Qi She, Jinwen Ma, Zhouchen Lin


Steerable models can provide very general and flexible equivariance by formulating equivariance requirements in the language of representation theory and feature fields, which has been recognized to be effective for many vision tasks. However, deriving steerable models for 3D rotations is much more difficult than in the 2D case, due to the more complicated mathematics of 3D rotations. In this work, we employ partial differential operators (PDOs) to model 3D filters and derive general steerable 3D CNNs, which we call PDO-s3DCNNs. We prove that the equivariant filters are subject to linear constraints, which can be solved efficiently under various conditions. As far as we know, PDO-s3DCNNs are the most general steerable CNNs for 3D rotations, in the sense that they cover all common subgroups of $SO(3)$ and their representations, whereas existing methods apply only to specific groups and representations. Extensive experiments show that our models preserve equivariance well in the discrete domain and outperform previous works on the SHREC'17 retrieval and ISBI 2012 segmentation tasks with low network complexity.
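The equivariance constraint at the heart of the paper can be checked numerically in the discrete domain. The sketch below, which assumes PyTorch and uses a rotation-invariant discrete Laplacian (a simple PDO) rather than the authors' learned PDO filters, verifies that convolving and then rotating by 90 degrees about an axis (an element of the octahedral subgroup of $SO(3)$) matches rotating and then convolving:

```python
# Minimal numerical equivariance check; NOT the authors' PDO-s3DCNN construction.
import torch
import torch.nn.functional as F

# Discrete 3D Laplacian stencil: isotropic, hence trivially rotation-equivariant.
lap = torch.zeros(1, 1, 3, 3, 3)
lap[0, 0, 1, 1, 1] = -6.0
for d, i in [(2, 0), (2, 2), (3, 0), (3, 2), (4, 0), (4, 2)]:
    idx = [0, 0, 1, 1, 1]
    idx[d] = i                      # one of the six face neighbors
    lap[tuple(idx)] = 1.0

x = torch.randn(1, 1, 16, 16, 16)   # random 3D feature map

def rot90_depth_axis(t):
    # 90-degree rotation of the H-W plane, i.e. about the depth axis.
    return torch.rot90(t, k=1, dims=(3, 4))

out_then_rot = rot90_depth_axis(F.conv3d(x, lap, padding=1))
rot_then_out = F.conv3d(rot90_depth_axis(x), lap, padding=1)

# Equivariance: rotate-then-filter equals filter-then-rotate.
print(torch.allclose(out_then_rot, rot_then_out, atol=1e-5))  # True
```

For a general steerable filter, this commutation requirement becomes the linear constraint that the paper solves for each subgroup of $SO(3)$ and each choice of representation.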

* Accepted by ICML 2022 

On Learning Contrastive Representations for Learning with Noisy Labels

Mar 25, 2022
Li Yi, Sheng Liu, Qi She, A. Ian McLeod, Boyu Wang


Deep neural networks easily memorize noisy labels when trained with a softmax cross-entropy (CE) loss. Previous studies that attempted to address this issue focus on incorporating a noise-robust loss function into the CE loss. However, the memorization issue is only alleviated and still persists because of the non-robust CE loss. To address this issue, we focus on learning robust contrastive representations of the data, on which it is hard for the classifier to memorize the label noise under the CE loss. We propose a novel contrastive regularization function to learn such representations over noisy data, where label noise does not dominate the representation learning. By theoretically investigating the representations induced by the proposed regularization function, we reveal that the learned representations retain information related to true labels and discard information related to corrupted labels. Moreover, our theoretical results also indicate that the learned representations are robust to label noise. The effectiveness of this method is demonstrated with experiments on benchmark datasets.
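As a rough illustration of the training recipe described above, the sketch below combines the standard CE loss on (possibly noisy) labels with a label-free contrastive term computed on two augmented views. The InfoNCE-style regularizer and the hyperparameters `lam` and `tau` are generic stand-ins, not the exact regularization function proposed in the paper:

```python
# Hedged sketch: CE on noisy labels + a generic contrastive regularizer.
import torch
import torch.nn.functional as F

def contrastive_regularizer(z1, z2, tau=0.5):
    """SimCLR-style InfoNCE between two views z1, z2 of shape (B, D)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                    # (2B, D)
    sim = z @ z.t() / tau                             # pairwise similarities
    sim.fill_diagonal_(float('-inf'))                 # exclude self-pairs
    B = z1.size(0)
    # The positive for view i is the other view of the same image.
    targets = torch.cat([torch.arange(B) + B, torch.arange(B)]).to(z.device)
    return F.cross_entropy(sim, targets)

def total_loss(logits, noisy_labels, z1, z2, lam=1.0):
    # The contrastive term uses no labels, so label noise cannot dominate
    # the representation learning.
    return F.cross_entropy(logits, noisy_labels) + lam * contrastive_regularizer(z1, z2)
```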


Weakly Supervised Object Localization as Domain Adaption

Mar 25, 2022
Lei Zhu, Qi She, Qian Chen, Yunfei You, Boyu Wang, Yanye Lu


Weakly supervised object localization (WSOL) focuses on localizing objects only with the supervision of image-level classification masks. Most previous WSOL methods follow the classification activation map (CAM), which localizes objects based on the classification structure with the multi-instance learning (MIL) mechanism. However, the MIL mechanism makes CAM activate only discriminative object parts rather than the whole object, weakening its localization performance. To avoid this problem, this work provides a novel perspective that models WSOL as a domain adaption (DA) task, where the score estimator trained on the source/image domain is tested on the target/pixel domain to locate objects. Under this perspective, a DA-WSOL pipeline is designed to better engage DA approaches in WSOL and enhance localization performance. It uses a proposed target sampling strategy to select different types of target samples, and based on these target samples, a domain adaption localization (DAL) loss is elaborated. The DAL loss aligns the feature distributions of the two domains via DA and makes the estimator perceive target-domain cues via Universum regularization. Experiments show that our pipeline outperforms SOTA methods on multiple benchmarks. Code is released at https://github.com/zh460045050/DA-WSOL_CVPR2022.
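The domain-adaptation view can be sketched as a classification loss on pooled image-level (source) features plus an alignment term coupling them to pixel-level (target) features. In the hedged example below, a linear-kernel MMD stands in for the paper's DAL loss, and `sample_target_pixels` is a simplified, hypothetical version of the target sampling strategy:

```python
# Hedged sketch of "WSOL as domain adaptation"; not the exact DA-WSOL pipeline.
import torch
import torch.nn.functional as F

def sample_target_pixels(feat_map, num=64):
    # feat_map: (B, C, H, W) -> randomly sampled pixel-level features (B*num, C).
    B, C, H, W = feat_map.shape
    flat = feat_map.flatten(2).permute(0, 2, 1).reshape(-1, C)   # (B*H*W, C)
    idx = torch.randint(0, flat.size(0), (B * num,))
    return flat[idx]

def mmd_linear(src, tgt):
    # Simple linear-kernel MMD between the two feature sets.
    return (src.mean(0) - tgt.mean(0)).pow(2).sum()

def da_wsol_loss(feat_map, classifier, labels, lam=0.1):
    # classifier: e.g. nn.Linear(C, num_classes) applied to pooled features.
    source = feat_map.mean(dim=(2, 3))            # image-level (source) features
    target = sample_target_pixels(feat_map)       # pixel-level (target) features
    cls_loss = F.cross_entropy(classifier(source), labels)
    return cls_loss + lam * mmd_linear(source, target)
```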

* Accepted by CVPR 2022 

Background-aware Classification Activation Map for Weakly Supervised Object Localization

Dec 29, 2021
Lei Zhu, Qi She, Qian Chen, Xiangxi Meng, Mufeng Geng, Lujia Jin, Zhe Jiang, Bin Qiu, Yunfei You, Yibao Zhang, Qiushi Ren, Yanye Lu


Weakly supervised object localization (WSOL) relaxes the requirement of dense annotations for object localization by using image-level classification masks to supervise its learning process. However, current WSOL methods suffer from excessive activation of background locations and need post-processing to obtain the localization mask. This paper attributes these issues to the unawareness of background cues and proposes the background-aware classification activation map (B-CAM), which simultaneously learns localization scores of both object and background with only image-level labels. In our B-CAM, two image-level features, aggregated from the pixel-level features of potential background and object locations, are used to purify the object feature from object-related background and to represent the feature of the pure-background sample, respectively. Then, based on these two features, both an object classifier and a background classifier are learned to determine the binary object localization mask. Our B-CAM can be trained in an end-to-end manner based on a proposed stagger classification loss, which not only improves object localization but also suppresses background activation. Experiments show that our B-CAM outperforms one-stage WSOL methods on the CUB-200, OpenImages and VOC2012 datasets.
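A minimal sketch of the aggregation idea, assuming PyTorch: a soft background attention map pools pixel features into a background-weighted and an object-weighted image-level feature, from which separate object and background classifiers are learned. The module layout below is an illustrative assumption, not the exact B-CAM architecture, and it omits the stagger classification loss:

```python
# Hedged sketch of background-aware feature aggregation; not the B-CAM paper's exact head.
import torch
import torch.nn as nn

class BackgroundAwareHead(nn.Module):
    def __init__(self, channels, num_classes):
        super().__init__()
        self.bg_attn = nn.Conv2d(channels, 1, kernel_size=1)  # background score map
        self.obj_cls = nn.Linear(channels, num_classes)       # object classifier
        self.bg_cls = nn.Linear(channels, 1)                   # background classifier

    def forward(self, feat):                                   # feat: (B, C, H, W)
        a = torch.sigmoid(self.bg_attn(feat))                  # (B, 1, H, W)
        # Weighted pooling into background- and object-related image-level features.
        bg_feat = (feat * a).sum(dim=(2, 3)) / a.sum(dim=(2, 3)).clamp(min=1e-6)
        obj_feat = (feat * (1 - a)).sum(dim=(2, 3)) / (1 - a).sum(dim=(2, 3)).clamp(min=1e-6)
        return self.obj_cls(obj_feat), self.bg_cls(bg_feat), a  # logits + attention map
```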


Learning from Temporal Gradient for Semi-supervised Action Recognition

Dec 06, 2021
Junfei Xiao, Longlong Jing, Lin Zhang, Ju He, Qi She, Zongwei Zhou, Alan Yuille, Yingwei Li


Semi-supervised video action recognition aims to enable deep neural networks to achieve remarkable performance even with very limited labeled data. However, existing methods are mainly transferred from current image-based methods (e.g., FixMatch). Without specifically utilizing temporal dynamics and inherent multimodal attributes, their results could be suboptimal. To better leverage the temporal information encoded in videos, this paper introduces the temporal gradient as an additional modality for more attentive feature extraction. Specifically, our method explicitly distills fine-grained motion representations from the temporal gradient (TG) and imposes consistency across the different modalities (i.e., RGB and TG). The performance of semi-supervised action recognition is significantly improved without additional computation or parameters during inference. Our method achieves state-of-the-art performance on three video action recognition benchmarks (i.e., Kinetics-400, UCF-101, and HMDB-51) under several typical semi-supervised settings (i.e., different ratios of labeled data).
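The two ingredients mentioned above are easy to sketch: the temporal gradient is simply a frame difference, and cross-modal consistency can be encouraged with a similarity loss between RGB and TG embeddings. In the example below, the cosine-based loss is a generic stand-in for the paper's distillation objective, and `rgb_encoder` / `tg_encoder` are assumed 3D backbones:

```python
# Hedged sketch: temporal gradient as a modality + cross-modal consistency.
import torch
import torch.nn.functional as F

def temporal_gradient(clip):
    # clip: (B, C, T, H, W) RGB frames -> (B, C, T-1, H, W) frame differences.
    return clip[:, :, 1:] - clip[:, :, :-1]

def cross_modal_consistency(rgb_encoder, tg_encoder, clip):
    rgb_feat = rgb_encoder(clip)                   # (B, D) clip-level embedding
    tg_feat = tg_encoder(temporal_gradient(clip))  # (B, D)
    # Encourage agreement between the two modalities (1 - cosine similarity).
    return 1.0 - F.cosine_similarity(rgb_feat, tg_feat, dim=1).mean()
```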


TEAM-Net: Multi-modal Learning for Video Action Recognition with Partial Decoding

Oct 17, 2021
Zhengwei Wang, Qi She, Aljosa Smolic


Most existing video action recognition models ingest raw RGB frames. However, the raw video stream requires enormous storage and contains significant temporal redundancy. Video compression (e.g., H.264, MPEG-4) reduces this superfluous information by representing the raw video stream with the concept of a Group of Pictures (GOP). Each GOP is composed of an initial I-frame (i.e., an RGB image) followed by a number of P-frames, represented by motion vectors and residuals, which can be regarded and used as pre-extracted features. In this work, we 1) introduce GOP-level sampling of the network input from partially decoded videos, and 2) propose a plug-and-play mulTi-modal lEArning Module (TEAM) for training the network with information from I-frames and P-frames in an end-to-end manner. We demonstrate the superior performance of TEAM-Net compared to the baseline using RGB only. TEAM-Net also achieves state-of-the-art performance for video action recognition with partial decoding. Code is provided at https://github.com/villawang/TEAM-Net.
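A hedged sketch of the fusion step: per-GOP features from the I-frame, motion-vector, and residual branches are combined before clip-level classification. Concatenation followed by an MLP is a simple stand-in for the TEAM module, the per-modality encoders are assumed to exist, and actual partial decoding of H.264/MPEG-4 streams is outside the scope of this sketch:

```python
# Hedged sketch of GOP-level multi-modal fusion; not the exact TEAM module.
import torch
import torch.nn as nn

class GOPFusionHead(nn.Module):
    def __init__(self, dim_i, dim_mv, dim_res, num_classes, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(dim_i + dim_mv + dim_res, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, f_iframe, f_mv, f_res):
        # Each input: (B, G, D_m) features for G sampled GOPs; fuse the three
        # modalities per GOP, then average GOP predictions for the clip.
        fused = torch.cat([f_iframe, f_mv, f_res], dim=-1)  # (B, G, sum of dims)
        logits = self.fuse(fused)                            # (B, G, num_classes)
        return logits.mean(dim=1)                            # clip-level prediction
```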

* To appear in BMVC 2021 