Chen Qian

3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal

Jul 22, 2022
Hao Meng, Sheng Jin, Wentao Liu, Chen Qian, Mengxiang Lin, Wanli Ouyang, Ping Luo

Estimating 3D interacting hand pose from a single RGB image is essential for understanding human actions. Unlike most previous works that directly predict the 3D poses of two interacting hands simultaneously, we propose to decompose the challenging interacting hand pose estimation task and estimate the pose of each hand separately. In this way, it is straightforward to take advantage of the latest research progress on the single-hand pose estimation system. However, hand pose estimation in interacting scenarios is very challenging, due to (1) severe hand-hand occlusion and (2) ambiguity caused by the homogeneous appearance of hands. To tackle these two challenges, we propose a novel Hand De-occlusion and Removal (HDR) framework to perform hand de-occlusion and distractor removal. We also propose the first large-scale synthetic amodal hand dataset, termed Amodal InterHand Dataset (AIH), to facilitate model training and promote the development of the related research. Experiments show that the proposed method significantly outperforms previous state-of-the-art interacting hand pose estimation approaches. Codes and data are available at https://github.com/MengHao666/HDR.

* ECCV 2022

Pose for Everything: Towards Category-Agnostic Pose Estimation

Jul 21, 2022
Lumin Xu, Sheng Jin, Wang Zeng, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang

Existing works on 2D pose estimation mainly focus on a certain category, e.g., human, animal, or vehicle. However, many application scenarios require detecting the poses/keypoints of unseen classes of objects. In this paper, we introduce the task of Category-Agnostic Pose Estimation (CAPE), which aims to create a pose estimation model capable of detecting the pose of any class of object given only a few samples with keypoint definitions. To achieve this goal, we formulate the pose estimation problem as a keypoint matching problem and design a novel CAPE framework, termed POse Matching Network (POMNet). A transformer-based Keypoint Interaction Module (KIM) is proposed to capture both the interactions among different keypoints and the relationship between the support and query images. We also introduce the Multi-category Pose (MP-100) dataset, a 2D pose dataset of 100 object categories containing over 20K instances that is well designed for developing CAPE algorithms. Experiments show that our method outperforms other baseline approaches by a large margin. Code and data are available at https://github.com/luminxu/Pose-for-Everything.

* ECCV 2022 Oral
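
The abstract above frames pose estimation as keypoint matching: a keypoint defined on a support image is located in the query image by finding the query location whose feature matches it best. The following is a minimal sketch of that idea only, not POMNet itself. The cosine-similarity scoring, the function names `cosine` and `match_keypoints`, the keypoint names, and the toy feature values are all illustrative assumptions; the transformer-based KIM is not modeled here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def match_keypoints(support_feats, query_feats):
    """For each support keypoint feature, return the (row, col) of the
    best-matching location in the query feature grid.

    support_feats: dict mapping a (hypothetical) keypoint name to a feature vector
    query_feats:   2D grid (list of rows) of feature vectors
    """
    matches = {}
    for name, feat in support_feats.items():
        best_score, best_pos = -2.0, None  # cosine similarity is always >= -1
        for r, row in enumerate(query_feats):
            for c, query_feat in enumerate(row):
                score = cosine(feat, query_feat)
                if score > best_score:
                    best_score, best_pos = score, (r, c)
        matches[name] = best_pos
    return matches

# Toy 2x2 query feature map with 3-dimensional features (made-up values).
query = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]],
]
support = {"kp_a": [0.9, 0.1, 0.0], "kp_b": [0.0, 0.1, 0.9]}
result = match_keypoints(support, query)
```

In this toy setup `kp_a` is matched to the top-left cell and `kp_b` to the bottom-left cell, because those cells hold the most similar feature vectors.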

Dual Adaptive Transformations for Weakly Supervised Point Cloud Segmentation

Jul 19, 2022
Zhonghua Wu, Yicheng Wu, Guosheng Lin, Jianfei Cai, Chen Qian

Weakly supervised point cloud segmentation, i.e., semantically segmenting a point cloud with only a few labeled points in the whole 3D scene, is highly desirable due to the heavy burden of collecting abundant dense annotations for model training. However, it remains challenging for existing methods to accurately segment 3D point clouds, since limited annotated data may lead to insufficient guidance for label propagation to unlabeled data. Considering that smoothness-based methods have achieved promising progress, in this paper we advocate applying the consistency constraint under various perturbations to effectively regularize unlabeled 3D points. Specifically, we propose a novel DAT (Dual Adaptive Transformations) model for weakly supervised point cloud segmentation, where the dual adaptive transformations are performed via an adversarial strategy at both point level and region level, aiming at enforcing the local and structural smoothness constraints on 3D point clouds. We evaluate our proposed DAT model with two popular backbones on the large-scale S3DIS and ScanNet-V2 datasets. Extensive experiments demonstrate that our model can effectively leverage the unlabeled 3D points and achieve significant performance gains on both datasets, setting new state-of-the-art performance for weakly supervised point cloud segmentation.

* ECCV 2022
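
The consistency constraint mentioned in the abstract above penalizes a model when its prediction for a point changes after the input is perturbed. A minimal sketch of such a loss follows, assuming a KL-divergence penalty between per-point class distributions. The choice of KL divergence, the function names, and the toy logits are assumptions made for illustration; the adversarial point-level and region-level transformations of DAT are not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(logits_clean, logits_perturbed):
    """Mean KL divergence between per-point predictions on the original
    and the perturbed point cloud (one logit vector per point)."""
    losses = [
        kl_divergence(softmax(a), softmax(b))
        for a, b in zip(logits_clean, logits_perturbed)
    ]
    return sum(losses) / len(losses)

# Two unlabeled points, three classes (made-up logits).
clean = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
same = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
shifted = [[0.5, 2.0, 0.1], [0.2, 1.5, 0.3]]

loss_same = consistency_loss(clean, same)        # identical predictions
loss_shifted = consistency_loss(clean, shifted)  # first point disagrees
```

The loss is zero when the perturbed predictions agree with the clean ones and grows as they diverge, which is what lets unlabeled points contribute a training signal.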

Structure-aware Editable Morphable Model for 3D Facial Detail Animation and Manipulation

Jul 19, 2022
Jingwang Ling, Zhibo Wang, Ming Lu, Quan Wang, Chen Qian, Feng Xu

Morphable models are essential for the statistical modeling of 3D faces. Previous works on morphable models mostly focus on large-scale facial geometry but ignore facial details. This paper augments morphable models in representing facial details by learning a Structure-aware Editable Morphable Model (SEMM). SEMM introduces a detail structure representation based on the distance field of wrinkle lines, jointly modeled with detail displacements to establish better correspondences and enable intuitive manipulation of wrinkle structure. Besides, SEMM introduces two transformation modules to translate expression blendshape weights and age values into changes in latent space, allowing effective semantic detail editing while maintaining identity. Extensive experiments demonstrate that the proposed model compactly represents facial details, outperforms previous methods in expression animation qualitatively and quantitatively, and achieves effective age editing and wrinkle line editing of facial details. Code and model are available at https://github.com/gerwang/facial-detail-manipulation.

* ECCV 2022

ScaleNet: Searching for the Model to Scale

Jul 15, 2022
Jiyang Xie, Xiu Su, Shan You, Zhanyu Ma, Fei Wang, Chen Qian

Recently, the community has paid increasing attention to model scaling and contributed to developing model families with a wide spectrum of scales. Current methods either simply resort to one-shot NAS to construct a non-structural and non-scalable model family, or rely on a manual yet fixed scaling strategy to scale a base model that is not necessarily the best. In this paper, we bridge these two components and propose ScaleNet to jointly search the base model and the scaling strategy so that the scaled large model can have more promising performance. Concretely, we design a super-supernet to embody models with a wide spectrum of sizes (e.g., FLOPs). Then, the scaling strategy can be learned interactively with the base model via a Markov chain-based evolution algorithm and generalized to develop even larger models. To obtain a decent super-supernet, we design a hierarchical sampling strategy to enhance its training sufficiency and alleviate the disturbance. Experimental results show that our scaled networks enjoy significant performance superiority at various FLOPs, with at least a 2.53x reduction in search cost. Code is available at https://github.com/luminolx/ScaleNet.

* Accepted by ECCV 2022

LightViT: Towards Light-Weight Convolution-Free Vision Transformers

Jul 12, 2022
Tao Huang, Lang Huang, Shan You, Fei Wang, Chen Qian, Chang Xu

Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs) due to the lack of inductive bias. Recent works thus resort to convolutions as a plug-and-play module and embed them in various ViT counterparts. In this paper, we argue that the convolutional kernels perform information aggregation to connect all tokens; however, they would actually be unnecessary for light-weight ViTs if this explicit aggregation could function in a more homogeneous way. Inspired by this, we present LightViT, a new family of light-weight ViTs that achieves a better accuracy-efficiency balance using pure transformer blocks without convolution. Concretely, we introduce a global yet efficient aggregation scheme into both the self-attention and the feed-forward network (FFN) of ViTs, where additional learnable tokens are introduced to capture global dependencies, and bi-dimensional channel and spatial attentions are imposed over token embeddings. Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks. For example, our LightViT-T achieves 78.7% accuracy on ImageNet with only 0.7G FLOPs, outperforming PVTv2-B0 by 8.2% while being 11% faster on GPU. Code is available at https://github.com/hunto/LightViT.

* 13 pages, 7 figures, 9 tables

HEAD: HEtero-Assists Distillation for Heterogeneous Object Detectors

Jul 12, 2022
Luting Wang, Xiaojie Li, Yue Liao, Zeren Jiang, Jianlong Wu, Fei Wang, Chen Qian, Si Liu

Conventional knowledge distillation (KD) methods for object detection mainly concentrate on homogeneous teacher-student detectors. However, the design of a lightweight detector for deployment is often significantly different from that of a high-capacity detector. Thus, we investigate KD among heterogeneous teacher-student pairs for wider applicability. We observe that the core difficulty for heterogeneous KD (hetero-KD) is the significant semantic gap between the backbone features of heterogeneous detectors due to their different optimization manners. Conventional homogeneous KD (homo-KD) methods suffer from such a gap and struggle to directly obtain satisfactory performance for hetero-KD. In this paper, we propose the HEtero-Assists Distillation (HEAD) framework, leveraging heterogeneous detection heads as assistants to guide the optimization of the student detector and reduce this gap. In HEAD, the assistant is an additional detection head, with an architecture homogeneous to the teacher head, attached to the student backbone. Thus, a hetero-KD is transformed into a homo-KD, allowing efficient knowledge transfer from the teacher to the student. Moreover, we extend HEAD into a Teacher-Free HEAD (TF-HEAD) framework for when a well-trained teacher detector is unavailable. Our method achieves significant improvement compared to current detection KD methods. For example, on the MS-COCO dataset, TF-HEAD helps R18 RetinaNet achieve 33.9 mAP (+2.2), while HEAD further pushes the limit to 36.2 mAP (+4.5).

* ECCV 2022, Code: https://github.com/LutingWang/HEAD

Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach

Jun 30, 2022
Jiaqi Tang, Zhaoyang Liu, Jing Tan, Chen Qian, Wayne Wu, Limin Wang

Generic event boundary detection (GEBD) is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries. In this paper, we present a local context modeling and global boundary decoding approach for the GEBD task. A local context modeling sub-network is proposed to perceive diverse patterns of generic event boundaries, and it generates powerful video representations and reliable boundary confidence. Based on them, a global boundary decoding sub-network is exploited to decode event boundaries from a global view. Our proposed method achieves an 85.13% F1-score on the Kinetics-GEBD testing set, a boost of more than 22% in F1-score over the baseline method. The code is available at https://github.com/JackyTown/GEBD_Challenge_CVPR2022.

* arXiv admin note: text overlap with arXiv:2112.04771

Masked Distillation with Receptive Tokens

May 29, 2022
Tao Huang, Yuan Zhang, Shan You, Fei Wang, Chen Qian, Jian Cao, Chang Xu

Distilling from the feature maps can be fairly effective for dense prediction tasks since both the feature discriminability and localization priors can be well transferred. However, not every pixel contributes equally to the performance, and a good student should learn from what really matters to the teacher. In this paper, we introduce a learnable embedding dubbed receptive token to localize those pixels of interest (PoIs) in the feature map, with a distillation mask generated via pixel-wise attention. Then the distillation is performed on the mask via pixel-wise reconstruction. In this way, a distillation mask actually indicates a pattern of pixel dependencies within the feature maps of the teacher. We thus adopt multiple receptive tokens to investigate more sophisticated and informative pixel dependencies to further enhance the distillation. To obtain a group of masks, the receptive tokens are learned via the regular task loss but with the teacher fixed, and we also leverage a Dice loss to enrich the diversity of learned masks. Our method, dubbed MasKD, is simple and practical, and requires no task priors in application. Experiments show that our MasKD can achieve state-of-the-art performance consistently on object detection and semantic segmentation benchmarks. Code is available at https://github.com/hunto/MasKD.
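
The abstract above mentions a Dice loss used to enrich the diversity of the learned masks. One plausible reading, sketched below, is that overlap between pairs of masks is measured with a soft Dice coefficient and then penalized. This is an illustrative assumption, not the MasKD implementation: the squared-sum denominator is one common Dice variant, and the function names and toy masks are hypothetical.

```python
def dice_coefficient(mask_a, mask_b, eps=1e-6):
    """Soft Dice coefficient between two flattened masks with values in [0, 1].

    Uses the squared-sum denominator variant; eps avoids division by zero.
    """
    intersection = sum(a * b for a, b in zip(mask_a, mask_b))
    denom = sum(a * a for a in mask_a) + sum(b * b for b in mask_b)
    return (2.0 * intersection + eps) / (denom + eps)

def diversity_penalty(masks):
    """Average pairwise Dice coefficient; lower means less overlapping masks."""
    pairs = [
        dice_coefficient(masks[i], masks[j])
        for i in range(len(masks))
        for j in range(i + 1, len(masks))
    ]
    return sum(pairs) / len(pairs)

# Flattened 2x2 masks (made-up values).
overlapping = [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
disjoint = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

penalty_overlapping = diversity_penalty(overlapping)  # identical masks
penalty_disjoint = diversity_penalty(disjoint)        # no shared pixels
```

Identical masks give a penalty of 1, while masks that cover different pixels give a penalty near 0, so minimizing this quantity pushes the masks toward covering different regions.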

Green Hierarchical Vision Transformer for Masked Image Modeling

May 26, 2022
Lang Huang, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, Toshihiko Yamasaki

We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), e.g., Swin Transformer, allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of two key components. First, for the window attention, we design a Group Window Attention scheme following the divide-and-conquer strategy. To mitigate the quadratic complexity of self-attention w.r.t. the number of patches, group attention encourages a uniform partition in which the visible patches within each local window of arbitrary size are grouped into groups of equal size, and masked self-attention is then performed within each group. Second, we further improve the grouping strategy via a dynamic programming algorithm to minimize the overall computation cost of the attention on the grouped patches. As a result, MIM can now work on hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs about 2.7× faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and superiority on downstream COCO object detection benchmarks. Code and pre-trained models have been made publicly available at https://github.com/LayneH/GreenMIM.

* 16 pages, 7 figures, 3 tables, 3 algorithms
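
The abstract above attributes the savings to two effects: attending only over visible patches, and restricting attention to groups of equal size. A back-of-the-envelope sketch of both follows, counting query-key pairs as a proxy for attention cost. The patch count, masking ratio, and group size are assumed values chosen for illustration, and the sketch ignores the window structure and the dynamic-programming grouping, so it does not reproduce the reported 2.7× speedup.

```python
def attention_pairs(num_tokens):
    """Number of query-key pairs in full self-attention over num_tokens."""
    return num_tokens * num_tokens

def grouped_attention_pairs(num_tokens, group_size):
    """Pairs when tokens are split into groups of at most group_size and
    attention is computed only within each group."""
    full_groups, remainder = divmod(num_tokens, group_size)
    return full_groups * attention_pairs(group_size) + attention_pairs(remainder)

total_patches = 196  # assumed: a 14x14 patch grid
mask_ratio = 0.75    # assumed masking ratio
visible = total_patches - int(total_patches * mask_ratio)  # 49 visible patches

cost_all = attention_pairs(total_patches)           # attend over every patch
cost_visible = attention_pairs(visible)             # attend over visible patches only
cost_grouped = grouped_attention_pairs(visible, 7)  # 7 groups of 7 visible patches
```

With these assumed numbers the pair count drops from 38416 to 2401 when masked patches are discarded, and to 343 when the visible patches are further split into groups of 7.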
