Guannan Jiang

Pseudo-label Alignment for Semi-supervised Instance Segmentation

Aug 10, 2023
Jie Hu, Chen Chen, Liujuan Cao, Shengchuan Zhang, Annan Shu, Guannan Jiang, Rongrong Ji

Pseudo-labeling is significant for semi-supervised instance segmentation: it generates instance masks and classes from unannotated images for subsequent training. However, in existing pipelines, pseudo-labels that contain valuable information may be filtered out directly due to mismatches in class and mask quality. To address this issue, we propose a novel framework called pseudo-label aligning instance segmentation (PAIS). In PAIS, we devise a dynamic aligning loss (DALoss) that adjusts the weights of semi-supervised loss terms with varying class and mask score pairs. Through extensive experiments on the COCO and Cityscapes datasets, we demonstrate that PAIS is a promising framework for semi-supervised instance segmentation, particularly when labeled data is severely limited. Notably, with just 1% labeled data, PAIS achieves 21.2 mAP (based on Mask-RCNN) and 19.9 mAP (based on K-Net) on the COCO dataset, outperforming the current state-of-the-art model, i.e., NoisyBoundary at 7.7 mAP, by a margin of over 12 points. Code is available at: https://github.com/hujiecpp/PAIS.
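
The dynamic aligning loss is only described at a high level in the abstract; the following is a minimal sketch of the underlying idea, assuming a simple scheme in which each pseudo-instance's classification and mask loss terms are re-weighted by the confidence of the other branch rather than being discarded on a mismatch. The function name and exact weighting are assumptions, not the released PAIS implementation.

```python
import torch

def dynamic_aligning_loss(cls_loss, mask_loss, cls_scores, mask_scores):
    """Sketch of a DALoss-style weighting (an assumption, not the official PAIS code).

    cls_loss, mask_loss: per-pseudo-instance loss terms, shape (N,)
    cls_scores, mask_scores: confidence scores of the pseudo-labels, shape (N,)
    Instead of filtering out pseudo-labels whose class and mask qualities disagree,
    each loss term is re-weighted by the score of the other branch, so a confident
    mask can still supervise classification and vice versa.
    """
    w_cls = mask_scores.detach()   # trust the class target as much as its mask
    w_mask = cls_scores.detach()   # trust the mask target as much as its class
    loss = (w_cls * cls_loss).sum() / w_cls.sum().clamp(min=1e-6) \
         + (w_mask * mask_loss).sum() / w_mask.sum().clamp(min=1e-6)
    return loss

# toy usage with random per-instance losses and scores
cls_loss, mask_loss = torch.rand(8), torch.rand(8)
cls_scores, mask_scores = torch.rand(8), torch.rand(8)
print(dynamic_aligning_loss(cls_loss, mask_loss, cls_scores, mask_scores))
```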

* ICCV 2023 

Improving Human-Object Interaction Detection via Virtual Image Learning

Aug 04, 2023
Shuman Fang, Shuai Liu, Jie Li, Guannan Jiang, Xianming Lin, Rongrong Ji

Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects, which plays a crucial role in high-level semantic understanding tasks. However, most works pursue designing better architectures to learn overall features more efficiently, while ignoring the long-tail nature of interaction-object pair categories. In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL). Firstly, a novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset whose distribution is consistent with that of real images. In this stage, virtual images are generated from prompts with specific characterizations and selected by a multi-filtering process. Secondly, we use both virtual and real images to train the model within a teacher-student framework. Considering that the initial labels of some virtual images are inaccurate and inadequate, we devise an Adaptive Matching-and-Filtering (AMF) module to construct pseudo-labels. Our method is independent of the internal structure of HOI detectors, so it can be combined with off-the-shelf methods by training for merely 10 additional epochs. With the assistance of our method, multiple methods obtain significant improvements, and new state-of-the-art results are achieved on two benchmarks.
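
The abstract does not spell out how the AMF module selects pseudo-labels; as a rough illustration of adaptive filtering (an assumption, not the paper's implementation), the sketch below keeps, per predicted interaction class, only the top-scoring fraction of teacher predictions on a virtual image instead of applying one fixed global threshold.

```python
import torch

def adaptive_match_and_filter(pred_scores, pred_labels, keep_ratio=0.5):
    """Sketch of an AMF-style pseudo-label filter (illustrative only).

    pred_scores: (N,) confidence of teacher predictions on a virtual image
    pred_labels: (N,) predicted interaction classes
    Returns a boolean mask selecting the per-class top `keep_ratio` predictions.
    """
    keep = torch.zeros_like(pred_scores, dtype=torch.bool)
    for c in pred_labels.unique():
        idx = (pred_labels == c).nonzero(as_tuple=True)[0]
        k = max(1, int(keep_ratio * idx.numel()))
        top = pred_scores[idx].topk(k).indices
        keep[idx[top]] = True
    return keep

scores = torch.rand(10)
labels = torch.randint(0, 3, (10,))
print(adaptive_match_and_filter(scores, labels))
```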

* Accepted by ACM MM 2023 

Approximated Prompt Tuning for Vision-Language Pre-trained Models

Jun 27, 2023
Qiong Wu, Shubin Huang, Yiyi Zhou, Pingyang Dai, Annan Shu, Guannan Jiang, Rongrong Ji

Prompt tuning is a parameter-efficient way to deploy large-scale pre-trained models to downstream tasks by adding task-specific tokens. For vision-language pre-trained (VLP) models, prompt tuning often requires a large number of learnable tokens to bridge the gap between pre-training and downstream tasks, which greatly exacerbates the already high computational overhead. In this paper, we revisit the principle of prompt tuning for Transformer-based VLP models and reveal that the impact of soft prompt tokens can actually be approximated via independent information diffusion steps, thereby avoiding expensive global attention modeling and reducing the computational complexity to a large extent. Based on this finding, we propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning. To validate APT, we apply it to two representative VLP models, namely ViLT and METER, and conduct extensive experiments on a range of downstream tasks. Meanwhile, the generalization of APT is also validated on CLIP for image classification. The experimental results not only show the superior performance gains and computational efficiency of APT over conventional prompt tuning methods, e.g., +6.6% accuracy and -64.62% additional computation overhead on METER, but also confirm its merits over other parameter-efficient transfer learning approaches.
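
To make the "independent information diffusion" intuition concrete, here is a minimal sketch under an assumed single-head formulation: rather than concatenating prompt tokens to the input sequence and attending over the longer sequence, the prompts' contribution is computed as a separate, cheaper cross-attention term and added to the ordinary self-attention output. The function and the additive combination are illustrative assumptions; APT's actual approximation may differ.

```python
import torch
import torch.nn.functional as F

def approximated_prompt_attention(x, prompts, wq, wk, wv):
    """Sketch of approximating prompt influence outside the global attention.

    x: (B, N, D) input tokens; prompts: (P, D) learnable soft prompts
    wq, wk, wv: (D, D) shared projection matrices (single head for brevity)
    Note: summing two separate attentions is not identical to attention over the
    concatenated sequence; that gap is exactly what an approximation trades away.
    """
    q = x @ wq
    k, v = x @ wk, x @ wv
    self_out = F.scaled_dot_product_attention(q, k, v)                     # O(N^2)

    pk, pv = prompts @ wk, prompts @ wv                                    # shared projections
    prompt_out = F.scaled_dot_product_attention(
        q, pk.expand(x.size(0), -1, -1), pv.expand(x.size(0), -1, -1))     # O(N*P)
    return self_out + prompt_out

B, N, P, D = 2, 16, 4, 32
x, prompts = torch.randn(B, N, D), torch.randn(P, D)
wq, wk, wv = (torch.randn(D, D) for _ in range(3))
print(approximated_prompt_attention(x, prompts, wq, wk, wv).shape)
```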


InterFormer: Real-time Interactive Image Segmentation

Apr 06, 2023
You Huang, Hao Yang, Ke Sun, Shengchuan Zhang, Guannan Jiang, Rongrong Ji, Liujuan Cao

Interactive image segmentation enables annotators to efficiently perform pixel-level annotation for segmentation tasks. However, the existing interactive segmentation pipeline suffers from inefficient computation of interactive models for two reasons. First, each of the annotator's later clicks depends on the model's feedback to the former clicks; this serial interaction cannot exploit the model's parallelism. Second, at each interaction step the model has to reprocess the image, the annotator's current click, and its feedback to the former clicks, resulting in redundant computation. For efficient computation, we propose a method named InterFormer that follows a new pipeline to address these issues. InterFormer extracts and preprocesses the computationally time-consuming part, i.e., image processing, from the existing pipeline. Specifically, InterFormer employs a large vision transformer (ViT) on high-performance devices to preprocess images in parallel, and then uses a lightweight module called interactive multi-head self-attention (I-MSA) for interactive segmentation. Furthermore, deploying the I-MSA module on low-power devices extends the practical applicability of interactive segmentation. The I-MSA module uses the preprocessed features to respond to annotator inputs in real time. Experiments on several datasets demonstrate the effectiveness of InterFormer, which outperforms previous interactive segmentation models in computational efficiency and segmentation quality, achieving real-time high-quality interactive segmentation on CPU-only devices.
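
The key structural point is the split between a heavy, click-independent preprocessing stage and a lightweight per-click stage. The sketch below illustrates that split with placeholder modules; the class name, layer sizes, and the simple click encoding are assumptions, not the released InterFormer code.

```python
import torch
import torch.nn as nn

class InterFormerSketch(nn.Module):
    """Minimal two-stage pipeline: preprocess once, interact cheaply per click."""

    def __init__(self, dim=256):
        super().__init__()
        # heavy, click-independent image encoder (stand-in for the large ViT)
        self.encoder = nn.Sequential(nn.Conv2d(3, dim, 16, stride=16), nn.GELU())
        # lightweight interactive head consuming cached features + a click map
        self.click_proj = nn.Conv2d(1, dim, 16, stride=16)
        self.head = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
                                  nn.Conv2d(dim, 1, 1))

    @torch.no_grad()
    def preprocess(self, image):                  # run once per image
        return self.encoder(image)

    def interact(self, cached_feats, click_map):  # run once per click
        return self.head(cached_feats + self.click_proj(click_map))

model = InterFormerSketch()
image = torch.randn(1, 3, 256, 256)
feats = model.preprocess(image)                   # expensive, done once
clicks = torch.zeros(1, 1, 256, 256); clicks[..., 120, 130] = 1.0
mask_logits = model.interact(feats, clicks)       # fast, repeated per click
print(mask_logits.shape)
```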


X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance

Mar 28, 2023
Yiwei Ma, Xiaoqing Zhang, Xiaoshuai Sun, Jiayi Ji, Haowei Wang, Guannan Jiang, Weilin Zhuang, Rongrong Ji

Text-driven 3D stylization is a complex and crucial task in the fields of computer vision (CV) and computer graphics (CG), aimed at transforming a bare mesh to match a target text. Prior methods adopt text-independent multilayer perceptrons (MLPs) to predict the attributes of the target mesh under the supervision of a CLIP loss. However, such a text-independent architecture lacks textual guidance during attribute prediction, leading to unsatisfactory stylization and slow convergence. To address these limitations, we present X-Mesh, an innovative text-driven 3D stylization framework that incorporates a novel Text-guided Dynamic Attention Module (TDAM). The TDAM dynamically integrates the guidance of the target text by applying text-relevant spatial and channel-wise attention during vertex feature extraction, resulting in more accurate attribute prediction and faster convergence. Furthermore, existing works lack standard benchmarks and automated metrics for evaluation, often relying on subjective and non-reproducible user studies to assess the quality of stylized 3D assets. To overcome this limitation, we introduce a new standard text-mesh benchmark, namely MIT-30, and two automated metrics, which will enable future research to make fair and objective comparisons. Our extensive qualitative and quantitative experiments demonstrate that X-Mesh outperforms previous state-of-the-art methods.
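
As a rough illustration of text-guided channel-wise and spatial (per-vertex) attention in the spirit of TDAM, the sketch below gates per-vertex features with weights derived from a text embedding. The layer sizes, gating form, and class name are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class TextGuidedAttentionSketch(nn.Module):
    """Text-conditioned channel and per-vertex gating over mesh vertex features."""

    def __init__(self, vert_dim=256, text_dim=512):
        super().__init__()
        self.channel_gate = nn.Linear(text_dim, vert_dim)  # text -> channel weights
        self.spatial_gate = nn.Linear(text_dim, vert_dim)  # text -> per-vertex scores

    def forward(self, vert_feats, text_feat):
        # vert_feats: (V, C) per-vertex features; text_feat: (text_dim,)
        ch = torch.sigmoid(self.channel_gate(text_feat))                 # (C,)
        sp = torch.sigmoid(vert_feats @ self.spatial_gate(text_feat))    # (V,)
        return vert_feats * ch * sp.unsqueeze(-1)

attn = TextGuidedAttentionSketch()
out = attn(torch.randn(1000, 256), torch.randn(512))
print(out.shape)  # (1000, 256)
```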

* Technical report 

SpatialFormer: Semantic and Target Aware Attentions for Few-Shot Learning

Mar 15, 2023
Jinxiang Lai, Siqian Yang, Wenlong Wu, Tao Wu, Guannan Jiang, Xi Wang, Jun Liu, Bin-Bin Gao, Wei Zhang, Yuan Xie, Chengjie Wang

Recent Few-Shot Learning (FSL) methods emphasize generating discriminative embedding features to precisely measure the similarity between support and query sets. Current CNN-based cross-attention approaches generate discriminative representations by enhancing the mutually semantically similar regions of support and query pairs. However, this approach suffers from two problems: the CNN structure produces inaccurate attention maps based on local features, and mutually similar backgrounds cause distraction. To alleviate these problems, we design a novel SpatialFormer structure to generate more accurate attention regions based on global features. Unlike the traditional Transformer, which models intrinsic instance-level similarity and causes accuracy degradation in FSL, our SpatialFormer explores the semantic-level similarity between paired inputs to boost performance. We then derive two specific attention modules, named SpatialFormer Semantic Attention (SFSA) and SpatialFormer Target Attention (SFTA), to enhance the target object regions while reducing background distraction. In particular, SFSA highlights the regions with the same semantic information between paired features, and SFTA finds potential foreground object regions of a novel feature that are similar to base categories. Extensive experiments show that our methods are effective and achieve new state-of-the-art results on few-shot classification benchmarks.
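
To make the SFSA-style idea of mutual semantic enhancement concrete, the sketch below scores every spatial position of a support/query feature pair by its best global match in the other map and boosts well-matched positions. This is an illustration of the general idea under assumed shapes and a simple sigmoid gating, not the paper's module.

```python
import torch
import torch.nn.functional as F

def semantic_attention_sketch(support, query):
    """Mutually enhance semantically matching positions of a support/query pair.

    support, query: (C, H, W) feature maps of one support-query pair
    """
    C, H, W = support.shape
    s = F.normalize(support.reshape(C, -1), dim=0)   # (C, HW)
    q = F.normalize(query.reshape(C, -1), dim=0)     # (C, HW)
    sim = s.t() @ q                                  # (HW, HW) global cross-similarity
    # weight of each position = strength of its best match in the other map
    w_q = torch.sigmoid(sim.max(dim=0).values).reshape(1, H, W)
    w_s = torch.sigmoid(sim.max(dim=1).values).reshape(1, H, W)
    return support * (1 + w_s), query * (1 + w_q)

sup, qry = torch.randn(64, 10, 10), torch.randn(64, 10, 10)
es, eq = semantic_attention_sketch(sup, qry)
print(es.shape, eq.shape)
```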

* AAAI 2023  

Towards Efficient Visual Adaption via Structural Re-parameterization

Feb 16, 2023
Gen Luo, Minglang Huang, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, Zhiyu Wang, Rongrong Ji

Parameter-efficient transfer learning (PETL) is an emerging research topic aimed at inexpensively adapting large-scale pre-trained models to downstream tasks. Recent advances have achieved great success in saving storage costs for various vision tasks by updating or injecting a small number of parameters instead of full fine-tuning. However, we notice that most existing PETL methods still incur non-negligible latency during inference. In this paper, we propose a parameter-efficient and computationally friendly adapter for giant vision models, called RepAdapter. Specifically, we prove that the adaptation modules, even with a complex structure, can be seamlessly integrated into most giant vision models via structural re-parameterization. This property makes RepAdapter zero-cost during inference. In addition to computational efficiency, RepAdapter is more effective and lightweight than existing PETL methods owing to its sparse structure and our careful deployment. To validate RepAdapter, we conduct extensive experiments on 27 benchmark datasets across three vision tasks, i.e., image and video classification and semantic segmentation. Experimental results show the superior performance and efficiency of RepAdapter over state-of-the-art PETL methods. For instance, by updating only 0.6% of the parameters, we can improve the performance of ViT from 38.8 to 55.1 on Sun397. Its generalizability is also well validated on a range of vision models, i.e., ViT, CLIP, Swin-Transformer and ConvNeXt. Our source code is released at https://github.com/luogen1996/RepAdapter.
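
The zero-inference-cost claim rests on structural re-parameterization: a purely linear adapter can be folded into a neighboring projection after training. The sketch below demonstrates that merge for a simple residual linear adapter placed before an nn.Linear; the adapter structure is a simplified stand-in, not the exact RepAdapter design (see the repository above for the real one), but the merged layer reproduces the two-module forward pass exactly.

```python
import torch
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Purely linear residual adapter: y = x + up(down(x)). No nonlinearity,
    so it can be merged into a neighboring nn.Linear after training."""

    def __init__(self, dim=64, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=True)

    def forward(self, x):
        return x + self.up(self.down(x))

def merge_adapter_into_linear(adapter: LinearAdapter, linear: nn.Linear) -> nn.Linear:
    """Return a single nn.Linear equivalent to linear(adapter(x))."""
    dim = adapter.down.in_features
    # express the adapter as an affine map: x -> x @ A^T + c
    A = torch.eye(dim) + adapter.up.weight @ adapter.down.weight   # (dim, dim)
    c = adapter.up.bias                                            # (dim,)
    merged = nn.Linear(dim, linear.out_features)
    merged.weight.data = linear.weight @ A                         # (out, dim)
    merged.bias.data = linear.bias + linear.weight @ c
    return merged

# sanity check: the merged layer reproduces adapter + linear exactly
adapter, linear = LinearAdapter(), nn.Linear(64, 128)
merged = merge_adapter_into_linear(adapter, linear)
x = torch.randn(4, 64)
print(torch.allclose(linear(adapter(x)), merged(x), atol=1e-5))   # True
```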


Global Meets Local: Effective Multi-Label Image Classification via Category-Aware Weak Supervision

Nov 23, 2022
Jiawei Zhan, Jun Liu, Wei Tang, Guannan Jiang, Xi Wang, Bin-Bin Gao, Tianliang Zhang, Wenlong Wu, Wei Zhang, Chengjie Wang, Yuan Xie

Multi-label image classification, whose methods can be categorized into label-dependency and region-based approaches, is a challenging problem due to complex underlying object layouts. Although region-based methods are less likely to encounter issues with model generalizability than label-dependency methods, they often generate hundreds of meaningless or noisy proposals with non-discriminative information, and the contextual dependency among the localized regions is often ignored or over-simplified. This paper builds a unified framework to perform effective noisy-proposal suppression and to enable interaction between global and local features for robust feature learning. Specifically, we propose category-aware weak supervision that concentrates on non-existent categories so as to provide deterministic information for local feature learning, restricting the local branch to focus on higher-quality regions of interest. Moreover, we develop a cross-granularity attention module to explore the complementary information between global and local features, which can build high-order feature correlations containing not only global-to-local but also local-to-local relations. Both advantages guarantee a boost in the performance of the whole network. Extensive experiments on two large-scale datasets (MS-COCO and VOC 2007) demonstrate that our framework achieves superior performance over state-of-the-art methods.
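
The "deterministic information from non-existent categories" point is worth unpacking: in multi-label classification, categories absent from the image-level label are certain negatives, so local proposals can be pushed to score low on them. The sketch below illustrates one such loss under assumed shapes; the function name and exact form are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def category_aware_weak_supervision_loss(proposal_logits, image_labels):
    """Penalize proposal scores on categories known to be absent from the image.

    proposal_logits: (R, C) class logits of R localized regions
    image_labels:    (C,)  binary image-level multi-label vector
    """
    absent = (image_labels == 0).float()                  # deterministic negatives
    probs = torch.sigmoid(proposal_logits)
    # binary cross-entropy toward 0, applied only on absent categories
    loss = F.binary_cross_entropy(probs, torch.zeros_like(probs), reduction="none")
    return (loss * absent).sum() / absent.sum().clamp(min=1.0)

logits = torch.randn(20, 80)             # 20 proposals, 80 categories (e.g. MS-COCO)
labels = torch.zeros(80); labels[[0, 15, 37]] = 1.0
print(category_aware_weak_supervision_loss(logits, labels))
```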

* Proceedings of the 30th ACM International Conference on Multimedia. 2022: 6318-6326  
* 12 pages, 10 figures, published in ACMMM 2022 

Rethinking the Metric in Few-shot Learning: From an Adaptive Multi-Distance Perspective

Nov 02, 2022
Jinxiang Lai, Siqian Yang, Guannan Jiang, Xi Wang, Yuxi Li, Zihui Jia, Xiaochen Chen, Jun Liu, Bin-Bin Gao, Wei Zhang, Yuan Xie, Chengjie Wang

The few-shot learning problem focuses on recognizing unseen classes given a few labeled images. Recent efforts pay more attention to fine-grained feature embedding while ignoring the relationships among different distance metrics. In this paper, for the first time, we investigate the contributions of different distance metrics and propose an adaptive fusion scheme that brings significant improvements in few-shot classification. We start from a naive baseline of confidence summation and demonstrate the necessity of exploiting the complementary property of different distance metrics. Having identified the competition problem among them, we build upon this baseline and propose an Adaptive Metrics Module (AMM) that decouples metric fusion into metric-prediction fusion and metric-loss fusion. The former encourages mutual complementarity, while the latter alleviates metric competition via multi-task collaborative learning. Based on AMM, we design a few-shot classification framework, AMTNet, comprising the AMM and a Global Adaptive Loss (GAL), to jointly optimize the few-shot task and an auxiliary self-supervised task, making the embedding features more robust. In experiments, the proposed AMM achieves 2% higher performance than the naive metric fusion module, and our AMTNet outperforms state-of-the-art methods on multiple benchmark datasets.
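
As a toy illustration of metric-prediction fusion, the sketch below combines cosine-similarity and (negated) Euclidean-distance classifiers over class prototypes with learnable fusion weights. It simplifies the paper's AMM (which additionally fuses per-metric losses); the class name and metric choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMetricFusionSketch(nn.Module):
    """Fuse predictions from complementary distance metrics with learnable weights."""

    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(2))      # one weight per metric

    def forward(self, query, prototypes):
        # query: (Q, D) query embeddings; prototypes: (N, D) class prototypes
        cos = F.normalize(query, dim=-1) @ F.normalize(prototypes, dim=-1).t()
        euc = -torch.cdist(query, prototypes)      # negated distance = similarity
        preds = torch.stack([cos.softmax(-1), euc.softmax(-1)], dim=0)  # (2, Q, N)
        alpha = self.w.softmax(0).view(2, 1, 1)
        return (alpha * preds).sum(0)              # fused class probabilities

fusion = AdaptiveMetricFusionSketch()
probs = fusion(torch.randn(10, 64), torch.randn(5, 64))
print(probs.shape, probs.sum(-1))                  # (10, 5), rows sum to 1
```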

* Proceedings of the 30th ACM International Conference on Multimedia 2022  

Class-Aware Contrastive Semi-Supervised Learning

Mar 24, 2022
Fan Yang, Kai Wu, Shuyi Zhang, Guannan Jiang, Yong Liu, Feng Zheng, Wei Zhang, Chengjie Wang, Long Zeng

Pseudo-label-based semi-supervised learning (SSL) has achieved great success in exploiting raw data. However, its training procedure suffers from confirmation bias due to the noise contained in self-generated artificial labels. Moreover, the model's judgment becomes noisier in real-world applications with extensive out-of-distribution data. To address this issue, we propose a general method named Class-aware Contrastive Semi-Supervised Learning (CCSSL), a drop-in helper that improves pseudo-label quality and enhances the model's robustness in real-world settings. Rather than treating real-world data as a single union set, our method separately handles reliable in-distribution data with class-wise clustering for blending into downstream tasks, and noisy out-of-distribution data with image-wise contrastive learning for better generalization. Furthermore, by applying target re-weighting, we emphasize clean-label learning while simultaneously reducing noisy-label learning. Despite its simplicity, the proposed CCSSL yields significant performance improvements over state-of-the-art SSL methods on the standard datasets CIFAR100 and STL10. On the real-world dataset Semi-iNat 2021, we improve FixMatch by 9.80% and CoMatch by 3.18%. Code is available at https://github.com/TencentYoutuResearch/Classification-SemiCLS.
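
To illustrate the general idea of target re-weighting on unlabeled data, the sketch below weights the pseudo-label loss of each sample by the teacher's confidence, giving confident ("clean") targets full weight and down-weighting uncertain ones. The threshold, weighting form, and function name are assumptions; CCSSL's exact scheme differs (see the repository above).

```python
import torch
import torch.nn.functional as F

def reweighted_pseudo_label_loss(student_logits, teacher_probs, threshold=0.95):
    """Confidence-based re-weighting of the unlabeled (pseudo-label) loss.

    student_logits: (B, C) predictions on strongly augmented views
    teacher_probs:  (B, C) soft targets from weakly augmented views
    """
    conf, pseudo = teacher_probs.max(dim=-1)
    weight = torch.where(conf >= threshold, torch.ones_like(conf), conf ** 2)
    loss = F.cross_entropy(student_logits, pseudo, reduction="none")
    return (weight * loss).mean()

logits = torch.randn(16, 100)
targets = torch.softmax(torch.randn(16, 100) * 3, dim=-1)
print(reweighted_pseudo_label_loss(logits, targets))
```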

* Accepted by CVPR 2022; half an extra page added for rebuttal information 