Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yunchao Wei

CoinSeg: Contrast Inter- and Intra- Class Representations for Incremental Segmentation

Oct 10, 2023

Zekang Zhang, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei

Figure 1 for CoinSeg: Contrast Inter- and Intra- Class Representations for Incremental Segmentation

Figure 2 for CoinSeg: Contrast Inter- and Intra- Class Representations for Incremental Segmentation

Figure 3 for CoinSeg: Contrast Inter- and Intra- Class Representations for Incremental Segmentation

Figure 4 for CoinSeg: Contrast Inter- and Intra- Class Representations for Incremental Segmentation

Abstract:Class incremental semantic segmentation aims to strike a balance between the model's stability and plasticity by maintaining old knowledge while adapting to new concepts. However, most state-of-the-art methods use the freeze strategy for stability, which compromises the model's plasticity.In contrast, releasing parameter training for plasticity could lead to the best performance for all categories, but this requires discriminative feature representation.Therefore, we prioritize the model's plasticity and propose the Contrast inter- and intra-class representations for Incremental Segmentation (CoinSeg), which pursues discriminative representations for flexible parameter tuning. Inspired by the Gaussian mixture model that samples from a mixture of Gaussian distributions, CoinSeg emphasizes intra-class diversity with multiple contrastive representation centroids. Specifically, we use mask proposals to identify regions with strong objectness that are likely to be diverse instances/centroids of a category. These mask proposals are then used for contrastive representations to reinforce intra-class diversity. Meanwhile, to avoid bias from intra-class diversity, we also apply category-level pseudo-labels to enhance category-level consistency and inter-category diversity. Additionally, CoinSeg ensures the model's stability and alleviates forgetting through a specific flexible tuning strategy. We validate CoinSeg on Pascal VOC 2012 and ADE20K datasets with multiple incremental scenarios and achieve superior results compared to previous state-of-the-art methods, especially in more challenging and realistic long-term scenarios. Code is available at https://github.com/zkzhang98/CoinSeg.

* Accepted by ICCV 2023

Via

Access Paper or Ask Questions

Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

Sep 30, 2023

Siyu Jiao, Yunchao Wei, Yaowei Wang, Yao Zhao, Humphrey Shi

Figure 1 for Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

Figure 2 for Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

Figure 3 for Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

Figure 4 for Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

Abstract:Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. Typical solutions follow the paradigm of first generating mask proposals and then adopting CLIP to classify them. To maintain the CLIP's zero-shot transferability, previous practices favour to freeze CLIP during training. However, in the paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This insensitivity results in numerous false positives when classifying mask proposals. This issue mainly relates to the fact that CLIP is trained with image-level supervision. To alleviate this issue, we propose a simple yet effective method, named Mask-aware Fine-tuning (MAFT). Specifically, Image-Proposals CLIP Encoder (IP-CLIP Encoder) is proposed to handle arbitrary numbers of image and mask proposals simultaneously. Then, mask-aware loss and self-distillation loss are designed to fine-tune IP-CLIP Encoder, ensuring CLIP is responsive to different mask proposals while not sacrificing transferability. In this way, mask-aware representations can be easily learned to make the true positives stand out. Notably, our solution can seamlessly plug into most existing methods without introducing any new parameters during the fine-tuning process. We conduct extensive experiments on the popular zero-shot benchmarks. With MAFT, the performance of the state-of-the-art methods is promoted by a large margin: 50.4% (+ 8.2%) on COCO, 81.8% (+ 3.2%) on Pascal-VOC, and 8.7% (+4.3%) on ADE20K in terms of mIoU for unseen classes. The code is available at https://github.com/jiaosiyu1999/MAFT.git.

* NeurIPS 2023

Via

Access Paper or Ask Questions

Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation

Sep 18, 2023

Huan Liu, Zichang Tan, Qiang Chen, Yunchao Wei, Yao Zhao, Jingdong Wang

Figure 1 for Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation

Figure 2 for Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation

Figure 3 for Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation

Figure 4 for Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation

Abstract:Detecting and grounding multi-modal media manipulation (DGM^4) has become increasingly crucial due to the widespread dissemination of face forgery and text misinformation. In this paper, we present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM^4 problem. Unlike previous state-of-the-art methods that solely focus on the image (RGB) domain to describe visual forgery features, we additionally introduce the frequency domain as a complementary viewpoint. By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts. Then, our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands. Moreover, to address the semantic conflicts between image and frequency domains, the forgery-aware mutual module is developed to further enable the effective interaction of disparate image and frequency features, resulting in aligned and comprehensive visual forgery representations. Finally, based on visual and textual forgery features, we propose a unified decoder that comprises two symmetric cross-modal interaction modules responsible for gathering modality-specific forgery information, along with a fusing interaction module for aggregation of both modalities. The proposed unified decoder formulates our UFAFormer as a unified framework, ultimately simplifying the overall architecture and facilitating the optimization process. Experimental results on the DGM^4 dataset, containing several perturbations, demonstrate the superior performance of our framework compared to previous methods, setting a new benchmark in the field.

Via

Access Paper or Ask Questions

CLE Diffusion: Controllable Light Enhancement Diffusion Model

Aug 28, 2023

Yuyang Yin, Dejia Xu, Chuangchuang Tan, Ping Liu, Yao Zhao, Yunchao Wei

Figure 1 for CLE Diffusion: Controllable Light Enhancement Diffusion Model

Figure 2 for CLE Diffusion: Controllable Light Enhancement Diffusion Model

Figure 3 for CLE Diffusion: Controllable Light Enhancement Diffusion Model

Figure 4 for CLE Diffusion: Controllable Light Enhancement Diffusion Model

Abstract:Low light enhancement has gained increasing importance with the rapid development of visual creation and editing. However, most existing enhancement algorithms are designed to homogeneously increase the brightness of images to a pre-defined extent, limiting the user experience. To address this issue, we propose Controllable Light Enhancement Diffusion Model, dubbed CLE Diffusion, a novel diffusion framework to provide users with rich controllability. Built with a conditional diffusion model, we introduce an illumination embedding to let users control their desired brightness level. Additionally, we incorporate the Segment-Anything Model (SAM) to enable user-friendly region controllability, where users can click on objects to specify the regions they wish to enhance. Extensive experiments demonstrate that CLE Diffusion achieves competitive performance regarding quantitative metrics, qualitative results, and versatile controllability. Project page: https://yuyangyin.github.io/CLEDiffusion/

* Accepted In Proceedings of the 31st ACM International Conference on Multimedia (MM' 23)

Via

Access Paper or Ask Questions

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

Aug 20, 2023

Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, Yunchao Wei

Abstract:The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and textual modalities effectively while comprehending human instructions. Current methodologies often rely on annotations derived from benchmark datasets to construct image-dialogue datasets for training purposes, akin to instruction tuning in LLMs. However, these datasets often exhibit domain bias, potentially constraining the generative capabilities of the models. In an effort to mitigate these limitations, we propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models to yield a diverse and controllable dataset with varied image content. This not only provides greater flexibility compared to existing methodologies but also significantly enhances several model capabilities. Our research includes comprehensive experiments conducted on various datasets using the open-source LLAVA model as a testbed for our proposed pipeline. Our results underscore marked enhancements across more than ten commonly assessed capabilities,

* Project page: https://github.com/icoz69/StableLLAVA

Via

Access Paper or Ask Questions

CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation

Aug 14, 2023

Hongguang Zhu, Yunchao Wei, Xiaodan Liang, Chunjie Zhang, Yao Zhao

Figure 1 for CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation

Figure 2 for CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation

Figure 3 for CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation

Figure 4 for CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation

Abstract:Vision-Language Pretraining (VLP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets. Regarding the growing nature of real-world data, such an offline training paradigm on ever-expanding data is unsustainable, because models lack the continual learning ability to accumulate knowledge constantly. However, most continual learning studies are limited to uni-modal classification and existing multi-modal datasets cannot simulate continual non-stationary data stream scenarios. To support the study of Vision-Language Continual Pretraining (VLCP), we first contribute a comprehensive and unified benchmark dataset P9D which contains over one million product image-text pairs from 9 industries. The data from each industry as an independent task supports continual learning and conforms to the real-world long-tail nature to simulate pretraining on web data. We comprehensively study the characteristics and challenges of VLCP, and propose a new algorithm: Compatible momentum contrast with Topology Preservation, dubbed CTP. The compatible momentum model absorbs the knowledge of the current and previous-task models to flexibly update the modal feature. Moreover, Topology Preservation transfers the knowledge of embedding across tasks while preserving the flexibility of feature adjustment. The experimental results demonstrate our method not only achieves superior performance compared with other baselines but also does not bring an expensive training burden. Dataset and codes are available at https://github.com/KevinLight831/CTP.

* Accepted by ICCV 2023. Code: https://github.com/KevinLight831/CTP

Via

Access Paper or Ask Questions

Disentangled Pre-training for Image Matting

Apr 03, 2023

Yanda Li, Zilong Huang, Gang Yu, Ling Chen, Yunchao Wei, Jianbo Jiao

Figure 1 for Disentangled Pre-training for Image Matting

Figure 2 for Disentangled Pre-training for Image Matting

Figure 3 for Disentangled Pre-training for Image Matting

Figure 4 for Disentangled Pre-training for Image Matting

Abstract:Image matting requires high-quality pixel-level human annotations to support the training of a deep model in recent literature. Whereas such annotation is costly and hard to scale, significantly holding back the development of the research. In this work, we make the first attempt towards addressing this problem, by proposing a self-supervised pre-training approach that can leverage infinite numbers of data to boost the matting performance. The pre-training task is designed in a similar manner as image matting, where random trimap and alpha matte are generated to achieve an image disentanglement objective. The pre-trained model is then used as an initialisation of the downstream matting task for fine-tuning. Extensive experimental evaluations show that the proposed approach outperforms both the state-of-the-art matting methods and other alternative self-supervised initialisation approaches by a large margin. We also show the robustness of the proposed approach over different backbone architectures. The code and models will be publicly available.

Via

Access Paper or Ask Questions

Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning

Mar 27, 2023

Man Liu, Feng Li, Chunjie Zhang, Yunchao Wei, Huihui Bai, Yao Zhao

Abstract:Generalized Zero-Shot Learning (GZSL) identifies unseen categories by knowledge transferred from the seen domain, relying on the intrinsic interactions between visual and semantic information. Prior works mainly localize regions corresponding to the sharing attributes. When various visual appearances correspond to the same attribute, the sharing attributes inevitably introduce semantic ambiguity, hampering the exploration of accurate semantic-visual interactions. In this paper, we deploy the dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between attribute prototypes and visual features, constituting a progressive semantic-visual mutual adaption (PSVMA) network for semantic disambiguation and knowledge transferability improvement. Specifically, DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes to adapt to different images, enabling the recast of the unmatched semantic-visual pair into the matched one. Then, a semantic-motivated instance decoder strengthens accurate cross-domain interactions between the matched pair for semantic-related instance adaption, encouraging the generation of unambiguous visual representations. Moreover, to mitigate the bias towards seen classes in GZSL, a debiasing loss is proposed to pursue response consistency between seen and unseen predictions. The PSVMA consistently yields superior performances against other state-of-the-art methods. Code will be available at: https://github.com/ManLiuCoder/PSVMA.

* Accepted by CVPR2023

Via

Access Paper or Ask Questions

Global Knowledge Calibration for Fast Open-Vocabulary Segmentation

Mar 16, 2023

Kunyang Han, Yong Liu, Jun Hao Liew, Henghui Ding, Yunchao Wei, Jiajun Liu, Yitong Wang, Yansong Tang, Yujiu Yang, Jiashi Feng(+1 more)

Abstract:Recent advancements in pre-trained vision-language models, such as CLIP, have enabled the segmentation of arbitrary concepts solely from textual inputs, a process commonly referred to as open-vocabulary semantic segmentation (OVS). However, existing OVS techniques confront a fundamental challenge: the trained classifier tends to overfit on the base classes observed during training, resulting in suboptimal generalization performance to unseen classes. To mitigate this issue, recent studies have proposed the use of an additional frozen pre-trained CLIP for classification. Nonetheless, this approach incurs heavy computational overheads as the CLIP vision encoder must be repeatedly forward-passed for each mask, rendering it impractical for real-world applications. To address this challenge, our objective is to develop a fast OVS model that can perform comparably or better without the extra computational burden of the CLIP image encoder during inference. To this end, we propose a core idea of preserving the generalizable representation when fine-tuning on known classes. Specifically, we introduce a text diversification strategy that generates a set of synonyms for each training category, which prevents the learned representation from collapsing onto specific known category names. Additionally, we employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP. Extensive experiments demonstrate that our proposed model achieves robust generalization performance across various datasets. Furthermore, we perform a preliminary exploration of open-vocabulary video segmentation and present a benchmark that can facilitate future open-vocabulary research in the video domain.

Via

Access Paper or Ask Questions

SLCA: Slow Learner with Classifier Alignment for Continual Learning on a Pre-trained Model

Mar 09, 2023

Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, Yunchao Wei

Abstract:The goal of continual learning is to improve the performance of recognition models in learning sequentially arrived data. Although most existing works are established on the premise of learning from scratch, growing efforts have been devoted to incorporating the benefits of pre-training. However, how to adaptively exploit the pre-trained knowledge for each incremental task while maintaining its generalizability remains an open question. In this work, we present an extensive analysis for continual learning on a pre-trained model (CLPM), and attribute the key challenge to a progressive overfitting problem. Observing that selectively reducing the learning rate can almost resolve this issue in the representation layer, we propose a simple but extremely effective approach named Slow Learner with Classifier Alignment (SLCA), which further improves the classification layer by modeling the class-wise distributions and aligning the classification layers in a post-hoc fashion. Across a variety of scenarios, our proposal provides substantial improvements for CLPM (e.g., up to 49.76%, 50.05%, 44.69% and 40.16% on Split CIFAR-100, Split ImageNet-R, Split CUB-200 and Split Cars-196, respectively), and thus outperforms state-of-the-art approaches by a large margin. Based on such a strong baseline, critical factors and promising directions are analyzed in-depth to facilitate subsequent research.

* Tech report. 11 pages, 8 figures

Via

Access Paper or Ask Questions