Longhui Wei

Degeneration-Tuning: Using Scrambled Grid shield Unwanted Concepts from Stable Diffusion

Aug 08, 2023
Zixuan Ni, Longhui Wei, Jiacheng Li, Siliang Tang, Yueting Zhuang, Qi Tian

Owing to the unrestricted nature of the content in the training data, large text-to-image diffusion models, such as Stable Diffusion (SD), are capable of generating images with potentially copyrighted or dangerous content based on the corresponding textual concepts, including specific intellectual property (IP), human faces, and various artistic styles. However, Negative Prompt, a widely used method for content removal, frequently fails to conceal such content due to inherent limitations in its inference logic. In this work, we propose a novel strategy named Degeneration-Tuning (DT) to shield unwanted concepts from the SD weights. By utilizing a Scrambled Grid to reconstruct the correlation between undesired concepts and their corresponding image domain, we guide SD to generate meaningless content when such textual concepts are provided as input. Because this adaptation occurs at the level of the model's weights, the SD model after DT can be grafted onto other conditional diffusion frameworks, such as ControlNet, to shield unwanted concepts. In addition to qualitatively showcasing the effectiveness of DT in protecting various types of concepts, a quantitative comparison of SD before and after DT indicates that the method does not significantly impact the generative quality of other content: the FID and IS scores of the model on COCO-30K shift only slightly after DT, from 12.61 and 39.20 to 13.04 and 38.25, respectively, clearly outperforming previous methods.
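
As a rough illustration of the data-side operation described above, the sketch below scrambles non-overlapping patches of an image tensor; pairing the unwanted concept's prompts with such scrambled targets during fine-tuning is the intuition behind DT. The patch size, function name, and shuffling policy are assumptions for illustration, not the authors' code.

```python
import torch

def scramble_grid(images: torch.Tensor, patch: int = 32, seed=None) -> torch.Tensor:
    """Randomly permute non-overlapping patches of each image in a batch.

    images: (B, C, H, W) with H and W divisible by `patch`.
    The scrambled images keep local texture but destroy global structure,
    which is the kind of "meaningless" target the abstract alludes to.
    Patch size is an illustrative assumption.
    """
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    # Cut into patches: (B, gh*gw, C, patch, patch)
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, gh * gw, c, patch, patch)
    g = torch.Generator().manual_seed(seed) if seed is not None else None
    perm = torch.randperm(gh * gw, generator=g)
    patches = patches[:, perm]
    # Reassemble the shuffled grid back into full images.
    patches = patches.reshape(b, gh, gw, c, patch, patch).permute(0, 3, 1, 4, 2, 5)
    return patches.reshape(b, c, h, w)

if __name__ == "__main__":
    x = torch.randn(2, 3, 256, 256)
    print(scramble_grid(x, patch=32, seed=0).shape)  # torch.Size([2, 3, 256, 256])
```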

* ACM MM 2023  

SDDM: Score-Decomposed Diffusion Models on Manifolds for Unpaired Image-to-Image Translation

Aug 04, 2023
Shikun Sun, Longhui Wei, Junliang Xing, Jia Jia, Qi Tian

Recent score-based diffusion models (SBDMs) show promising results in unpaired image-to-image translation (I2I). However, existing methods, whether energy-based or statistics-based, provide no explicit form of the interfered intermediate generative distributions. This work presents a new score-decomposed diffusion model (SDDM) on manifolds that explicitly optimizes the tangled distributions during image generation. SDDM derives manifolds to make the distributions of adjacent time steps separable and decomposes the score function or energy guidance into an image "denoising" part and a content "refinement" part. To refine the image at the same noise level, we equalize the refinement parts of the score function and energy guidance, which permits multi-objective optimization on the manifold. We also leverage a block adaptive instance normalization module to construct manifolds of lower dimension that remain concentrated around the perturbed reference image. SDDM outperforms existing SBDM-based methods with far fewer diffusion steps on several I2I benchmarks.
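
As a loose, heavily simplified picture of the decomposition idea (not the paper's formulation): split each guidance vector into a component along a denoising direction and an orthogonal "refinement" remainder, and blend only the refinement parts. All names and the blending rule below are illustrative assumptions.

```python
import torch

def decompose(guidance: torch.Tensor, denoise_dir: torch.Tensor):
    """Split a (B, D) guidance vector into the component along the denoising
    direction and the orthogonal 'refinement' remainder. denoise_dir is
    assumed non-zero; this is a toy stand-in for the manifold construction."""
    d = denoise_dir / denoise_dir.norm(dim=-1, keepdim=True)
    denoise_part = (guidance * d).sum(-1, keepdim=True) * d
    return denoise_part, guidance - denoise_part

def combine(score, energy_grad, denoise_dir, w: float = 0.5):
    """Keep the score's denoising part and blend the two refinement parts,
    loosely mirroring the idea of equalizing refinement terms."""
    s_den, s_ref = decompose(score, denoise_dir)
    _, e_ref = decompose(energy_grad, denoise_dir)
    return s_den + (1 - w) * s_ref + w * e_ref
```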

Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models

Jun 14, 2023
Lingxi Xie, Longhui Wei, Xiaopeng Zhang, Kaifeng Bi, Xiaotao Gu, Jianlong Chang, Qi Tian

The AI community has been pursuing algorithms known as artificial general intelligence (AGI) that apply to any kind of real-world problem. Recently, chat systems powered by large language models (LLMs) have emerged and rapidly become a promising direction toward AGI in natural language processing (NLP), but the path toward AGI in computer vision (CV) remains unclear. One may attribute the dilemma to the fact that visual signals are more complex than language signals, yet we are interested in finding concrete reasons, as well as absorbing lessons from GPT and LLMs to solve the problem. In this paper, we start with a conceptual definition of AGI and briefly review how NLP solves a wide range of tasks via a chat system. The analysis suggests that unification is the next important goal of CV. However, despite various efforts in this direction, CV is still far from a system like GPT that naturally integrates all tasks. We point out that the essential weakness of CV lies in lacking a paradigm to learn from environments, whereas NLP has accomplished this in the text world. We then imagine a pipeline that places a CV algorithm (i.e., an agent) in world-scale, interactable environments, pre-trains it to predict future frames with respect to its actions, and then fine-tunes it with instructions to accomplish various tasks. We expect substantial research and engineering efforts to push this idea forward and scale it up, for which we share our perspectives on future research directions.

* 17 pages, 14 figures, technical report, expected to be updated in the near future 

Continual Vision-Language Representation Learning with Off-Diagonal Information

May 17, 2023
Zixuan Ni, Longhui Wei, Siliang Tang, Yueting Zhuang, Qi Tian

Large-scale multi-modal contrastive learning frameworks like CLIP typically require a large number of image-text samples for training. However, in real scenarios these samples are collected continuously. This paper discusses the feasibility of continual CLIP training using streaming data. Unlike continual learning based on self-supervised methods for pure images, which is empirically robust against catastrophic forgetting, CLIP's performance degradation in the continual setting is significant and non-negligible. By analyzing the changes in the model's representation space during continual CLIP training from a spatial-geometry perspective, we explore and summarize these spatial variations as Spatial Disorder (SD), which can be divided into Intra-modal Rotation and Inter-modal Deviation. Moreover, we empirically and theoretically demonstrate how SD leads to a performance decline for CLIP on cross-modal retrieval tasks. To alleviate SD, we propose a new continual vision-language representation learning framework, Mod-X: Maintain off-diagonal information-matriX. By selectively aligning the off-diagonal information distribution of contrastive matrices, Mod-X improves the capability of the multi-modal model by maintaining the alignment of the multi-modal representation space on the old data domain while continually fitting the new training data domain. Experiments on commonly used datasets of different scales and scopes demonstrate the effectiveness of our method.
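
A minimal sketch of the kind of off-diagonal alignment the abstract describes, assuming (B, B) image-text similarity matrices from the current model and a frozen copy trained on the old domain; the exact selection and weighting used in Mod-X are not reproduced here.

```python
import torch
import torch.nn.functional as F

def offdiag_alignment_loss(logits_new: torch.Tensor,
                           logits_old: torch.Tensor,
                           tau: float = 0.05) -> torch.Tensor:
    """Toy distillation term in the spirit of Mod-X: keep the off-diagonal
    structure of the new model's image-text similarity matrix close to the
    old (frozen) model's. logits_*: (B, B) cosine-similarity matrices."""
    b = logits_new.size(0)
    mask = ~torch.eye(b, dtype=torch.bool, device=logits_new.device)
    # Row-wise distributions over the off-diagonal entries only.
    p_old = F.softmax(logits_old[mask].view(b, b - 1) / tau, dim=-1)
    log_p_new = F.log_softmax(logits_new[mask].view(b, b - 1) / tau, dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean")
```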

* ICML 2023  

Learning Transferable Pedestrian Representation from Multimodal Information Supervision

Apr 12, 2023
Liping Bao, Longhui Wei, Xiaoyu Qiu, Wengang Zhou, Houqiang Li, Qi Tian

Recent research on unsupervised person re-identification (reID) has demonstrated that pre-training on unlabeled person images achieves superior performance on downstream reID tasks compared with pre-training on ImageNet. However, those pre-training methods are specifically designed for reID and adapt poorly to other pedestrian analysis tasks. In this paper, we propose VAL-PAT, a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information. To train our framework, we introduce three learning objectives, i.e., self-supervised contrastive learning, image-text contrastive learning, and multi-attribute classification. Self-supervised contrastive learning facilitates the learning of intrinsic pedestrian properties, while image-text contrastive learning guides the model to focus on the appearance information of pedestrians. Meanwhile, multi-attribute classification encourages the model to recognize attributes and thereby excavate fine-grained pedestrian information. We first perform pre-training on the LUPerson-TA dataset, where each image carries text and attribute annotations, and then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition, and text-based person search. Extensive experiments demonstrate that our framework facilitates the learning of general pedestrian representations and thus leads to promising results on various pedestrian analysis tasks.
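
A compact sketch of how the three objectives could be combined into one training loss; the loss weights, temperature, and tensor shapes are assumptions for illustration, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of L2-normalized embeddings (B, D)."""
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def val_pat_style_loss(img_v1, img_v2, img_emb, txt_emb, attr_logits, attr_labels,
                       w_ssl=1.0, w_itc=1.0, w_attr=1.0):
    """Weighted sum of the three objectives named in the abstract:
    self-supervised contrastive (two augmented views), image-text contrastive,
    and multi-attribute classification (multi-label BCE)."""
    ssl = info_nce(img_v1, img_v2)    # intrinsic pedestrian properties
    itc = info_nce(img_emb, txt_emb)  # appearance information via text supervision
    attr = F.binary_cross_entropy_with_logits(attr_logits, attr_labels.float())
    return w_ssl * ssl + w_itc * itc + w_attr * attr
```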

Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models

Mar 12, 2023
Juncheng Li, Minghe Gao, Longhui Wei, Siliang Tang, Wenqiao Zhang, Mengze Li, Wei Ji, Qi Tian, Tat-Seng Chua, Yueting Zhuang

Prompt tuning, a recently emerging paradigm, enables powerful vision-language pre-training models to adapt to downstream tasks in a parameter- and data-efficient way by learning "soft prompts" that condition frozen pre-training models. Though effective, it is particularly problematic in the few-shot scenario, where prompt tuning is sensitive to initialization and requires a time-consuming search for a good one, restricting the fast-adaptation ability of the pre-training models. In addition, prompt tuning can undermine the generalizability of the pre-training models, because the learnable prompt tokens easily overfit to the limited training samples. To address these issues, we introduce a novel Gradient-RegulAted Meta-prompt learning (GRAM) framework that jointly meta-learns an efficient soft-prompt initialization for better adaptation and a lightweight gradient-regulating function for strong cross-domain generalizability, using only unlabeled image-text pre-training data. Rather than being a specific prompt tuning method, GRAM can be easily incorporated into various prompt tuning methods in a model-agnostic way, and comprehensive experiments show that GRAM brings consistent improvements in several settings (i.e., few-shot learning, cross-domain generalization, cross-dataset generalization, etc.) over 11 datasets. Further experiments show that GRAM enables the orthogonal methods of textual and visual prompt tuning to work in a mutually enhanced way, offering better generalizability than uni-modal prompt tuning methods.
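
One way to picture the first component (meta-learning a soft-prompt initialization) is a first-order, Reptile-style loop over pseudo-tasks built from unlabeled image-text data. This is an illustrative approximation, not GRAM's actual algorithm, and it omits the gradient-regulating function entirely; all names and hyperparameters are assumptions.

```python
import torch

def meta_prompt_init(tasks, prompt_dim=512, n_tokens=4, inner_steps=5,
                     inner_lr=1e-2, meta_lr=1e-1, meta_epochs=100):
    """Reptile-style sketch of meta-learning a soft-prompt initialization.
    `tasks` is a list of differentiable loss functions loss_fn(prompt), each
    built from a sampled pseudo-task over unlabeled image-text pairs."""
    meta_prompt = torch.zeros(n_tokens, prompt_dim)
    for _ in range(meta_epochs):
        for loss_fn in tasks:
            prompt = meta_prompt.clone().requires_grad_(True)
            for _ in range(inner_steps):  # task-specific adaptation
                loss = loss_fn(prompt)
                grad, = torch.autograd.grad(loss, prompt)
                prompt = (prompt - inner_lr * grad).detach().requires_grad_(True)
            # Move the shared initialization toward the adapted prompt.
            meta_prompt += meta_lr * (prompt.detach() - meta_prompt)
    return meta_prompt
```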

Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding

Mar 07, 2023
Jiacheng Li, Longhui Wei, ZongYuan Zhan, Xin He, Siliang Tang, Qi Tian, Yueting Zhuang

Generative transformers have shown advantages in synthesizing high-fidelity, high-resolution images, such as good diversity and training stability. However, they suffer from slow generation because they must produce a long token sequence autoregressively. To accelerate generative transformers while keeping good generation quality, we propose Lformer, a semi-autoregressive text-to-image generation model. Lformer first encodes an image into $h{\times}h$ discrete tokens, then divides these tokens into $h$ mirrored L-shaped blocks from the top left to the bottom right and decodes the tokens within a block in parallel at each step. Like autoregressive models, Lformer predicts the area adjacent to the previous context, so it remains stable while being faster. By leveraging the 2D structure of image tokens, Lformer achieves faster speed than existing transformer-based methods while keeping good generation quality. Moreover, the pretrained Lformer can edit images without finetuning: we can roll back to early steps for regeneration, or edit an image with a bounding box and a text prompt.
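
The block layout described above can be written down directly: block k contains every grid position (i, j) with max(i, j) = k, giving h nested L-shapes that expand from the top-left to the bottom-right corner. The sketch below only illustrates this indexing, not the decoding model itself.

```python
def l_shaped_blocks(h: int):
    """Partition an h-by-h token grid into h nested L-shaped blocks.
    Block k holds all positions (i, j) with max(i, j) == k; tokens inside a
    block are decoded in parallel, so generation takes h steps instead of h*h."""
    return [[(i, j) for i in range(k + 1) for j in range(k + 1)
             if max(i, j) == k] for k in range(h)]

if __name__ == "__main__":
    for k, block in enumerate(l_shaped_blocks(4)):
        print(f"step {k}: {len(block)} tokens -> {block}")
    # 1 + 3 + 5 + 7 tokens = 16 = 4 * 4, covering the whole grid
```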

Integrally Pre-Trained Transformer Pyramid Networks

Nov 23, 2022
Yunjie Tian, Lingxi Xie, Zhaozhi Wang, Longhui Wei, Xiaopeng Zhang, Jianbin Jiao, Yaowei Wang, Qi Tian, Qixiang Ye

In this paper, we present an integral pre-training framework based on masked image modeling (MIM). We advocate pre-training the backbone and neck jointly so that the transfer gap between MIM and downstream recognition tasks is minimal. We make two technical contributions. First, we unify the reconstruction and recognition necks by inserting a feature pyramid into the pre-training stage. Second, we complement MIM with masked feature modeling (MFM), which offers multi-stage supervision to the feature pyramid. The pre-trained models, termed integrally pre-trained transformer pyramid networks (iTPNs), serve as powerful foundation models for visual recognition. In particular, the base/large-level iTPN achieves 86.2%/87.8% top-1 accuracy on ImageNet-1K, 53.2%/55.6% box AP on COCO object detection with a 1x training schedule using Mask R-CNN, and 54.7%/57.7% mIoU on ADE20K semantic segmentation using UPerHead -- all setting new records. We hope our work inspires the community to work on unifying upstream pre-training and downstream fine-tuning tasks. Code and pre-trained models will be released at https://github.com/sunsmarterjie/iTPN.
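
A schematic of how the two pre-training signals might be combined into a single loss, assuming per-patch pixel targets for MIM and per-stage feature targets for MFM; the shapes, targets, and weighting below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def itpn_style_loss(pred_pixels, target_pixels, pyramid_feats, target_feats,
                    mask, w_mfm=1.0):
    """Combine masked image modeling (pixel reconstruction on masked patches)
    with masked feature modeling (regressing multi-stage pyramid features
    toward target features).
      pred_pixels, target_pixels: (B, N, P) flattened patch pixels
      mask: (B, N) bool, True for masked patches
      pyramid_feats, target_feats: lists of (B, N_s, C_s), one per pyramid stage"""
    mim = (F.mse_loss(pred_pixels, target_pixels, reduction="none")
             .mean(-1)[mask].mean())
    mfm = sum(F.smooth_l1_loss(p, t)
              for p, t in zip(pyramid_feats, target_feats)) / len(pyramid_feats)
    return mim + w_mfm * mfm
```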

* 13 pages, 5 figures, 13 tables 