Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yibing Song

DiffusionDet: Diffusion Model for Object Detection

Nov 17, 2022

Shoufa Chen, Peize Sun, Yibing Song, Ping Luo

Figure 1 for DiffusionDet: Diffusion Model for Object Detection

Figure 2 for DiffusionDet: Diffusion Model for Object Detection

Figure 3 for DiffusionDet: Diffusion Model for Object Detection

Figure 4 for DiffusionDet: Diffusion Model for Object Detection

Abstract:We propose DiffusionDet, a new framework that formulates object detection as a denoising diffusion process from noisy boxes to object boxes. During training stage, object boxes diffuse from ground-truth boxes to random distribution, and the model learns to reverse this noising process. In inference, the model refines a set of randomly generated boxes to the output results in a progressive way. The extensive evaluations on the standard benchmarks, including MS-COCO and LVIS, show that DiffusionDet achieves favorable performance compared to previous well-established detectors. Our work brings two important findings in object detection. First, random boxes, although drastically different from pre-defined anchors or learned queries, are also effective object candidates. Second, object detection, one of the representative perception tasks, can be solved by a generative way. Our code is available at https://github.com/ShoufaChen/DiffusionDet.

* Tech report. Code is available at https://github.com/ShoufaChen/DiffusionDet

Via

Access Paper or Ask Questions

One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations

Oct 17, 2022

Yiming Zhu, Hongyu Liu, Yibing Song, ziyang Yuan, Xintong Han, Chun Yuan, Qifeng Chen, Jue Wang

Figure 1 for One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations

Figure 2 for One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations

Figure 3 for One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations

Figure 4 for One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations

Abstract:Free-form text prompts allow users to describe their intentions during image manipulation conveniently. Based on the visual latent space of StyleGAN[21] and text embedding space of CLIP[34], studies focus on how to map these two latent spaces for text-driven attribute manipulations. Currently, the latent mapping between these two spaces is empirically designed and confines that each manipulation model can only handle one fixed text prompt. In this paper, we propose a method named Free-Form CLIP (FFCLIP), aiming to establish an automatic latent mapping so that one manipulation model handles free-form text prompts. Our FFCLIP has a cross-modality semantic modulation module containing semantic alignment and injection. The semantic alignment performs the automatic latent mapping via linear transformations with a cross attention mechanism. After alignment, we inject semantics from text prompt embeddings to the StyleGAN latent space. For one type of image (e.g., `human portrait'), one FFCLIP model can be learned to handle free-form text prompts. Meanwhile, we observe that although each training text prompt only contains a single semantic meaning, FFCLIP can leverage text prompts with multiple semantic meanings for image manipulation. In the experiments, we evaluate FFCLIP on three types of images (i.e., `human portraits', `cars', and `churches'). Both visual and numerical results show that FFCLIP effectively produces semantically accurate and visually realistic images. Project page: https://github.com/KumapowerLIU/FFCLIP.

* Accepted by NeurIPS 2022

Via

Access Paper or Ask Questions

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

May 26, 2022

Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo

Figure 1 for AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Figure 2 for AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Figure 3 for AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Figure 4 for AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Abstract:Although the pre-trained Vision Transformers (ViTs) achieved great success in computer vision, adapting a ViT to various image and video tasks is challenging because of its heavy computation and storage burdens, where each model needs to be independently and comprehensively fine-tuned to different tasks, limiting its transferability in different domains. To address this challenge, we propose an effective adaptation approach for Transformer, namely AdaptFormer, which can adapt the pre-trained ViTs into many different image and video tasks efficiently. It possesses several benefits more appealing than prior arts. Firstly, AdaptFormer introduces lightweight modules that only add less than 2% extra parameters to a ViT, while it is able to increase the ViT's transferability without updating its original pre-trained parameters, significantly outperforming the existing 100% fully fine-tuned models on action recognition benchmarks. Secondly, it can be plug-and-play in different Transformers and scalable to many visual tasks. Thirdly, extensive experiments on five image and video datasets show that AdaptFormer largely improves ViTs in the target domains. For example, when updating just 1.5% extra parameters, it achieves about 10% and 19% relative improvement compared to the fully fine-tuned models on Something-Something~v2 and HMDB51, respectively. Project page: http://www.shoufachen.com/adaptformer-page.

* Technical report. Project page: http://www.shoufachen.com/adaptformer-page/

Via

Access Paper or Ask Questions

Self-supervised Learning of Adversarial Example: Towards Good Generalizations for Deepfake Detection

Apr 01, 2022

Liang Chen, Yong Zhang, Yibing Song, Lingqiao Liu, Jue Wang

Figure 1 for Self-supervised Learning of Adversarial Example: Towards Good Generalizations for Deepfake Detection

Figure 2 for Self-supervised Learning of Adversarial Example: Towards Good Generalizations for Deepfake Detection

Figure 3 for Self-supervised Learning of Adversarial Example: Towards Good Generalizations for Deepfake Detection

Figure 4 for Self-supervised Learning of Adversarial Example: Towards Good Generalizations for Deepfake Detection

Abstract:Recent studies in deepfake detection have yielded promising results when the training and testing face forgeries are from the same dataset. However, the problem remains challenging when one tries to generalize the detector to forgeries created by unseen methods in the training dataset. This work addresses the generalizable deepfake detection from a simple principle: a generalizable representation should be sensitive to diverse types of forgeries. Following this principle, we propose to enrich the "diversity" of forgeries by synthesizing augmented forgeries with a pool of forgery configurations and strengthen the "sensitivity" to the forgeries by enforcing the model to predict the forgery configurations. To effectively explore the large forgery augmentation space, we further propose to use the adversarial training strategy to dynamically synthesize the most challenging forgeries to the current model. Through extensive experiments, we show that the proposed strategies are surprisingly effective (see Figure 1), and they could achieve superior performance than the current state-of-the-art methods. Code is available at \url{https://github.com/liangchen527/SLADD}.

* Accepted by CVPR 2022 (Oral presentation)

Via

Access Paper or Ask Questions

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Mar 23, 2022

Zhan Tong, Yibing Song, Jue Wang, Limin Wang

Figure 1 for VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Figure 2 for VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Figure 3 for VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Figure 4 for VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Abstract:Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking and reconstruction. These simple designs turn out to be effective for overcoming information leakage caused by the temporal correlation during video reconstruction. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. This is partially ascribed to the challenging task of video reconstruction to enforce high-level structure learning. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets are important issues in SSVP. Notably, our VideoMAE with the vanilla ViT backbone can achieve 83.9% on Kinects-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51 without using any extra data. Code will be released at https://github.com/MCG-NJU/VideoMAE.

* Technical report

Via

Access Paper or Ask Questions

DynaMixer: A Vision MLP Architecture with Dynamic Mixing

Feb 16, 2022

Ziyu Wang, Wenhao Jiang, Yiming Zhu, Li Yuan, Yibing Song, Wei Liu

Figure 1 for DynaMixer: A Vision MLP Architecture with Dynamic Mixing

Figure 2 for DynaMixer: A Vision MLP Architecture with Dynamic Mixing

Figure 3 for DynaMixer: A Vision MLP Architecture with Dynamic Mixing

Figure 4 for DynaMixer: A Vision MLP Architecture with Dynamic Mixing

Abstract:Recently, MLP-like vision models have achieved promising performances on mainstream visual recognition tasks. In contrast with vision transformers and CNNs, the success of MLP-like models shows that simple information fusion operations among tokens and channels can yield a good representation power for deep recognition models. However, existing MLP-like models fuse tokens through static fusion operations, lacking adaptability to the contents of the tokens to be mixed. Thus, customary information fusion procedures are not effective enough. To this end, this paper presents an efficient MLP-like network architecture, dubbed DynaMixer, resorting to dynamic information fusion. Critically, we propose a procedure, on which the DynaMixer model relies, to dynamically generate mixing matrices by leveraging the contents of all the tokens to be mixed. To reduce the time complexity and improve the robustness, a dimensionality reduction technique and a multi-segment fusion mechanism are adopted. Our proposed DynaMixer model (97M parameters) achieves 84.3\% top-1 accuracy on the ImageNet-1K dataset without extra training data, performing favorably against the state-of-the-art vision MLP models. When the number of parameters is reduced to 26M, it still achieves 82.7\% top-1 accuracy, surpassing the existing MLP-like models with a similar capacity. The implementation of DynaMixer will be made available to the public.

Via

Access Paper or Ask Questions

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Feb 16, 2022

Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao Xie

Figure 1 for Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Figure 2 for Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Figure 3 for Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Figure 4 for Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Abstract:Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Complete leverage of these image tokens brings redundant computations since not all the tokens are attentive in MHSA. Examples include that tokens containing semantically meaningless or distractive image backgrounds do not positively contribute to the ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training. For each forward inference, we identify the attentive image tokens between MHSA and FFN (i.e., feed-forward network) modules, which is guided by the corresponding class token attention. Then, we reorganize image tokens by preserving attentive image tokens and fusing inattentive ones to expedite subsequent MHSA and FFN computations. To this end, our method EViT improves ViTs from two perspectives. First, under the same amount of input image tokens, our method reduces MHSA and FFN computation for efficient inference. For instance, the inference speed of DeiT-S is increased by 50% while its recognition accuracy is decreased by only 0.3% for ImageNet classification. Second, by maintaining the same computational cost, our method empowers ViTs to take more image tokens as input for recognition accuracy improvement, where the image tokens are from higher resolution images. An example is that we improve the recognition accuracy of DeiT-S by 1% for ImageNet classification at the same computational cost of a vanilla DeiT-S. Meanwhile, our method does not introduce more parameters to ViTs. Experiments on the standard benchmarks show the effectiveness of our method. The code is available at https://github.com/youweiliang/evit

* ICLR 2022 Spotlight

Via

Access Paper or Ask Questions

MetaDance: Few-shot Dancing Video Retargeting via Temporal-aware Meta-learning

Jan 13, 2022

Yuying Ge, Yibing Song, Ruimao Zhang, Ping Luo

Figure 1 for MetaDance: Few-shot Dancing Video Retargeting via Temporal-aware Meta-learning

Figure 2 for MetaDance: Few-shot Dancing Video Retargeting via Temporal-aware Meta-learning

Figure 3 for MetaDance: Few-shot Dancing Video Retargeting via Temporal-aware Meta-learning

Figure 4 for MetaDance: Few-shot Dancing Video Retargeting via Temporal-aware Meta-learning

Abstract:Dancing video retargeting aims to synthesize a video that transfers the dance movements from a source video to a target person. Previous work need collect a several-minute-long video of a target person with thousands of frames to train a personalized model. However, the trained model can only generate videos of the same person. To address the limitations, recent work tackled few-shot dancing video retargeting, which learns to synthesize videos of unseen persons by leveraging a few frames of them. In practice, given a few frames of a person, these work simply regarded them as a batch of individual images without temporal correlations, thus generating temporally incoherent dancing videos of low visual quality. In this work, we model a few frames of a person as a series of dancing moves, where each move contains two consecutive frames, to extract the appearance patterns and the temporal dynamics of this person. We propose MetaDance, which utilizes temporal-aware meta-learning to optimize the initialization of a model through the synthesis of dancing moves, such that the meta-trained model can be efficiently tuned towards enhanced visual quality and strengthened temporal stability for unseen persons with a few frames. Extensive evaluations show large superiority of our method.

Via

Access Paper or Ask Questions

Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning

Oct 11, 2021

Chongjian Ge, Youwei Liang, Yibing Song, Jianbo Jiao, Jue Wang, Ping Luo

Figure 1 for Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning

Figure 2 for Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning

Figure 3 for Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning

Figure 4 for Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning

Abstract:Studies on self-supervised visual representation learning (SSL) improve encoder backbones to discriminate training samples without labels. While CNN encoders via SSL achieve comparable recognition performance to those via supervised learning, their network attention is under-explored for further improvement. Motivated by the transformers that explore visual attention effectively in recognition scenarios, we propose a CNN Attention REvitalization (CARE) framework to train attentive CNN encoders guided by transformers in SSL. The proposed CARE framework consists of a CNN stream (C-stream) and a transformer stream (T-stream), where each stream contains two branches. C-stream follows an existing SSL framework with two CNN encoders, two projectors, and a predictor. T-stream contains two transformers, two projectors, and a predictor. T-stream connects to CNN encoders and is in parallel to the remaining C-Stream. During training, we perform SSL in both streams simultaneously and use the T-stream output to supervise C-stream. The features from CNN encoders are modulated in T-stream for visual attention enhancement and become suitable for the SSL scenario. We use these modulated features to supervise C-stream for learning attentive CNN encoders. To this end, we revitalize CNN attention by using transformers as guidance. Experiments on several standard visual recognition benchmarks, including image classification, object detection, and semantic segmentation, show that the proposed CARE framework improves CNN encoder backbones to the state-of-the-art performance.

* Accepted by NeurIPS 2021

Via

Access Paper or Ask Questions

PD-GAN: Probabilistic Diverse GAN for Image Inpainting

May 05, 2021

Hongyu Liu, Ziyu Wan, Wei Huang, Yibing Song, Xintong Han, Jing Liao

Figure 1 for PD-GAN: Probabilistic Diverse GAN for Image Inpainting

Figure 2 for PD-GAN: Probabilistic Diverse GAN for Image Inpainting

Figure 3 for PD-GAN: Probabilistic Diverse GAN for Image Inpainting

Figure 4 for PD-GAN: Probabilistic Diverse GAN for Image Inpainting

Abstract:We propose PD-GAN, a probabilistic diverse GAN for image inpainting. Given an input image with arbitrary hole regions, PD-GAN produces multiple inpainting results with diverse and visually realistic content. Our PD-GAN is built upon a vanilla GAN which generates images based on random noise. During image generation, we modulate deep features of input random noise from coarse-to-fine by injecting an initially restored image and the hole regions in multiple scales. We argue that during hole filling, the pixels near the hole boundary should be more deterministic (i.e., with higher probability trusting the context and initially restored image to create natural inpainting boundary), while those pixels lie in the center of the hole should enjoy more degrees of freedom (i.e., more likely to depend on the random noise for enhancing diversity). To this end, we propose spatially probabilistic diversity normalization (SPDNorm) inside the modulation to model the probability of generating a pixel conditioned on the context information. SPDNorm dynamically balances the realism and diversity inside the hole region, making the generated content more diverse towards the hole center and resemble neighboring image content more towards the hole boundary. Meanwhile, we propose a perceptual diversity loss to further empower PD-GAN for diverse content generation. Experiments on benchmark datasets including CelebA-HQ, Places2 and Paris Street View indicate that PD-GAN is effective for diverse and visually realistic image restoration.

* Accepted by CVPR 2021

Via

Access Paper or Ask Questions