Wangmeng Zuo

ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation

Feb 27, 2023
Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, Wangmeng Zuo

Despite their unprecedented ability in imaginary creation, large text-to-image models are further expected to express customized concepts. Existing works generally learn such concepts in an optimization-based manner, which brings an excessive computation or memory burden. In this paper, we instead propose a learning-based encoder for fast and accurate concept customization, which consists of global and local mapping networks. Specifically, the global mapping network separately projects the hierarchical features of a given image into multiple "new" words in the textual word embedding space, i.e., one primary word for the well-editable concept and other auxiliary words to exclude irrelevant disturbances (e.g., background). Meanwhile, a local mapping network injects the encoded patch features into cross-attention layers to provide omitted details, without sacrificing the editability of primary concepts. We compare our method with prior optimization-based approaches on a variety of user-defined concepts, and demonstrate that our method enables higher-fidelity inversion and more robust editability with a significantly faster encoding process. Our code will be publicly available at https://github.com/csyxwei/ELITE.

Making Substitute Models More Bayesian Can Enhance Transferability of Adversarial Examples

Feb 10, 2023
Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen

The transferability of adversarial examples across deep neural networks (DNNs) is the crux of many black-box attacks. Many prior efforts have been devoted to improving transferability by increasing the diversity of inputs to some substitute models. In this paper, by contrast, we opt for diversity in substitute models and advocate attacking a Bayesian model to achieve desirable transferability. Deriving from the Bayesian formulation, we develop a principled strategy for possible finetuning, which can be combined with many off-the-shelf Gaussian posterior approximations over DNN parameters. Extensive experiments on common benchmark datasets verify the effectiveness of our method, and the results demonstrate that it outperforms recent state-of-the-art methods by large margins (roughly 19% absolute increase in average attack success rate on ImageNet) and that combining it with these recent methods yields further performance gains. Our code: https://github.com/qizhangli/MoreBayesian-attack.

* Accepted by ICLR 2023
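
The idea of attacking a Bayesian substitute model can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a linear classifier with logistic loss, an isotropic Gaussian posterior around the trained weights, and a single FGSM-style sign step, and the names `input_gradient` and `bayesian_fgsm` are hypothetical. The input gradient is averaged over weights sampled from the posterior before the step is taken.

```python
import math
import random

def input_gradient(w, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * w.x)) with respect to x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = -y / (1.0 + math.exp(margin))
    return [scale * wi for wi in w]

def bayesian_fgsm(w_mean, sigma, x, y, eps, n_samples=20, seed=0):
    """One sign step on the input gradient averaged over sampled weights.

    Assumption: the posterior is N(w_mean, sigma^2 * I), a stand-in for
    whatever Gaussian posterior approximation is actually used.
    """
    rng = random.Random(seed)
    avg_grad = [0.0] * len(x)
    for _ in range(n_samples):
        # Draw one substitute model from the assumed Gaussian posterior.
        w = [m + sigma * rng.gauss(0.0, 1.0) for m in w_mean]
        g = input_gradient(w, x, y)
        avg_grad = [a + gi / n_samples for a, gi in zip(avg_grad, g)]
    # Move each coordinate by eps in the direction that increases the loss.
    return [xi + eps * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, avg_grad)]

x_adv = bayesian_fgsm([1.0, -2.0, 0.5], 0.1, [0.2, -0.1, 0.4], 1, 0.1)
```

The perturbation stays inside an L-infinity ball of radius `eps`; a practical attack would iterate this step and use a deep network rather than a linear model.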

Position-Aware Contrastive Alignment for Referring Image Segmentation

Dec 27, 2022
Bo Chen, Zhiwei Hu, Zhilong Ji, Jinfeng Bai, Wangmeng Zuo

Referring image segmentation aims to segment the target object described by a given natural language expression. Typically, referring expressions contain complex relationships between the target and its surrounding objects. The main challenge of this task is to understand the visual and linguistic content simultaneously and to find the referred object accurately among all instances in the image. Currently, the most effective way to solve the above problem is to obtain aligned multi-modal features by computing the correlation between visual and linguistic feature modalities under the supervision of the ground-truth mask. However, existing paradigms have difficulty in thoroughly understanding visual and linguistic content due to the inability to perceive information directly about surrounding objects that refer to the target. This prevents them from learning aligned multi-modal features, which leads to inaccurate segmentation. To address this issue, we present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features by guiding the interaction between vision and language through prior position information. Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment by comparing the features of the referred object with those of related objects. Extensive experiments on three benchmarks demonstrate our PCAN performs favorably against the state-of-the-art methods. Our code will be made publicly available.

* 12 pages, 6 figures

Benchmark Dataset and Effective Inter-Frame Alignment for Real-World Video Super-Resolution

Dec 10, 2022
Ruohao Wang, Xiaohui Liu, Zhilu Zhang, Xiaohe Wu, Chun-Mei Feng, Lei Zhang, Wangmeng Zuo

Video super-resolution (VSR), which aims to reconstruct a high-resolution (HR) video from its low-resolution (LR) counterpart, has made tremendous progress in recent years. However, it remains challenging to deploy existing VSR methods on real-world data with complex degradations. On the one hand, there are few well-aligned real-world VSR datasets, especially with large super-resolution scale factors, which limits the development of real-world VSR tasks. On the other hand, alignment algorithms in existing VSR methods perform poorly on real-world videos, leading to unsatisfactory results. As an attempt to address these issues, we build a real-world 4× VSR dataset, namely MVSR4×, where low- and high-resolution videos are captured with different focal length lenses of a smartphone, respectively. Moreover, we propose an effective alignment method for real-world VSR, namely EAVSR. EAVSR uses the proposed multi-layer adaptive spatial transform network (MultiAdaSTN) to refine the offsets provided by the pre-trained optical flow estimation network. Experimental results on the RealVSR and MVSR4× datasets show the effectiveness and practicality of our method, and we achieve state-of-the-art performance on the real-world VSR task. The dataset and code will be publicly available.

Learning Single Image Defocus Deblurring with Misaligned Training Pairs

Nov 29, 2022
Yu Li, Dongwei Ren, Xinya Shu, Wangmeng Zuo

By adopting the popular pixel-wise loss, existing methods for defocus deblurring rely heavily on well-aligned training image pairs. Although training pairs of ground-truth and blurry images are carefully collected, e.g., in the DPDD dataset, misalignment between training pairs is inevitable, so existing methods may suffer from deformation artifacts. In this paper, we propose a joint deblurring and reblurring learning (JDRL) framework for single image defocus deblurring with misaligned training pairs. Generally, JDRL consists of a deblurring module and a spatially invariant reblurring module, by which the deblurred result can be adaptively supervised by the ground-truth image to recover sharp textures while maintaining spatial consistency with the blurry image. First, in the deblurring module, a bi-directional optical flow-based deformation is introduced to tolerate spatial misalignment between the deblurred and ground-truth images. Second, in the reblurring module, the deblurred result is reblurred to be spatially aligned with the blurry image by predicting a set of isotropic blur kernels and weighting maps. Moreover, we establish a new single image defocus deblurring (SDD) dataset, further validating our JDRL and also benefiting future research. Our JDRL can be applied to boost defocus deblurring networks in terms of both quantitative metrics and visual quality on the DPDD, RealDOF and our SDD datasets.

* https://github.com/liyucs/JDRL
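
The reblurring step described above (a set of isotropic kernels combined through per-position weighting maps) can be sketched as follows. This is a simplified illustration, not the JDRL code: it works on a 1-D signal, uses fixed Gaussian kernels and hand-set weights where the paper predicts them with a network, and the helper names `gaussian_kernel`, `convolve` and `reblur` are hypothetical.

```python
import math

def gaussian_kernel(sigma, radius):
    """Symmetric, normalized 1-D Gaussian kernel; sigma == 0 gives the identity."""
    offsets = range(-radius, radius + 1)
    if sigma == 0:
        return [1.0 if i == 0 else 0.0 for i in offsets]
    vals = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in offsets]
    total = sum(vals)
    return [v / total for v in vals]

def convolve(signal, kernel):
    """Same-size convolution with edge replication at the borders."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for p in range(n):
        acc = 0.0
        for j, k in enumerate(kernel):
            q = min(max(p + j - radius, 0), n - 1)
            acc += k * signal[q]
        out.append(acc)
    return out

def reblur(signal, sigmas, weights, radius=3):
    """Blend blurred copies of the signal position by position.

    weights[k][p] is the weight of kernel k at position p; the weights at
    each position are assumed to sum to 1.
    """
    blurred = [convolve(signal, gaussian_kernel(s, radius)) for s in sigmas]
    return [
        sum(weights[k][p] * blurred[k][p] for k in range(len(sigmas)))
        for p in range(len(signal))
    ]

edge = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
# Sharp on the left half, blurred (sigma = 1) on the right half.
maps = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]
reblurred = reblur(edge, [0.0, 1.0], maps)
```

Because every kernel is normalized and the weights sum to 1 at each position, a constant signal passes through unchanged, which is a quick sanity check for this kind of blending.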

Texts as Images in Prompt Tuning for Multi-Label Image Recognition

Nov 23, 2022
Zixian Guo, Bowen Dong, Zhilong Ji, Jinfeng Bai, Yiwen Guo, Wangmeng Zuo

Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained models (e.g., CLIP) to various downstream tasks in data-limited or label-limited settings. Nonetheless, visual data (e.g., images) is by default a prerequisite for learning prompts in existing methods. In this work, we advocate that the effectiveness of image-text contrastive learning in aligning the two modalities (for training CLIP) further makes it feasible to treat texts as images for prompt tuning, and we introduce TaI prompting. In contrast to visual data, text descriptions are easy to collect, and their class labels can be directly derived. In particular, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning. Moreover, with TaI, double-grained prompt tuning (TaI-DPT) is further presented to extract both coarse-grained and fine-grained embeddings for enhancing multi-label recognition performance. Experimental results show that our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE, and that it can be combined with existing methods of prompting from images to improve recognition performance further. Code is released at https://github.com/guozix/TaI-DPT.
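
The remark that class labels can be directly derived from text can be made concrete with a small sketch. The exact-word matching rule and the function name `labels_from_caption` are assumptions made for illustration; the abstract does not specify how the authors derive labels.

```python
import re

def labels_from_caption(caption, class_names):
    """Return a multi-hot label vector: 1 if the class name occurs as a word.

    Assumption: class names are single lowercase words and only exact word
    matches count, so plurals or synonyms would need extra handling.
    """
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return [1 if name in words else 0 for name in class_names]

classes = ["cat", "dog", "car"]
target = labels_from_caption("A dog chases a cat on the grass.", classes)
```

Each caption then plays the role of a training image, with `target` as its multi-label annotation.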

Self-Supervised Image Restoration with Blurry and Noisy Pairs

Nov 14, 2022
Zhilu Zhang, Rongjian Xu, Ming Liu, Zifei Yan, Wangmeng Zuo

When taking photos in an environment with insufficient light, the exposure time and the sensor gain usually need to be carefully chosen to obtain images with satisfying visual quality. For example, images with high ISO usually have inescapable noise, while long-exposure ones may be blurry due to camera shake or object motion. Existing solutions generally seek a balance between noise and blur, and learn denoising or deblurring models under either full or self-supervision. However, real-world training pairs are difficult to collect, and self-supervised methods that rely merely on blurry or noisy images are limited in performance. In this work, we tackle this problem by jointly leveraging the short-exposure noisy image and the long-exposure blurry image for better image restoration. Such a setting is practically feasible because short-exposure and long-exposure images can be either acquired by two individual cameras or synthesized from a long burst of images. Moreover, the short-exposure images are hardly blurry, and the long-exposure ones have negligible noise. Their complementarity makes it feasible to learn a restoration model in a self-supervised manner. Specifically, the noisy images can be used as the supervision information for deblurring, while the sharp areas in the blurry images can be utilized as auxiliary supervision information for self-supervised denoising. By learning in a collaborative manner, the deblurring and denoising tasks in our method can benefit each other. Experiments on synthetic and real-world images show the effectiveness and practicality of the proposed method. Code is available at https://github.com/cszhilu1998/SelfIR.

* NeurIPS 2022

Reversed Image Signal Processing and RAW Reconstruction. AIM 2022 Challenge Report

Oct 20, 2022
Marcos V. Conde, Radu Timofte, Yibin Huang, Jingyang Peng, Chang Chen, Cheng Li, Eduardo Pérez-Pellitero, Fenglong Song, Furui Bai, Shuai Liu, Chaoyu Feng, Xiaotao Wang, Lei Lei, Yu Zhu, Chenghua Li, Yingying Jiang, Yong A, Peisong Wang, Cong Leng, Jian Cheng, Xiaoyu Liu, Zhicun Yin, Zhilu Zhang, Junyi Li, Ming Liu, Wangmeng Zuo, Jun Jiang, Jinha Kim, Yue Zhang, Beiji Zou, Zhikai Zong, Xiaoxiao Liu, Juan Marín Vega, Michael Sloth, Peter Schneider-Kamp, Richard Röttger, Furkan Kınlı, Barış Özcan, Furkan Kıraç, Li Leyi, SM Nadim Uddin, Dipon Kumar Ghosh, Yong Ju Jung

Cameras capture sensor RAW images and transform them into pleasant RGB images, suitable for the human eye, using their integrated Image Signal Processor (ISP). Numerous low-level vision tasks operate in the RAW domain (e.g., image denoising, white balance) due to its linear relationship with the scene irradiance, wide range of information at 12 bits, and sensor designs. Despite this, RAW image datasets are scarce and more expensive to collect than the already large and public RGB datasets. This paper introduces the AIM 2022 Challenge on Reversed Image Signal Processing and RAW Reconstruction. We aim to recover raw sensor images from the corresponding RGBs without metadata and, by doing this, "reverse" the ISP transformation. The proposed methods and benchmark establish the state of the art for this low-level vision inverse problem, and generating realistic raw sensor readings can potentially benefit other tasks such as denoising and super-resolution.

* ECCV 2022 Advances in Image Manipulation (AIM) workshop

Learning Dual Memory Dictionaries for Blind Face Restoration

Oct 15, 2022
Xiaoming Li, Shiguang Zhang, Shangchen Zhou, Lei Zhang, Wangmeng Zuo

To improve the performance of blind face restoration, recent works mainly treat the two aspects, i.e., generic and specific restoration, separately. In particular, generic restoration attempts to restore the results through a general facial structure prior, but on the one hand it cannot generalize to real-world degraded observations due to the limited capability of direct CNN mappings in learning blind restoration, and on the other hand it fails to exploit identity-specific details. In contrast, specific restoration aims to incorporate identity features from a reference of the same identity, where the requirement of a proper reference severely limits the application scenarios. Generally, it is a challenging and intractable task to improve the photo-realistic performance of blind restoration and adaptively handle the generic and specific restoration scenarios with a single unified model. Instead of implicitly learning the mapping from a low-quality image to its high-quality counterpart, this paper proposes DMDNet, which explicitly memorizes the generic and specific features through dual dictionaries. First, the generic dictionary learns the general facial priors from high-quality images of any identity, while the specific dictionary stores the identity-belonging features for each person individually. Second, to handle the degraded input with or without a specific reference, a dictionary transform module is proposed to read the relevant details from the dual dictionaries, which are subsequently fused into the input features. Finally, multi-scale dictionaries are leveraged to benefit the coarse-to-fine restoration. Moreover, a new high-quality dataset, termed CelebRef-HQ, is constructed to promote the exploration of specific face restoration in the high-resolution space.

* IEEE TPAMI 2022. Code and dataset: https://github.com/csxmli2016/DMDNet

ImaginaryNet: Learning Object Detectors without Real Images and Annotations

Oct 13, 2022
Minheng Ni, Zitong Huang, Kailai Feng, Wangmeng Zuo

Without needing to train in reality, humans can easily detect a known concept simply based on its language description. Empowering deep learning with this ability would enable neural networks to handle complex vision tasks, e.g., object detection, without collecting and annotating real images. To this end, this paper introduces a novel and challenging learning paradigm, Imaginary-Supervised Object Detection (ISOD), where neither real images nor manual annotations are allowed for training object detectors. To resolve this challenge, we propose ImaginaryNet, a framework that synthesizes images by combining a pretrained language model and a text-to-image synthesis model. Given a class label, the language model is used to generate a full description of a scene with a target object, and the text-to-image model is deployed to generate a photo-realistic image. With the synthesized images and class labels, weakly supervised object detection can then be leveraged to accomplish ISOD. By gradually introducing real images and manual annotations, ImaginaryNet can collaborate with other supervision settings to further boost detection performance. Experiments show that ImaginaryNet can (i) obtain about 70% performance in ISOD compared with the weakly supervised counterpart of the same backbone trained on real data, and (ii) significantly improve the baseline while achieving state-of-the-art or comparable performance when incorporated with other supervision settings.

* 12 pages, 6 figures
