Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wangmeng Zuo

Self-supervised Learning to Bring Dual Reversed Rolling Shutter Images Alive

May 31, 2023

Wei Shang, Dongwei Ren, Chaoyu Feng, Xiaotao Wang, Lei Lei, Wangmeng Zuo

Figure 1 for Self-supervised Learning to Bring Dual Reversed Rolling Shutter Images Alive

Figure 2 for Self-supervised Learning to Bring Dual Reversed Rolling Shutter Images Alive

Figure 3 for Self-supervised Learning to Bring Dual Reversed Rolling Shutter Images Alive

Figure 4 for Self-supervised Learning to Bring Dual Reversed Rolling Shutter Images Alive

Abstract:Modern consumer cameras usually employ the rolling shutter (RS) mechanism, where images are captured by scanning scenes row-by-row, yielding RS distortions for dynamic scenes. To correct RS distortions, existing methods adopt a fully supervised learning manner, where high framerate global shutter (GS) images should be collected as ground-truth supervision. In this paper, we propose a Self-supervised learning framework for Dual reversed RS distortions Correction (SelfDRSC), where a DRSC network can be learned to generate a high framerate GS video only based on dual RS images with reversed distortions. In particular, a bidirectional distortion warping module is proposed for reconstructing dual reversed RS images, and then a self-supervised loss can be deployed to train DRSC network by enhancing the cycle consistency between input and reconstructed dual reversed RS images. Besides start and end RS scanning time, GS images at arbitrary intermediate scanning time can also be supervised in SelfDRSC, thus enabling the learned DRSC network to generate a high framerate GS video. Moreover, a simple yet effective self-distillation strategy is introduced in self-supervised loss for mitigating boundary artifacts in generated GS images. On synthetic dataset, SelfDRSC achieves better or comparable quantitative metrics in comparison to state-of-the-art methods trained in the full supervision manner. On real-world RS cases, our SelfDRSC can produce high framerate GS videos with finer correction textures and better temporary consistency. The source code and trained models are made publicly available at https://github.com/shangwei5/SelfDRSC.

* 16 pages, 16 figures, available at https://github.com/shangwei5/SelfDRSC

Via

Access Paper or Ask Questions

Inferring and Leveraging Parts from Object Shape for Improving Semantic Image Synthesis

May 31, 2023

Yuxiang Wei, Zhilong Ji, Xiaohe Wu, Jinfeng Bai, Lei Zhang, Wangmeng Zuo

Figure 1 for Inferring and Leveraging Parts from Object Shape for Improving Semantic Image Synthesis

Figure 2 for Inferring and Leveraging Parts from Object Shape for Improving Semantic Image Synthesis

Figure 3 for Inferring and Leveraging Parts from Object Shape for Improving Semantic Image Synthesis

Figure 4 for Inferring and Leveraging Parts from Object Shape for Improving Semantic Image Synthesis

Abstract:Despite the progress in semantic image synthesis, it remains a challenging problem to generate photo-realistic parts from input semantic map. Integrating part segmentation map can undoubtedly benefit image synthesis, but is bothersome and inconvenient to be provided by users. To improve part synthesis, this paper presents to infer Parts from Object ShapE (iPOSE) and leverage it for improving semantic image synthesis. However, albeit several part segmentation datasets are available, part annotations are still not provided for many object categories in semantic image synthesis. To circumvent it, we resort to few-shot regime to learn a PartNet for predicting the object part map with the guidance of pre-defined support part maps. PartNet can be readily generalized to handle a new object category when a small number (e.g., 3) of support part maps for this category are provided. Furthermore, part semantic modulation is presented to incorporate both inferred part map and semantic map for image synthesis. Experiments show that our iPOSE not only generates objects with rich part details, but also enables to control the image synthesis flexibly. And our iPOSE performs favorably against the state-of-the-art methods in terms of quantitative and qualitative evaluation. Our code will be publicly available at https://github.com/csyxwei/iPOSE.

* CVPR 2023. Code will be released at https://github.com/csyxwei/iPOSE

Via

Access Paper or Ask Questions

ControlVideo: Training-free Controllable Text-to-Video Generation

May 22, 2023

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, Qi Tian

Abstract:Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterpart still lags behind due to the excessive training cost of temporal modeling. Besides the training burden, the generated videos also suffer from appearance inconsistency and structural flickers, especially in long video synthesis. To address these challenges, we design a \emph{training-free} framework called \textbf{ControlVideo} to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages coarsely structural consistency from input motion sequences, and introduces three modules to improve video generation. Firstly, to ensure appearance coherence between frames, ControlVideo adds fully cross-frame interaction in self-attention modules. Secondly, to mitigate the flicker effect, it introduces an interleaved-frame smoother that employs frame interpolation on alternated frames. Finally, to produce long videos efficiently, it utilizes a hierarchical sampler that separately synthesizes each short clip with holistic coherency. Empowered with these modules, ControlVideo outperforms the state-of-the-arts on extensive motion-prompt pairs quantitatively and qualitatively. Notably, thanks to the efficient designs, it generates both short and long videos within several minutes using one NVIDIA 2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo.

* Code is available at https://github.com/YBYBZhang/ControlVideo

Via

Access Paper or Ask Questions

Improving Adversarial Transferability via Intermediate-level Perturbation Decay

May 09, 2023

Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen

Abstract:Intermediate-level attacks that attempt to perturb feature representations following an adversarial direction drastically have shown favorable performance in crafting transferable adversarial examples. Existing methods in this category are normally formulated with two separate stages, where a directional guide is required to be determined at first and the scalar projection of the intermediate-level perturbation onto the directional guide is enlarged thereafter. The obtained perturbation deviates from the guide inevitably in the feature space, and it is revealed in this paper that such a deviation may lead to sub-optimal attack. To address this issue, we develop a novel intermediate-level method that crafts adversarial examples within a single stage of optimization. In particular, the proposed method, named intermediate-level perturbation decay (ILPD), encourages the intermediate-level perturbation to be in an effective adversarial direction and to possess a great magnitude simultaneously. In-depth discussion verifies the effectiveness of our method. Experimental results show that it outperforms state-of-the-arts by large margins in attacking various victim models on ImageNet (+10.07% on average) and CIFAR-10 (+3.88% on average). Our code is at https://github.com/qizhangli/ILPD-attack.

* Revision of ICML '23 submission for better clarity

Via

Access Paper or Ask Questions

Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation

Apr 03, 2023

Yabo Zhang, Zihao Wang, Jun Hao Liew, Jingjia Huang, Manyu Zhu, Jiashi Feng, Wangmeng Zuo

Figure 1 for Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation

Figure 2 for Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation

Figure 3 for Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation

Figure 4 for Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation

Abstract:In this work, we investigate performing semantic segmentation solely through the training on image-sentence pairs. Due to the lack of dense annotations, existing text-supervised methods can only learn to group an image into semantic regions via pixel-insensitive feedback. As a result, their grouped results are coarse and often contain small spurious regions, limiting the upper-bound performance of segmentation. On the other hand, we observe that grouped results from self-supervised models are more semantically consistent and break the bottleneck of existing methods. Motivated by this, we introduce associate self-supervised spatially-consistent grouping with text-supervised semantic segmentation. Considering the part-like grouped results, we further adapt a text-supervised model from image-level to region-level recognition with two core designs. First, we encourage fine-grained alignment with a one-way noun-to-region contrastive loss, which reduces the mismatched noun-region pairs. Second, we adopt a contextually aware masking strategy to enable simultaneous recognition of all grouped regions. Coupled with spatially-consistent grouping and region-adapted recognition, our method achieves 59.2% mIoU and 32.4% mIoU on Pascal VOC and Pascal Context benchmarks, significantly surpassing the state-of-the-art methods.

Via

Access Paper or Ask Questions

Learning Federated Visual Prompt in Null Space for MRI Reconstruction

Mar 30, 2023

Chun-Mei Feng, Bangjun Li, Xinxing Xu, Yong Liu, Huazhu Fu, Wangmeng Zuo

Abstract:Federated Magnetic Resonance Imaging (MRI) reconstruction enables multiple hospitals to collaborate distributedly without aggregating local data, thereby protecting patient privacy. However, the data heterogeneity caused by different MRI protocols, insufficient local training data, and limited communication bandwidth inevitably impair global model convergence and updating. In this paper, we propose a new algorithm, FedPR, to learn federated visual prompts in the null space of global prompt for MRI reconstruction. FedPR is a new federated paradigm that adopts a powerful pre-trained model while only learning and communicating the prompts with few learnable parameters, thereby significantly reducing communication costs and achieving competitive performance on limited local data. Moreover, to deal with catastrophic forgetting caused by data heterogeneity, FedPR also updates efficient federated visual prompts that project the local prompts into an approximate null space of the global prompt, thereby suppressing the interference of gradients on the server performance. Extensive experiments on federated MRI show that FedPR significantly outperforms state-of-the-art FL algorithms with <6% of communication costs when given the limited amount of local training data.

* Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
* 8 pages, Proceedings of the IEEE/CVF International Conference on Computer Vision

Via

Access Paper or Ask Questions

Joint Video Multi-Frame Interpolation and Deblurring under Unknown Exposure Time

Mar 27, 2023

Wei Shang, Dongwei Ren, Yi Yang, Hongzhi Zhang, Kede Ma, Wangmeng Zuo

Figure 1 for Joint Video Multi-Frame Interpolation and Deblurring under Unknown Exposure Time

Figure 2 for Joint Video Multi-Frame Interpolation and Deblurring under Unknown Exposure Time

Figure 3 for Joint Video Multi-Frame Interpolation and Deblurring under Unknown Exposure Time

Figure 4 for Joint Video Multi-Frame Interpolation and Deblurring under Unknown Exposure Time

Abstract:Natural videos captured by consumer cameras often suffer from low framerate and motion blur due to the combination of dynamic scene complexity, lens and sensor imperfection, and less than ideal exposure setting. As a result, computational methods that jointly perform video frame interpolation and deblurring begin to emerge with the unrealistic assumption that the exposure time is known and fixed. In this work, we aim ambitiously for a more realistic and challenging task - joint video multi-frame interpolation and deblurring under unknown exposure time. Toward this goal, we first adopt a variant of supervised contrastive learning to construct an exposure-aware representation from input blurred frames. We then train two U-Nets for intra-motion and inter-motion analysis, respectively, adapting to the learned exposure representation via gain tuning. We finally build our video reconstruction network upon the exposure and motion representation by progressive exposure-adaptive convolution and motion refinement. Extensive experiments on both simulated and real-world datasets show that our optimized method achieves notable performance gains over the state-of-the-art on the joint video x8 interpolation and deblurring task. Moreover, on the seemingly implausible x16 interpolation task, our method outperforms existing methods by more than 1.5 dB in terms of PSNR.

* Accepted by CVPR 2023, available at https://github.com/shangwei5/VIDUE

Via

Access Paper or Ask Questions

Spatially Adaptive Self-Supervised Learning for Real-World Image Denoising

Mar 27, 2023

Junyi Li, Zhilu Zhang, Xiaoyu Liu, Chaoyu Feng, Xiaotao Wang, Lei Lei, Wangmeng Zuo

Abstract:Significant progress has been made in self-supervised image denoising (SSID) in the recent few years. However, most methods focus on dealing with spatially independent noise, and they have little practicality on real-world sRGB images with spatially correlated noise. Although pixel-shuffle downsampling has been suggested for breaking the noise correlation, it breaks the original information of images, which limits the denoising performance. In this paper, we propose a novel perspective to solve this problem, i.e., seeking for spatially adaptive supervision for real-world sRGB image denoising. Specifically, we take into account the respective characteristics of flat and textured regions in noisy images, and construct supervisions for them separately. For flat areas, the supervision can be safely derived from non-adjacent pixels, which are much far from the current pixel for excluding the influence of the noise-correlated ones. And we extend the blind-spot network to a blind-neighborhood network (BNN) for providing supervision on flat areas. For textured regions, the supervision has to be closely related to the content of adjacent pixels. And we present a locally aware network (LAN) to meet the requirement, while LAN itself is selectively supervised with the output of BNN. Combining these two supervisions, a denoising network (e.g., U-Net) can be well-trained. Extensive experiments show that our method performs favorably against state-of-the-art SSID methods on real-world sRGB photographs. The code is available at https://github.com/nagejacob/SpatiallyAdaptiveSSID.

* CVPR 2023 Camera Ready

Via

Access Paper or Ask Questions

Learning Generative Structure Prior for Blind Text Image Super-resolution

Mar 26, 2023

Xiaoming Li, Wangmeng Zuo, Chen Change Loy

Abstract:Blind text image super-resolution (SR) is challenging as one needs to cope with diverse font styles and unknown degradation. To address the problem, existing methods perform character recognition in parallel to regularize the SR task, either through a loss constraint or intermediate feature condition. Nonetheless, the high-level prior could still fail when encountering severe degradation. The problem is further compounded given characters of complex structures, e.g., Chinese characters that combine multiple pictographic or ideographic symbols into a single character. In this work, we present a novel prior that focuses more on the character structure. In particular, we learn to encapsulate rich and diverse structures in a StyleGAN and exploit such generative structure priors for restoration. To restrict the generative space of StyleGAN so that it obeys the structure of characters yet remains flexible in handling different font styles, we store the discrete features for each character in a codebook. The code subsequently drives the StyleGAN to generate high-resolution structural details to aid text SR. Compared to priors based on character recognition, the proposed structure prior exerts stronger character-specific guidance to restore faithful and precise strokes of a designated character. Extensive experiments on synthetic and real datasets demonstrate the compelling performance of the proposed generative structure prior in facilitating robust text SR.

* CVPR 2023. Code: https://github.com/csxmli2016/MARCONet

Via

Access Paper or Ask Questions

Towards Universal Vision-language Omni-supervised Segmentation

Mar 12, 2023

Bowen Dong, Jiaxi Gu, Jianhua Han, Hang Xu, Wangmeng Zuo

Abstract:Existing open-world universal segmentation approaches usually leverage CLIP and pre-computed proposal masks to treat open-world segmentation tasks as proposal classification. However, 1) these works cannot handle universal segmentation in an end-to-end manner, and 2) the limited scale of panoptic datasets restricts the open-world segmentation ability on things classes. In this paper, we present Vision-Language Omni-Supervised Segmentation (VLOSS). VLOSS starts from a Mask2Former universal segmentation framework with CLIP text encoder. To improve the open-world segmentation ability, we leverage omni-supervised data (i.e., panoptic segmentation data, object detection data, and image-text pairs data) into training, thus enriching the open-world segmentation ability and achieving better segmentation accuracy. To better improve the training efficiency and fully release the power of omni-supervised data, we propose several advanced techniques, i.e., FPN-style encoder, switchable training technique, and positive classification loss. Benefiting from the end-to-end training manner with proposed techniques, VLOSS can be applied to various open-world segmentation tasks without further adaptation. Experimental results on different open-world panoptic and instance segmentation benchmarks demonstrate the effectiveness of VLOSS. Notably, with fewer parameters, our VLOSS with Swin-Tiny backbone surpasses MaskCLIP by ~2% in terms of mask AP on LVIS v1 dataset.

Via

Access Paper or Ask Questions