Jiantao Zhou

HAT: Hybrid Attention Transformer for Image Restoration

Sep 11, 2023
Xiangyu Chen, Xintao Wang, Wenlong Zhang, Xiangtao Kong, Yu Qiao, Jiantao Zhou, Chao Dong

Transformer-based methods have shown impressive performance in image restoration tasks such as image super-resolution and denoising. However, through attribution analysis we find that these networks can only utilize a limited spatial range of the input information. This implies that the potential of the Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better restoration, we propose a new Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model. Extensive experiments have demonstrated the effectiveness of the proposed modules. We further scale up the model to show that the performance of the SR task can be greatly improved. Besides, we extend HAT to more image restoration applications, including real-world image super-resolution, Gaussian image denoising and image compression artifacts reduction. Experiments on benchmark and real-world datasets demonstrate that our HAT achieves state-of-the-art performance both quantitatively and qualitatively. Codes and models are publicly available at https://github.com/XPixelGroup/HAT.

* Extended version of HAT 
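
To make the hybrid-attention idea concrete, below is a minimal PyTorch sketch (not the released HAT code) of a block that combines squeeze-and-excitation style channel attention with window-based self-attention in one residual update. The window size, head count, CAB weighting and the way the two branches are summed are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ChannelAttentionBlock(nn.Module):
    """Squeeze-and-excitation style channel attention over feature maps."""
    def __init__(self, channels: int, squeeze: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global spatial statistics
            nn.Conv2d(channels, channels // squeeze, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // squeeze, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                                 # x: (B, C, H, W)
        return x * self.gate(x)                           # per-channel re-weighting


class HybridAttentionBlock(nn.Module):
    """Window self-attention and channel attention fused in one residual update."""
    def __init__(self, channels: int, window: int = 8, heads: int = 4, cab_weight: float = 0.01):
        super().__init__()
        self.window = window
        self.cab_weight = cab_weight                      # small CAB weight (an assumption)
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.cab = ChannelAttentionBlock(channels)

    def forward(self, x):                                 # x: (B, C, H, W), H and W divisible by window
        b, c, h, w = x.shape
        ws = self.window
        # partition into non-overlapping ws x ws windows -> (B * num_windows, ws*ws, C)
        win = x.reshape(b, c, h // ws, ws, w // ws, ws)
        win = win.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        q = self.norm(win)
        sa, _ = self.attn(q, q, q)                        # self-attention inside each window
        # reverse the window partition back to (B, C, H, W)
        sa = sa.reshape(b, h // ws, w // ws, ws, ws, c)
        sa = sa.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        return x + sa + self.cab_weight * self.cab(x)     # hybrid residual update


feat = torch.randn(1, 64, 32, 32)
print(HybridAttentionBlock(64)(feat).shape)               # torch.Size([1, 64, 32, 32])
```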

CNN Injected Transformer for Image Exposure Correction

Sep 08, 2023
Shuning Xu, Xiangyu Chen, Binbin Song, Jiantao Zhou

Capturing images with incorrect exposure settings fails to deliver a satisfactory visual experience; only when the exposure is properly set can the color and details of an image be appropriately preserved. Previous convolution-based exposure correction methods often produce exposure deviations because the restricted receptive field of convolutional kernels cannot accurately capture long-range dependencies in images. To overcome this challenge, we apply the Transformer to the exposure correction problem, leveraging its capability of modeling long-range dependencies to capture a global representation. However, solely relying on the window-based Transformer leads to visually disturbing blocking artifacts due to the application of self-attention in small patches. In this paper, we propose a CNN Injected Transformer (CIT) to harness the individual strengths of the CNN and the Transformer simultaneously. Specifically, we construct the CIT by utilizing a window-based Transformer to exploit the long-range interactions among different regions in the entire image. Within each CIT block, we incorporate a channel attention block (CAB) and a half-instance normalization block (HINB) to assist the window-based self-attention in acquiring global statistics and refining local features. In addition to the hybrid architecture design for exposure correction, we apply a set of carefully formulated loss functions to improve spatial coherence and rectify potential color deviations. Extensive experiments demonstrate that our image exposure correction method outperforms state-of-the-art approaches in terms of both quantitative and qualitative metrics.
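
As an illustration of the two CNN components named above, the PyTorch sketch below implements a plausible channel attention block (CAB) and a HINet-style half-instance normalization block (HINB). It is not the authors' implementation; the layer widths and the way these blocks would be wired into a window-Transformer block are assumptions.

```python
import torch
import torch.nn as nn


class HalfInstanceNormBlock(nn.Module):
    """HINet-style block: instance-normalize only half of the channels."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm = nn.InstanceNorm2d(channels // 2, affine=True)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):                                  # x: (B, C, H, W)
        y = self.conv(x)
        y1, y2 = torch.chunk(y, 2, dim=1)                  # split channels into two halves
        y = torch.cat([self.norm(y1), y2], dim=1)          # normalize one half only
        return self.act(y) + x                             # residual connection


class ChannelAttentionBlock(nn.Module):
    """SE-style channel gating from globally pooled statistics."""
    def __init__(self, channels: int, squeeze: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // squeeze, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // squeeze, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)


feat = torch.randn(2, 32, 64, 64)
out = ChannelAttentionBlock(32)(HalfInstanceNormBlock(32)(feat))
print(out.shape)                                           # torch.Size([2, 32, 64, 64])
```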

Towards Efficient SDRTV-to-HDRTV by Learning from Image Formation

Sep 08, 2023
Xiangyu Chen, Zheyuan Li, Zhengwen Zhang, Jimmy S. Ren, Yihao Liu, Jingwen He, Yu Qiao, Jiantao Zhou, Chao Dong

Modern displays are capable of rendering video content with high dynamic range (HDR) and wide color gamut (WCG). However, the majority of available resources are still in standard dynamic range (SDR). As a result, there is significant value in transforming existing SDR content into the HDRTV standard. In this paper, we define and analyze the SDRTV-to-HDRTV task by modeling the formation of SDRTV/HDRTV content. Our analysis and observations indicate that a naive end-to-end supervised training pipeline suffers from severe gamut transition errors. To address this issue, we propose a novel three-step solution pipeline called HDRTVNet++, which includes adaptive global color mapping, local enhancement, and highlight refinement. The adaptive global color mapping step uses global statistics as guidance to perform image-adaptive color mapping. A local enhancement network is then deployed to enhance local details. Finally, we combine the two sub-networks above as a generator and achieve highlight consistency through GAN-based joint training. Our method is primarily designed for ultra-high-definition TV content and is therefore effective and lightweight for processing 4K resolution images. We also construct a dataset using HDR videos in the HDR10 standard, named HDRTV1K, which contains 1235 training images and 117 testing images, all in 4K resolution. In addition, we select five metrics to evaluate the results of SDRTV-to-HDRTV algorithms. Our final results demonstrate state-of-the-art performance both quantitatively and visually. The code, model and dataset are available at https://github.com/xiaom233/HDRTVNet-plus.

* Extended version of HDRTVNet 
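
A minimal sketch of the adaptive-global-color-mapping idea, assuming a pointwise (1x1) mapping network modulated by globally pooled statistics; the real HDRTVNet++ components and hyper-parameters likely differ.

```python
import torch
import torch.nn as nn


class AdaptiveGlobalColorMapping(nn.Module):
    """Pointwise color mapping whose behaviour is modulated by global statistics."""
    def __init__(self, width: int = 32):
        super().__init__()
        # condition branch: global statistics -> per-channel scale and shift
        self.condition = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(3, width, 1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 2 * width, 1))
        # mapping branch: 1x1 convolutions only, so each pixel is mapped independently
        self.map_in = nn.Conv2d(3, width, 1)
        self.map_out = nn.Conv2d(width, 3, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, sdr):                                # sdr: (B, 3, H, W) in [0, 1]
        scale, shift = torch.chunk(self.condition(sdr), 2, dim=1)
        feat = self.act(self.map_in(sdr) * (1 + scale) + shift)   # global modulation
        return self.map_out(feat)                          # HDR-domain prediction


x = torch.rand(1, 3, 256, 256)
print(AdaptiveGlobalColorMapping()(x).shape)               # torch.Size([1, 3, 256, 256])
```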

Direction-aware Video Demoireing with Temporal-guided Bilateral Learning

Aug 25, 2023
Shuning Xu, Binbin Song, Xiangyu Chen, Jiantao Zhou

Moire patterns occur when capturing images or videos of screens, severely degrading the quality of the captured content. Despite recent progress, existing video demoireing methods neglect the physical characteristics and formation process of moire patterns, significantly limiting the effectiveness of video recovery. This paper presents a unified framework, DTNet, a direction-aware and temporal-guided bilateral learning network for video demoireing. DTNet effectively integrates moire pattern removal, alignment, color correction, and detail refinement. Our proposed DTNet comprises two primary stages: Frame-level Direction-aware Demoireing and Alignment (FDDA) and Tone and Detail Refinement (TDR). In FDDA, we employ multiple directional DCT modes to perform moire pattern removal in the frequency domain, effectively detecting the prominent moire edges. Then, coarse- and fine-grained alignment is applied to the demoired features to facilitate the use of neighboring information. In TDR, we propose a temporal-guided bilateral learning pipeline to mitigate the degradation of color and details caused by the moire patterns while preserving the frequency information restored in FDDA. Guided by the aligned temporal features from FDDA, the affine transformations for recovering the final clean frames are learned in TDR. Extensive experiments demonstrate that our video demoireing method outperforms state-of-the-art approaches by 2.3 dB in PSNR and delivers a superior visual experience.
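
The sketch below illustrates, under simplifying assumptions, why a directional view of the block DCT spectrum is useful for moire: strong moire tends to concentrate energy along particular directions. The three-way band grouping and block size here are arbitrary choices for illustration, not the paper's directional DCT modes.

```python
import numpy as np
from scipy.fft import dctn


def block_dct_direction_energy(gray: np.ndarray, block: int = 8) -> np.ndarray:
    """Per-block DCT energy split into horizontal / vertical / diagonal bands."""
    h, w = gray.shape
    h, w = h - h % block, w - w % block                    # crop to a multiple of the block size
    u, v = np.meshgrid(np.arange(block), np.arange(block), indexing="ij")
    energies = []
    for y in range(0, h, block):
        for x in range(0, w, block):
            coeff = dctn(gray[y:y + block, x:x + block], norm="ortho")
            coeff[0, 0] = 0.0                              # drop the DC term
            horiz = np.sum(coeff[u < v] ** 2)              # energy above the main diagonal
            vert = np.sum(coeff[u > v] ** 2)               # energy below the main diagonal
            diag = np.sum(coeff[u == v] ** 2)              # energy on the main diagonal
            energies.append((horiz, vert, diag))
    return np.array(energies)


frame = np.random.rand(64, 64)                             # stand-in for one gray video frame
print(block_dct_direction_energy(frame).shape)             # (64, 3) for 8x8 blocks
```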

Rethinking Image Forgery Detection via Contrastive Learning and Unsupervised Clustering

Aug 18, 2023
Haiwei Wu, Yiming Chen, Jiantao Zhou

Image forgery detection aims to detect and locate forged regions in an image. Most existing forgery detection algorithms formulate this as a classification problem, classifying pixels as forged or pristine. However, the definition of forged and pristine pixels is only relative within a single image; e.g., a forged region in image A is actually a pristine one in its source image B (splicing forgery). Such a relative definition has been severely overlooked by existing methods, which unnecessarily mix forged (pristine) regions across different images into the same category. To resolve this dilemma, we propose the FOrensic ContrAstive cLustering (FOCAL) method, a novel, simple yet very effective paradigm based on contrastive learning and unsupervised clustering for image forgery detection. Specifically, FOCAL 1) utilizes pixel-level contrastive learning to supervise the high-level forensic feature extraction in an image-by-image manner, explicitly reflecting the above relative definition; 2) employs an on-the-fly unsupervised clustering algorithm (instead of a trained one) to cluster the learned features into forged/pristine categories, further suppressing the cross-image influence from training data; and 3) allows the detection performance to be further boosted via simple feature-level concatenation without the need for retraining. Extensive experimental results over six public testing datasets demonstrate that our proposed FOCAL significantly outperforms state-of-the-art competing algorithms by large margins: +24.3% on Coverage, +18.6% on Columbia, +17.5% on FF++, +14.2% on MISD, +13.5% on CASIA and +10.3% on NIST in terms of IoU. The FOCAL paradigm could bring fresh insights and serve as a novel benchmark for the image forgery detection task. The code is available at https://github.com/HighwayWu/FOCAL.
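
A condensed sketch of the two ingredients described above: contrastive supervision computed image-by-image (so the forged/pristine labels stay relative to each image) and a tiny on-the-fly k-means that clusters one image's features into two groups at test time. The sampling scheme, loss normalization and k-means details are assumptions, not the FOCAL release.

```python
import torch
import torch.nn.functional as F


def per_image_contrastive_loss(feats, masks, temperature=0.1, n_samples=256):
    """feats: (B, C, H, W) pixel features; masks: (B, H, W), 1 = forged, 0 = pristine."""
    total = 0.0
    for f, m in zip(feats, masks):                         # supervise image-by-image on purpose
        c = f.shape[0]
        f = F.normalize(f.reshape(c, -1).t(), dim=1)       # (H*W, C) unit-norm pixel features
        m = m.reshape(-1)
        idx = torch.randperm(f.shape[0])[:n_samples]       # subsample pixels for tractability
        f, m = f[idx], m[idx]
        sim = f @ f.t() / temperature                      # pairwise similarities
        pos = (m[:, None] == m[None, :]).float()           # positives: same relative label
        pos.fill_diagonal_(0)
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        total = total - (pos * log_prob).sum() / pos.sum().clamp(min=1)
    return total / feats.shape[0]


def cluster_two_way(feat, iters: int = 10):
    """Tiny on-the-fly k-means (k = 2) on one image's features -> binary forged/pristine map."""
    c, h, w = feat.shape
    x = F.normalize(feat.reshape(c, -1).t(), dim=1)        # (H*W, C)
    centers = x[torch.randperm(x.shape[0])[:2]].clone()    # random init; nothing is trained
    for _ in range(iters):
        assign = (x @ centers.t()).argmax(dim=1)           # assign by cosine similarity
        for k in (0, 1):
            pts = x[assign == k]
            if len(pts):                                   # keep the old center if a cluster empties
                centers[k] = pts.mean(dim=0)
    return assign.reshape(h, w)


feats = torch.randn(2, 16, 32, 32)
masks = (torch.rand(2, 32, 32) > 0.8).long()
print(per_image_contrastive_loss(feats, masks).item(), cluster_two_way(feats[0]).shape)
```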

Under-Display Camera Image Restoration with Scattering Effect

Aug 08, 2023
Binbin Song, Xiangyu Chen, Shuning Xu, Jiantao Zhou

The under-display camera (UDC) provides consumers with a full-screen visual experience without any obstruction due to notches or punched holes. However, the semi-transparent nature of the display inevitably introduces severe degradation into UDC images. In this work, we address the UDC image restoration problem with specific consideration of the scattering effect caused by the display. We explicitly model the scattering effect by treating the display as a piece of homogeneous scattering medium. With this physical model of the scattering effect, we improve the image formation pipeline for image synthesis to construct a realistic UDC dataset with ground truths. To suppress the scattering effect for the eventual UDC image recovery, a two-branch restoration network is designed. More specifically, the scattering branch leverages the global modeling capability of channel-wise self-attention to estimate the parameters of the scattering effect from degraded images, while the image branch exploits the local representation advantage of CNNs to recover clear scenes, implicitly guided by the scattering branch. Extensive experiments are conducted on both real-world and synthesized data, demonstrating the superiority of the proposed method over state-of-the-art UDC restoration techniques. The source code and dataset are available at https://github.com/NamecantbeNULL/SRUDC.

* Accepted to ICCV 2023 
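
For intuition, here is a hedged sketch of a homogeneous-scattering degradation of the kind described above for data synthesis, combining a per-channel blur with a haze-style transmission/ambient term; the paper's exact formation model and parameters may differ.

```python
import torch
import torch.nn.functional as F


def synthesize_udc(clear, psf, transmission=0.8, ambient=0.3, noise_std=0.01):
    """clear: (B, 3, H, W) in [0, 1]; psf: (3, 1, k, k) per-channel blur kernel."""
    pad = psf.shape[-1] // 2
    blurred = F.conv2d(clear, psf, padding=pad, groups=3)   # display diffraction-like blur
    scattered = blurred * transmission + ambient * (1.0 - transmission)  # haze-style scattering
    degraded = scattered + noise_std * torch.randn_like(scattered)       # sensor noise
    return degraded.clamp(0.0, 1.0)


k = 9
psf = torch.ones(3, 1, k, k) / (k * k)                      # toy box PSF, not a measured one
img = torch.rand(1, 3, 128, 128)
print(synthesize_udc(img, psf).shape)                       # torch.Size([1, 3, 128, 128])
```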

A Memory-Augmented Multi-Task Collaborative Framework for Unsupervised Traffic Accident Detection in Driving Videos

Jul 27, 2023
Rongqin Liang, Yuanman Li, Yingxin Yi, Jiantao Zhou, Xia Li

Identifying traffic accidents in driving videos is crucial to ensuring the safety of autonomous driving and driver assistance systems. To address the potential danger caused by the long-tailed distribution of driving events, existing traffic accident detection (TAD) methods mainly rely on unsupervised learning. However, TAD is still challenging due to the rapid movement of cameras and dynamic scenes in driving scenarios. Existing unsupervised TAD methods mainly rely on a single pretext task, i.e., an appearance-based or future object localization task, to detect accidents. However, appearance-based approaches are easily disturbed by the rapid movement of the camera and changes in illumination, which significantly reduces the performance of traffic accident detection. Methods based on future object localization may fail to capture appearance changes in video frames, making it difficult to detect ego-involved accidents (e.g., loss of control of the ego-vehicle). In this paper, we propose a novel memory-augmented multi-task collaborative framework (MAMTCF) for unsupervised traffic accident detection in driving videos. Different from previous approaches, our method can more accurately detect both ego-involved and non-ego accidents by simultaneously modeling appearance changes and object motions in video frames through the collaboration of optical flow reconstruction and future object localization tasks. Further, we introduce a memory-augmented motion representation mechanism to fully explore the interrelation between different types of motion representations and exploit the high-level features of normal traffic patterns stored in memory to augment motion representations, thus enlarging the difference from anomalies. Experimental results on a recently published large-scale dataset demonstrate that our method achieves better performance compared to previous state-of-the-art approaches.

* 12 pages, 5 figures 
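
A small sketch of the memory-augmentation idea: a bank of learned "normal pattern" slots is queried by attention and the retrieved prototype is fused back into the motion feature, so anomalous motions sit farther from their memory-based reconstruction. The dimensions, slot count and fusion scheme are illustrative assumptions, not the MAMTCF implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionMemory(nn.Module):
    """Learned memory slots of normal motion patterns, queried by attention."""
    def __init__(self, dim: int = 128, slots: int = 32):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(slots, dim))

    def forward(self, feat):                                # feat: (B, dim) motion features
        attn = F.softmax(feat @ self.memory.t(), dim=1)     # similarity to each memory slot
        retrieved = attn @ self.memory                      # memory-based reconstruction
        return torch.cat([feat, retrieved], dim=1), retrieved


mem = MotionMemory()
feat = torch.randn(4, 128)
augmented, retrieved = mem(feat)
# a simple anomaly cue: distance between a feature and its memory-retrieved prototype
score = (feat - retrieved).pow(2).sum(dim=1)
print(augmented.shape, score.shape)                         # torch.Size([4, 256]) torch.Size([4])
```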

DeSRA: Detect and Delete the Artifacts of GAN-based Real-World Super-Resolution Models

Jul 05, 2023
Liangbin Xie, Xintao Wang, Xiangyu Chen, Gen Li, Ying Shan, Jiantao Zhou, Chao Dong

Image super-resolution (SR) with generative adversarial networks (GAN) has achieved great success in restoring realistic details. However, it is notorious that GAN-based SR models inevitably produce unpleasant and undesirable artifacts, especially in practical scenarios. Previous works typically suppress artifacts with an extra loss penalty in the training phase, which only works for in-distribution artifact types generated during training. When applied in real-world scenarios, we observe that those improved methods still generate obviously annoying artifacts during inference. In this paper, we analyze the cause and characteristics of the GAN artifacts produced in unseen test data without ground-truths. We then develop a novel method, namely DeSRA, to Detect and then Delete those SR Artifacts in practice. Specifically, we propose to measure a relative local variance distance between MSE-SR results and GAN-SR results, and locate the problematic areas based on the above distance and semantic-aware thresholds. After detecting the artifact regions, we develop a fine-tuning procedure to improve GAN-based SR models with a few samples, so that they can deal with similar types of artifacts in more unseen real data. Equipped with our DeSRA, we can successfully eliminate artifacts at inference time and improve the ability of SR models to be applied in real-world scenarios. The code will be available at https://github.com/TencentARC/DeSRA.

* The code and models will be made publicly available at https://github.com/TencentARC/DeSRA 
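
A hedged sketch of the detection step described above: compare the local variance of the MSE-SR and GAN-SR results and flag regions where the GAN output shows disproportionately more texture. The window size, distance form and the fixed threshold are assumptions; the paper additionally uses semantic-aware thresholds.

```python
import torch
import torch.nn.functional as F


def local_variance(img, win: int = 11):
    """Per-pixel variance of a single-channel image over a win x win neighbourhood."""
    pad = win // 2
    mean = F.avg_pool2d(img, win, stride=1, padding=pad, count_include_pad=False)
    mean_sq = F.avg_pool2d(img * img, win, stride=1, padding=pad, count_include_pad=False)
    return (mean_sq - mean * mean).clamp(min=0)


def artifact_map(mse_sr, gan_sr, threshold: float = 1.5, eps: float = 1e-6):
    """mse_sr, gan_sr: (B, 1, H, W) grayscale SR results for the same input."""
    relative = (local_variance(gan_sr) + eps) / (local_variance(mse_sr) + eps)
    return (relative > threshold).float()                   # candidate artifact mask (>1 means extra texture)


a = torch.rand(1, 1, 64, 64)
b = (a + 0.2 * torch.randn_like(a)).clamp(0, 1)             # pretend GAN output with extra "texture"
print(artifact_map(a, b).mean().item())                     # fraction of flagged pixels
```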

Common Knowledge Learning for Generating Transferable Adversarial Examples

Jul 01, 2023
Ruijie Yang, Yuanfang Guo, Junfu Wang, Jiantao Zhou, Yunhong Wang

This paper focuses on an important type of black-box attack, namely transfer-based adversarial attacks, where the adversary generates adversarial examples with a substitute (source) model and utilizes them to attack an unseen target model without knowing its internal information. Existing methods tend to give unsatisfactory adversarial transferability when the source and target models are from different types of DNN architectures (e.g., ResNet-18 and Swin Transformer). In this paper, we observe that the above phenomenon is induced by the output inconsistency problem. To alleviate this problem while effectively utilizing the existing DNN models, we propose a common knowledge learning (CKL) framework to learn better network weights for generating adversarial examples with better transferability, under fixed network architectures. Specifically, to reduce model-specific features and obtain better output distributions, we construct a multi-teacher framework, where the knowledge is distilled from different teacher architectures into one student network. Considering that the gradient of the input is usually utilized to generate adversarial examples, we impose constraints on the gradients between the student and teacher models, to further alleviate the output inconsistency problem and enhance adversarial transferability. Extensive experiments demonstrate that our proposed method can significantly improve adversarial transferability.

* 11 pages, 5 figures 
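
The sketch below shows, under assumptions, the two losses described above: distilling the averaged soft outputs of several teachers into one student, plus an input-gradient alignment term between student and teachers. The toy networks, temperature and the choice of the top logit for the gradient are placeholders, not the paper's training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def ckl_losses(student, teachers, x, temperature: float = 4.0):
    """Distillation loss from averaged teacher soft outputs + input-gradient alignment."""
    x = x.clone().requires_grad_(True)
    s_logits = student(x)
    # input gradient of the student's top logit (kept differentiable for training)
    g_s = torch.autograd.grad(s_logits.max(dim=1).values.sum(), x, create_graph=True)[0]
    t_probs, grad_loss = 0.0, 0.0
    for teacher in teachers:
        t_logits = teacher(x)
        t_probs = t_probs + F.softmax(t_logits / temperature, dim=1).detach() / len(teachers)
        g_t = torch.autograd.grad(t_logits.max(dim=1).values.sum(), x)[0]  # teachers stay frozen
        grad_loss = grad_loss + F.mse_loss(g_s, g_t) / len(teachers)
    kd_loss = F.kl_div(F.log_softmax(s_logits / temperature, dim=1), t_probs,
                       reduction="batchmean") * temperature ** 2
    return kd_loss, grad_loss


# toy usage with tiny stand-in networks (real source/target models would be CNNs/Transformers)
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
teachers = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)) for _ in range(2)]
kd, grad = ckl_losses(student, teachers, torch.rand(4, 3, 32, 32))
(kd + grad).backward()
print(kd.item(), grad.item())
```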

Generalizable Synthetic Image Detection via Language-guided Contrastive Learning

May 23, 2023
Haiwei Wu, Jiantao Zhou, Shile Zhang

The heightened realism of AI-generated images can be attributed to the rapid development of synthetic models, including generative adversarial networks (GANs) and diffusion models (DMs). However, the malevolent use of synthetic images, such as the dissemination of fake news or the creation of fake profiles, raises significant concerns regarding the authenticity of images. Though many forensic algorithms have been developed for detecting synthetic images, their performance, especially the generalization capability, is still far from adequate to cope with the increasing number of synthetic models. In this work, we propose a simple yet very effective synthetic image detection method via language-guided contrastive learning and a new formulation of the detection problem. We first augment the training images with carefully designed textual labels, enabling us to use joint image-text contrastive learning for forensic feature extraction. In addition, we formulate synthetic image detection as an identification problem, which is vastly different from traditional classification-based approaches. It is shown that our proposed LanguAge-guided SynThEsis Detection (LASTED) model achieves much improved generalizability to unseen image generation models and delivers promising performance that far exceeds state-of-the-art competitors by +22.66% in accuracy and +15.24% in AUC. The code is available at https://github.com/HighwayWu/LASTED.
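
A toy sketch of language-guided contrastive training: each image is paired with a textual label, both are embedded into a shared space, and a CLIP-style symmetric InfoNCE loss pulls matching pairs together. The label texts, encoders and the handling of repeated labels within a batch are simplified placeholders, not the LASTED setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LABELS = ["real photo", "synthetic image"]                  # illustrative label set only


class ToyImageEncoder(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))

    def forward(self, x):                                   # x: (B, 3, H, W)
        return F.normalize(self.net(x), dim=1)


class ToyTextEncoder(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.table = nn.Embedding(len(LABELS), dim)         # stand-in for a real text encoder

    def forward(self, label_ids):                           # label_ids: (B,)
        return F.normalize(self.table(label_ids), dim=1)


def clip_style_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE over matching image/text pairs in the batch."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


images = torch.rand(8, 3, 64, 64)
label_ids = torch.randint(0, len(LABELS), (8,))             # each image paired with its text label
loss = clip_style_loss(ToyImageEncoder()(images), ToyTextEncoder()(label_ids))
print(loss.item())
```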
