Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jinjin Gu

Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Dec 14, 2023

Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, Chao Dong

Figure 1 for Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Figure 2 for Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Figure 3 for Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Figure 4 for Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Abstract:We introduce a Depicted image Quality Assessment method (DepictQA), overcoming the constraints of traditional score-based approaches. DepictQA leverages Multi-modal Large Language Models (MLLMs), allowing for detailed, language-based, human-like evaluation of image quality. Unlike conventional Image Quality Assessment (IQA) methods relying on scores, DepictQA interprets image content and distortions descriptively and comparatively, aligning closely with humans' reasoning process. To build the DepictQA model, we establish a hierarchical task framework, and collect a multi-modal IQA training dataset, named M-BAPPS. To navigate the challenges in limited training data and processing multiple images, we propose to use multi-source training data and specialized image tags. Our DepictQA demonstrates a better performance than score-based methods on the BAPPS benchmark. Moreover, compared with general MLLMs, our DepictQA can generate more accurate reasoning descriptive languages. Our research indicates that language-based IQA methods have the potential to be customized for individual preferences. Datasets and codes will be released publicly.

Via

Access Paper or Ask Questions

Binarized 3D Whole-body Human Mesh Recovery

Nov 24, 2023

Zhiteng Li, Yulun Zhang, Jing Lin, Haotong Qin, Jinjin Gu, Xin Yuan, Linghe Kong, Xiaokang Yang

Abstract:3D whole-body human mesh recovery aims to reconstruct the 3D human body, face, and hands from a single image. Although powerful deep learning models have achieved accurate estimation in this task, they require enormous memory and computational resources. Consequently, these methods can hardly be deployed on resource-limited edge devices. In this work, we propose a Binarized Dual Residual Network (BiDRN), a novel quantization method to estimate the 3D human body, face, and hands parameters efficiently. Specifically, we design a basic unit Binarized Dual Residual Block (BiDRB) composed of Local Convolution Residual (LCR) and Block Residual (BR), which can preserve full-precision information as much as possible. For LCR, we generalize it to four kinds of convolutional modules so that full-precision information can be propagated even between mismatched dimensions. We also binarize the face and hands box-prediction network as Binaried BoxNet, which can further reduce the model redundancy. Comprehensive quantitative and qualitative experiments demonstrate the effectiveness of BiDRN, which has a significant improvement over state-of-the-art binarization algorithms. Moreover, our proposed BiDRN achieves comparable performance with full-precision method Hand4Whole while using just 22.1% parameters and 14.8% operations. We will release all the code and pretrained models.

* The code will be available at https://github.com/ZHITENGLI/BiDRN

Via

Access Paper or Ask Questions

Image Super-Resolution with Text Prompt Diffusion

Nov 24, 2023

Zheng Chen, Yulun Zhang, Jinjin Gu, Xin Yuan, Linghe Kong, Guihai Chen, Xiaokang Yang

Figure 1 for Image Super-Resolution with Text Prompt Diffusion

Figure 2 for Image Super-Resolution with Text Prompt Diffusion

Figure 3 for Image Super-Resolution with Text Prompt Diffusion

Figure 4 for Image Super-Resolution with Text Prompt Diffusion

Abstract:Image super-resolution (SR) methods typically model degradation to improve reconstruction accuracy in complex and unknown degradation scenarios. However, extracting degradation information from low-resolution images is challenging, which limits the model performance. To boost image SR performance, one feasible approach is to introduce additional priors. Inspired by advancements in multi-modal methods and text prompt image processing, we introduce text prompts to image SR to provide degradation priors. Specifically, we first design a text-image generation pipeline to integrate text into SR dataset through the text degradation representation and degradation model. The text representation applies a discretization manner based on the binning method to describe the degradation abstractly. This representation method can also maintain the flexibility of language. Meanwhile, we propose the PromptSR to realize the text prompt SR. The PromptSR employs the diffusion model and the pre-trained language model (e.g., T5 and CLIP). We train the model on the generated text-image dataset. Extensive experiments indicate that introducing text prompts into image SR, yields excellent results on both synthetic and real-world images. Code: https://github.com/zhengchen1999/PromptSR.

* Code is available at https://github.com/zhengchen1999/PromptSR

Via

Access Paper or Ask Questions

Dual Aggregation Transformer for Image Super-Resolution

Aug 11, 2023

Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang, Fisher Yu

Figure 1 for Dual Aggregation Transformer for Image Super-Resolution

Figure 2 for Dual Aggregation Transformer for Image Super-Resolution

Figure 3 for Dual Aggregation Transformer for Image Super-Resolution

Figure 4 for Dual Aggregation Transformer for Image Super-Resolution

Abstract:Transformer has recently gained considerable popularity in low-level vision tasks, including image super-resolution (SR). These networks utilize self-attention along different dimensions, spatial or channel, and achieve impressive performance. This inspires us to combine the two dimensions in Transformer for a more powerful representation capability. Based on the above idea, we propose a novel Transformer model, Dual Aggregation Transformer (DAT), for image SR. Our DAT aggregates features across spatial and channel dimensions, in the inter-block and intra-block dual manner. Specifically, we alternately apply spatial and channel self-attention in consecutive Transformer blocks. The alternate strategy enables DAT to capture the global context and realize inter-block feature aggregation. Furthermore, we propose the adaptive interaction module (AIM) and the spatial-gate feed-forward network (SGFN) to achieve intra-block feature aggregation. AIM complements two self-attention mechanisms from corresponding dimensions. Meanwhile, SGFN introduces additional non-linear spatial information in the feed-forward network. Extensive experiments show that our DAT surpasses current methods. Code and models are obtainable at https://github.com/zhengchen1999/DAT.

* Accepted to ICCV 2023. Code is available at https://github.com/zhengchen1999/DAT

Via

Access Paper or Ask Questions

Crafting Training Degradation Distribution for the Accuracy-Generalization Trade-off in Real-World Super-Resolution

Jun 01, 2023

Ruofan Zhang, Jinjin Gu, Haoyu Chen, Chao Dong, Yulun Zhang, Wenming Yang

Figure 1 for Crafting Training Degradation Distribution for the Accuracy-Generalization Trade-off in Real-World Super-Resolution

Figure 2 for Crafting Training Degradation Distribution for the Accuracy-Generalization Trade-off in Real-World Super-Resolution

Figure 3 for Crafting Training Degradation Distribution for the Accuracy-Generalization Trade-off in Real-World Super-Resolution

Figure 4 for Crafting Training Degradation Distribution for the Accuracy-Generalization Trade-off in Real-World Super-Resolution

Abstract:Super-resolution (SR) techniques designed for real-world applications commonly encounter two primary challenges: generalization performance and restoration accuracy. We demonstrate that when methods are trained using complex, large-range degradations to enhance generalization, a decline in accuracy is inevitable. However, since the degradation in a certain real-world applications typically exhibits a limited variation range, it becomes feasible to strike a trade-off between generalization performance and testing accuracy within this scope. In this work, we introduce a novel approach to craft training degradation distributions using a small set of reference images. Our strategy is founded upon the binned representation of the degradation space and the Fr\'echet distance between degradation distributions. Our results indicate that the proposed technique significantly improves the performance of test images while preserving generalization capabilities in real-world applications.

* This paper has been accepted to ICML 2023

Via

Access Paper or Ask Questions

Hierarchical Integration Diffusion Model for Realistic Image Deblurring

May 24, 2023

Zheng Chen, Yulun Zhang, Ding Liu, Bin Xia, Jinjin Gu, Linghe Kong, Xin Yuan

Abstract:Diffusion models (DMs) have recently been introduced in image deblurring and exhibited promising performance, particularly in terms of details reconstruction. However, the diffusion model requires a large number of inference iterations to recover the clean image from pure Gaussian noise, which consumes massive computational resources. Moreover, the distribution synthesized by the diffusion model is often misaligned with the target results, leading to restrictions in distortion-based metrics. To address the above issues, we propose the Hierarchical Integration Diffusion Model (HI-Diff), for realistic image deblurring. Specifically, we perform the DM in a highly compacted latent space to generate the prior feature for the deblurring process. The deblurring process is implemented by a regression-based method to obtain better distortion accuracy. Meanwhile, the highly compact latent space ensures the efficiency of the DM. Furthermore, we design the hierarchical integration module to fuse the prior into the regression-based model from multiple scales, enabling better generalization in complex blurry scenarios. Comprehensive experiments on synthetic and real-world blur datasets demonstrate that our HI-Diff outperforms state-of-the-art methods. Code and trained models are available at https://github.com/zhengchen1999/HI-Diff.

* Code is available at https://github.com/zhengchen1999/HI-Diff

Via

Access Paper or Ask Questions

Networks are Slacking Off: Understanding Generalization Problem in Image Deraining

May 24, 2023

Jinjin Gu, Xianzheng Ma, Xiangtao Kong, Yu Qiao, Chao Dong

Figure 1 for Networks are Slacking Off: Understanding Generalization Problem in Image Deraining

Figure 2 for Networks are Slacking Off: Understanding Generalization Problem in Image Deraining

Figure 3 for Networks are Slacking Off: Understanding Generalization Problem in Image Deraining

Figure 4 for Networks are Slacking Off: Understanding Generalization Problem in Image Deraining

Abstract:Deep deraining networks, while successful in laboratory benchmarks, consistently encounter substantial generalization issues when deployed in real-world applications. A prevailing perspective in deep learning encourages the use of highly complex training data, with the expectation that a richer image content knowledge will facilitate overcoming the generalization problem. However, through comprehensive and systematic experimentation, we discovered that this strategy does not enhance the generalization capability of these networks. On the contrary, it exacerbates the tendency of networks to overfit to specific degradations. Our experiments reveal that better generalization in a deraining network can be achieved by simplifying the complexity of the training data. This is due to the networks are slacking off during training, that is, learning the least complex elements in the image content and degradation to minimize training loss. When the complexity of the background image is less than that of the rain streaks, the network will prioritize the reconstruction of the background, thereby avoiding overfitting to the rain patterns and resulting in improved generalization performance. Our research not only offers a valuable perspective and methodology for better understanding the generalization problem in low-level vision tasks, but also displays promising practical potential.

Via

Access Paper or Ask Questions

Masked Image Training for Generalizable Deep Image Denoising

Mar 23, 2023

Haoyu Chen, Jinjin Gu, Yihao Liu, Salma Abdel Magid, Chao Dong, Qiong Wang, Hanspeter Pfister, Lei Zhu

Figure 1 for Masked Image Training for Generalizable Deep Image Denoising

Figure 2 for Masked Image Training for Generalizable Deep Image Denoising

Figure 3 for Masked Image Training for Generalizable Deep Image Denoising

Figure 4 for Masked Image Training for Generalizable Deep Image Denoising

Abstract:When capturing and storing images, devices inevitably introduce noise. Reducing this noise is a critical task called image denoising. Deep learning has become the de facto method for image denoising, especially with the emergence of Transformer-based models that have achieved notable state-of-the-art results on various image tasks. However, deep learning-based methods often suffer from a lack of generalization ability. For example, deep models trained on Gaussian noise may perform poorly when tested on other noise distributions. To address this issue, we present a novel approach to enhance the generalization performance of denoising networks, known as masked training. Our method involves masking random pixels of the input image and reconstructing the missing information during training. We also mask out the features in the self-attention layers to avoid the impact of training-testing inconsistency. Our approach exhibits better generalization ability than other deep learning models and is directly applicable to real-world scenarios. Additionally, our interpretability analysis demonstrates the superiority of our method.

* Accepted to CVPR 2023

Via

Access Paper or Ask Questions

Recursive Generalization Transformer for Image Super-Resolution

Mar 14, 2023

Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang

Figure 1 for Recursive Generalization Transformer for Image Super-Resolution

Figure 2 for Recursive Generalization Transformer for Image Super-Resolution

Figure 3 for Recursive Generalization Transformer for Image Super-Resolution

Figure 4 for Recursive Generalization Transformer for Image Super-Resolution

Abstract:Transformer architectures have exhibited remarkable performance in image super-resolution (SR). Since the quadratic computational complexity of the self-attention (SA) in Transformer, existing methods tend to adopt SA in a local region to reduce overheads. However, the local design restricts the global context exploitation, which is critical for accurate image reconstruction. In this work, we propose the Recursive Generalization Transformer (RGT) for image SR, which can capture global spatial information and is suitable for high-resolution images. Specifically, we propose the recursive-generalization self-attention (RG-SA). It recursively aggregates input features into representative feature maps, and then utilizes cross-attention to extract global information. Meanwhile, the channel dimensions of attention matrices (query, key, and value) are further scaled for a better trade-off between computational overheads and performance. Furthermore, we combine the RG-SA with local self-attention to enhance the exploitation of the global context, and propose the hybrid adaptive integration (HAI) for module integration. The HAI allows the direct and effective fusion between features at different levels (local or global). Extensive experiments demonstrate that our RGT outperforms recent state-of-the-art methods.

Via

Access Paper or Ask Questions

Xformer: Hybrid X-Shaped Transformer for Image Denoising

Mar 11, 2023

Jiale Zhang, Yulun Zhang, Jinjin Gu, Jiahua Dong, Linghe Kong, Xiaokang Yang

Abstract:In this paper, we present a hybrid X-shaped vision Transformer, named Xformer, which performs notably on image denoising tasks. We explore strengthening the global representation of tokens from different scopes. In detail, we adopt two types of Transformer blocks. The spatial-wise Transformer block performs fine-grained local patches interactions across tokens defined by spatial dimension. The channel-wise Transformer block performs direct global context interactions across tokens defined by channel dimension. Based on the concurrent network structure, we design two branches to conduct these two interaction fashions. Within each branch, we employ an encoder-decoder architecture to capture multi-scale features. Besides, we propose the Bidirectional Connection Unit (BCU) to couple the learned representations from these two branches while providing enhanced information fusion. The joint designs make our Xformer powerful to conduct global information modeling in both spatial and channel dimensions. Extensive experiments show that Xformer, under the comparable model complexity, achieves state-of-the-art performance on the synthetic and real-world image denoising tasks.

Via

Access Paper or Ask Questions