Xinyang Jiang

Compression-Realized Deep Structural Network for Video Quality Enhancement

May 10, 2024

Hanchi Sun, Xiaohong Liu, Xinyang Jiang, Yifei Shen, Dongsheng Li, Xiongkuo Min, Guangtao Zhai

Abstract: This paper focuses on quality enhancement for compressed videos. Although deep network-based video restorers have made impressive progress, most existing methods lack a structured design that fully leverages the priors within compression codecs. Since the quality degradation of the video is primarily induced by the compression algorithm, a new paradigm is needed for a more "conscious" process of quality enhancement. We therefore propose the Compression-Realized Deep Structural Network (CRDS), which introduces three inductive biases aligned with the three primary processes in the classic compression codec, merging the strengths of classical encoder architecture with deep network capabilities. Inspired by the residual extraction and domain transformation process in the codec, a pre-trained Latent Degradation Residual Auto-Encoder is proposed to transform video frames into a latent feature space, and the mutual neighborhood attention mechanism is integrated for precise motion estimation and residual extraction. Furthermore, drawing inspiration from the quantization noise distribution of the codec, CRDS proposes a novel Progressive Denoising framework with intermediate supervision that decomposes quality enhancement into a series of simpler denoising sub-tasks. Experimental results on datasets like LDV 2.0 and MFQE 2.0 indicate that our approach surpasses state-of-the-art models.

LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation

Apr 01, 2024

Zilong Wang, Xufang Luo, Xinyang Jiang, Dongsheng Li, Lili Qiu

Abstract: Evaluating generated radiology reports is crucial for the development of radiology AI, but existing metrics fail to reflect the task's clinical requirements. This study proposes a novel evaluation framework that uses large language models (LLMs) to compare radiology reports for assessment. We compare the performance of various LLMs and demonstrate that, when using GPT-4, our proposed metric achieves evaluation consistency close to that of radiologists. Furthermore, to reduce costs and improve accessibility so that the method is practical, we construct a dataset from LLM evaluation results and perform knowledge distillation to train a smaller model. The distilled model achieves evaluation capabilities comparable to GPT-4. Our framework and distilled model offer an accessible and efficient evaluation method for radiology report generation, facilitating the development of more clinically relevant models. The model will be open-sourced and made accessible.

* 11 pages, 6 figures

Understanding Training-free Diffusion Guidance: Mechanisms and Limitations

Mar 19, 2024

Yifei Shen, Xinyang Jiang, Yezhen Wang, Yifan Yang, Dongqi Han, Dongsheng Li

Abstract: Adding control to pretrained diffusion models has become an increasingly popular research area, with extensive applications in computer vision, reinforcement learning, and AI for science. Recently, several studies have proposed training-free diffusion guidance that uses off-the-shelf networks pretrained on clean images. This approach enables zero-shot conditional generation for universal control formats, which appears to offer a free lunch in diffusion guidance. In this paper, we aim to develop a deeper understanding of the operational mechanisms and fundamental limitations of training-free guidance. We offer a theoretical analysis that supports training-free guidance from the perspective of optimization, distinguishing it from classifier-based (or classifier-free) guidance. To elucidate its drawbacks, we theoretically demonstrate that training-free methods are more susceptible to adversarial gradients and exhibit slower convergence rates than classifier guidance. We then introduce a collection of techniques designed to overcome these limitations, accompanied by theoretical rationale and empirical evidence. Our experiments in image and motion generation confirm the efficacy of these techniques.

DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models

Dec 21, 2023

Brian Nlong Zhao, Yuhang Xiao, Jiashu Xu, Xinyang Jiang, Yifan Yang, Dongsheng Li, Laurent Itti, Vibhav Vineet, Yunhao Ge

Abstract: The popularization of Text-to-Image (T2I) diffusion models enables the generation of high-quality images from text descriptions. However, generating diverse customized images with reference visual attributes remains challenging. This work focuses on personalizing T2I diffusion models at a more abstract concept or category level, adapting commonalities from a set of reference images while creating new instances with sufficient variations. We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts, enabling the generation of novel images by sampling prompts from the learned distribution. These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions. We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D. Finally, we demonstrate the effectiveness of our approach through quantitative analysis, including automatic evaluation and human assessment. Project website: https://briannlongzhao.github.io/DreamDistribution
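
As a rough illustration of sampling prompts from a learned distribution, the sketch below fits a per-dimension Gaussian to a handful of prompt vectors and draws new ones from it. The Gaussian form, the function names, and the `scale` knob are assumptions made for illustration; the abstract does not specify the distribution family or how it is trained.

```python
import random
import statistics

def fit_prompt_distribution(prompts):
    """Per-dimension mean and standard deviation of a set of prompt vectors."""
    dims = list(zip(*prompts))
    mean = [statistics.fmean(d) for d in dims]
    std = [statistics.pstdev(d) for d in dims]
    return mean, std

def sample_prompt(mean, std, rng, scale=1.0):
    """Draw mean + scale * std * eps with eps ~ N(0, 1).

    `scale` is a hypothetical knob: 0 returns the mean prompt,
    larger values give more varied samples.
    """
    return [m + scale * s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

# Toy 2-dimensional "soft prompts"; real prompt embeddings are far larger.
learned_prompts = [[0.0, 2.0], [2.0, 4.0]]
mean, std = fit_prompt_distribution(learned_prompts)
rng = random.Random(0)
new_prompt = sample_prompt(mean, std, rng)                  # a novel prompt vector
mean_prompt = sample_prompt(mean, std, rng, scale=0.0)      # no variation
```

Each sampled vector would then be fed to the text encoder or diffusion model in place of a fixed prompt; that step is omitted here.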

Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models

Dec 11, 2023

Yubin Wang, Xinyang Jiang, De Cheng, Dongsheng Li, Cairong Zhao

Abstract: Prompt learning has become a prevalent strategy for adapting vision-language foundation models to downstream tasks. As large language models (LLMs) have emerged, recent studies have explored the use of category-related descriptions as input to enhance prompt effectiveness. Nevertheless, conventional descriptions lack structured information that effectively represents the interconnections among entities or attributes linked to a particular category. To address this limitation and harness structured knowledge, this paper advocates leveraging LLMs to build a graph for each description that models the entities and attributes describing the category, as well as their correlations. Existing prompt tuning methods are inadequate for handling this structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), which enables simultaneous modeling of both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts that model overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Extensive experiments demonstrate that HPT is highly effective and generalizes much better than existing SOTA methods. Our code is available at https://github.com/Vill-Lab/2024-AAAI-HPT.

* AAAI 2024

Unified Medical Image Pre-training in Language-Guided Common Semantic Space

Nov 24, 2023

Xiaoxuan He, Yifan Yang, Xinyang Jiang, Xufang Luo, Haoji Hu, Siyun Zhao, Dongsheng Li, Yuqing Yang, Lili Qiu

Abstract: Vision-Language Pre-training (VLP) has shown its merits in analysing medical images by leveraging the semantic congruence between medical images and their corresponding reports. It efficiently learns visual representations, which in turn facilitates enhanced analysis and interpretation of intricate imaging data. However, this observation is predominantly justified on single-modality data (mostly 2D images like X-rays), and adapting VLP to learn unified representations for medical images in real scenarios remains an open challenge. This is because medical images often encompass a variety of modalities, especially modalities with different numbers of dimensions (e.g., 3D images like Computed Tomography). To overcome these challenges, we propose a Unified Medical Image Pre-training framework, namely UniMedI, which uses diagnostic reports as a common semantic space to create unified representations for diverse modalities of medical images (especially 2D and 3D images). Under the text's guidance, we effectively uncover visual modality information, identifying the affected areas in 2D X-rays and the slices containing lesions in sophisticated 3D CT scans, ultimately enhancing the consistency across various medical imaging modalities. To demonstrate the effectiveness and versatility of UniMedI, we evaluate its performance on both 2D and 3D images across 10 different datasets, covering a wide range of medical image tasks such as classification, segmentation, and retrieval. UniMedI has demonstrated superior performance in downstream tasks, showcasing its effectiveness in establishing a universal medical visual representation.

Online Video Quality Enhancement with Spatial-Temporal Look-up Tables

Nov 22, 2023

Zefan Qu, Xinyang Jiang, Yifan Yang, Dongsheng Li, Cairong Zhao

Abstract: Low latency is crucial for online video-based applications, such as video conferencing and cloud gaming, which makes improving video quality in online scenarios increasingly important. However, existing quality enhancement methods are limited by slow inference speed and the requirement for temporal information contained in future frames, making it challenging to deploy them directly in online tasks. In this paper, we propose a novel method, STLVQE, specifically designed to address the rarely studied online video quality enhancement (Online-VQE) problem. STLVQE introduces a new VQE framework that contains a Module-Agnostic Feature Extractor, which greatly reduces redundant computation, and redesigns the propagation, alignment, and enhancement modules of the network. A Spatial-Temporal Look-up Table (STL) is proposed, which extracts spatial-temporal information in videos while saving substantial inference time. To the best of our knowledge, we are the first to exploit the LUT structure to extract temporal information in video tasks. Extensive experiments on the MFQE 2.0 dataset demonstrate that STLVQE achieves a satisfactory performance-speed trade-off.

AccFlow: Backward Accumulation for Long-Range Optical Flow

Aug 25, 2023

Guangyang Wu, Xiaohong Liu, Kunming Luo, Xi Liu, Qingqing Zheng, Shuaicheng Liu, Xinyang Jiang, Guangtao Zhai, Wenyi Wang

Abstract: Recent deep learning-based optical flow estimators have exhibited impressive performance in generating local flows between consecutive frames. However, the estimation of long-range flows between distant frames, particularly under complex object deformation and large motion occlusion, remains a challenging task. One promising solution is to accumulate local flows explicitly or implicitly to obtain the desired long-range flow. Nevertheless, accumulation errors and flow misalignment can hinder the effectiveness of this approach. This paper proposes a novel recurrent framework called AccFlow, which recursively accumulates local flows backward using a deformable module called AccPlus. In addition, an adaptive blending module is designed along with AccPlus to alleviate the occlusion effect through backward accumulation and rectify the accumulation error. Notably, we demonstrate the superiority of backward accumulation over conventional forward accumulation, which to the best of our knowledge has not been explicitly established before. To train and evaluate the proposed AccFlow, we have constructed a large-scale high-quality dataset named CVO, which provides ground-truth optical flow labels between adjacent and distant frames. Extensive experiments validate the effectiveness of AccFlow in handling long-range optical flow estimation. Code is available at https://github.com/mulns/AccFlow.
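
To make "accumulating local flows" concrete, here is a minimal 1D sketch of the conventional explicit composition, F_13(x) = F_12(x) + F_23(x + F_12(x)). It is not AccFlow's backward accumulation or its AccPlus module, which are learned. The function name and the nearest-index sampling are assumptions chosen for brevity.

```python
def accumulate_forward(flow_12, flow_23):
    """Compose two 1D flows: F_13(x) = F_12(x) + F_23(x + F_12(x)).

    flow_23 is sampled at the nearest valid index (clamped at the borders),
    a simplification of the bilinear warping used in practice.
    """
    n = len(flow_23)
    out = []
    for x, d in enumerate(flow_12):
        target = min(n - 1, max(0, round(x + d)))  # where pixel x lands in frame 2
        out.append(d + flow_23[target])
    return out

flow_12 = [1.0, 1.0, 1.0, 1.0]   # every pixel moves right by 1
flow_23 = [2.0, 2.0, 2.0, 2.0]   # then right by 2
flow_13 = accumulate_forward(flow_12, flow_23)

# With a non-constant field the second flow must be read at the displaced position.
varied = accumulate_forward([1, 0, 0], [0, 5, 0])
```

Because each step samples the next flow at an estimated position, any error in the first flow shifts the lookup for the second, which is one way accumulation error and misalignment can compound over long ranges.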

Dissecting Arbitrary-scale Super-resolution Capability from Pre-trained Diffusion Generative Models

Jun 01, 2023

Ruibin Li, Qihua Zhou, Song Guo, Jie Zhang, Jingcai Guo, Xinyang Jiang, Yifei Shen, Zhenhua Han

Abstract: Diffusion-based Generative Models (DGMs) have achieved unparalleled performance in synthesizing high-quality visual content, opening up the opportunity to improve image super-resolution (SR) tasks. Recent solutions for these tasks often train architecture-specific DGMs from scratch, or require iterative fine-tuning and distillation on pre-trained DGMs, both of which take considerable time and hardware investments. More seriously, since the DGMs are established with a discrete pre-defined upsampling scale, they are poorly matched to the emerging requirements of arbitrary-scale super-resolution (ASSR), where a unified model adapts to arbitrary upsampling scales instead of preparing a series of distinct models for each case. These limitations raise an intriguing question: can we identify the ASSR capability of existing pre-trained DGMs without the need for distillation or fine-tuning? In this paper, we take a step towards resolving this matter by proposing Diff-SR, a first attempt at ASSR based solely on pre-trained DGMs, without additional training efforts. It is motivated by an exciting finding that a simple methodology, which first injects a specific amount of noise into the low-resolution images before invoking a DGM's backward diffusion process, outperforms current leading solutions. The key insight is determining a suitable amount of noise to inject: too little leads to poor low-level fidelity, while too much degrades the high-level signature. Through a fine-grained theoretical analysis, we propose the Perceptual Recoverable Field (PRF), a metric that achieves the optimal trade-off between these two factors. Extensive experiments verify the effectiveness, flexibility, and adaptability of Diff-SR, demonstrating superior performance to state-of-the-art solutions under diverse ASSR environments.
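
The noise-injection step can be pictured with the standard forward-diffusion formula x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps. The sketch below is an illustration under several assumptions not stated in the abstract: a linear beta schedule, nearest-neighbour upsampling before injection, and a hand-picked step `t`. It does not compute the PRF metric and stops before the pretrained model's reverse process.

```python
import math
import random

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def upsample_nearest(image, scale):
    """Nearest-neighbour upsampling of a 2D list by an arbitrary scale factor."""
    h, w = len(image), len(image[0])
    out_h, out_w = max(1, round(h * scale)), max(1, round(w * scale))
    return [[image[min(h - 1, int(r / scale))][min(w - 1, int(c / scale))]
             for c in range(out_w)] for r in range(out_h)]

def inject_noise(image, t, alpha_bar, rng):
    """Forward-diffuse to step t: sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    return [[signal * p + noise * rng.gauss(0.0, 1.0) for p in row] for row in image]

low_res = [[0.0, 1.0], [1.0, 0.0]]
alpha_bar = linear_alpha_bar(1000)
upsampled = upsample_nearest(low_res, 1.5)   # non-integer scale, 2x2 -> 3x3
noisy = inject_noise(upsampled, t=300, alpha_bar=alpha_bar, rng=random.Random(0))
# A pretrained model's reverse process would start from `noisy` at step t.
```

A smaller `t` keeps more of the upsampled input and a larger `t` hands more of the result to the generative prior, which is the trade-off the abstract describes.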

Online Video Streaming Super-Resolution with Adaptive Look-Up Table Fusion

Mar 01, 2023

Guanghao Yin, Xinyang Jiang, Shan Jiang, Zhenhua Han, Ningxin Zheng, Huan Yang, Donglin Bai, Haisheng Tan, Shouqian Sun, Yuqing Yang (+2 more)

Abstract: This paper focuses on super-resolution for online video streaming data. Applying existing super-resolution methods to video streaming data is non-trivial for two reasons. First, to support applications with constant interaction, video streaming has strict latency requirements that make most existing methods hard to apply, especially on low-end devices. Second, existing video streaming protocols (e.g., WebRTC) dynamically adapt the video quality to the network condition, so video streaming in the wild varies greatly under different network bandwidths, which leads to diverse and dynamic degradations. To tackle these two challenges, we propose a novel video super-resolution method for online video streaming. First, we incorporate Look-Up Tables (LUTs) into lightweight convolution modules to achieve real-time latency. Second, for varying degradations, we propose a pixel-level LUT fusion strategy, where a set of LUT bases are built upon state-of-the-art SR networks pre-trained on different degraded data, and those LUT bases are combined with weights extracted by lightweight convolution modules to adaptively handle dynamic degradations. Extensive experiments are conducted on a newly proposed online video streaming dataset named LDV-WebRTC. The results show that our method significantly outperforms existing LUT-based methods and offers competitive SR performance at faster speed compared to efficient CNN-based methods. Accelerated with our parallel LUT inference, our proposed method can even support online 720P video SR at around 100 FPS.
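
The pixel-level fusion idea can be sketched as a weighted sum of table lookups. This toy version indexes each table by a single 8-bit pixel and uses two made-up tone curves as the "bases"; in the paper the bases are built from pretrained SR networks and the weights come from convolution modules, neither of which is reproduced here. All names are hypothetical.

```python
def build_lut(fn, levels=256):
    """Tabulate fn over every 8-bit input so inference is a single index."""
    return [fn(v) for v in range(levels)]

def fuse_luts(image, luts, weights):
    """Per-pixel weighted sum of LUT outputs.

    image:   2D list of ints in [0, 255]
    luts:    list of K tables, each with 256 entries
    weights: weights[r][c] is a list of K fusion weights for that pixel
    """
    out = []
    for r, row in enumerate(image):
        out_row = []
        for c, pixel in enumerate(row):
            value = sum(w * lut[pixel] for w, lut in zip(weights[r][c], luts))
            out_row.append(value)
        out.append(out_row)
    return out

# Two stand-in bases (illustrative tone curves, not SR networks).
luts = [build_lut(lambda v: v / 255.0), build_lut(lambda v: (v / 255.0) ** 0.5)]
image = [[0, 255], [64, 128]]
# Hand-written weights; a real system would predict these per pixel.
weights = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]]
fused = fuse_luts(image, luts, weights)
```

The appeal of the design is that the expensive computation happens once, when the tables are built, so per-pixel inference reduces to indexing and a small weighted sum.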
