Liang Liao

Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment

Aug 23, 2023
Kangmin Xu, Liang Liao, Jing Xiao, Chaofeng Chen, Haoning Wu, Qiong Yan, Weisi Lin

Image Quality Assessment (IQA) is a fundamental task in computer vision, yet it remains an unresolved challenge owing to intricate distortion conditions, diverse image content, and limited data availability. Recently, the community has witnessed the emergence of numerous large-scale pretrained foundation models, which benefit greatly from dramatically increased data and parameter capacities. However, it remains an open problem whether the scaling law observed in high-level tasks also applies to IQA, a task closely tied to low-level cues. In this paper, we demonstrate that, with proper injection of local distortion features, a larger pretrained and fixed foundation model performs better on IQA tasks. Specifically, because the vision transformer (ViT) lacks local distortion structure and the corresponding inductive bias, alongside the large-scale pretrained ViT we use another pretrained convolutional neural network (CNN), well known for capturing local structure, to extract multi-scale image features. We further propose a local distortion extractor that obtains local distortion features from the pretrained CNN and a local distortion injector that injects these features into the ViT. By training only the extractor and the injector, our method benefits from the rich knowledge in powerful foundation models and achieves state-of-the-art performance on popular IQA datasets, indicating that IQA is not only a low-level problem but also benefits from stronger high-level features drawn from large-scale pretrained models.
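
As a rough illustration of the extract-and-inject idea described above, the sketch below projects a CNN feature map into tokens and lets frozen ViT tokens attend to them through a small trainable cross-attention adapter. The module names, shapes, and the choice of cross-attention are assumptions made for illustration, not the paper's exact architecture.

```python
# Hypothetical sketch: a frozen ViT block's tokens attend to multi-scale CNN
# features through a small trainable extractor/injector pair.
import torch
import torch.nn as nn

class LocalDistortionExtractor(nn.Module):
    """Projects a CNN feature map to token embeddings of the ViT width."""
    def __init__(self, cnn_channels: int, vit_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(cnn_channels, vit_dim, kernel_size=1)

    def forward(self, feat):                        # feat: (B, C, H, W)
        tokens = self.proj(feat)                    # (B, D, H, W)
        return tokens.flatten(2).transpose(1, 2)    # (B, H*W, D)

class LocalDistortionInjector(nn.Module):
    """Injects local-distortion tokens into ViT tokens via cross-attention."""
    def __init__(self, vit_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(vit_dim, num_heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(1))   # start as an identity mapping

    def forward(self, vit_tokens, local_tokens):
        injected, _ = self.attn(vit_tokens, local_tokens, local_tokens)
        return vit_tokens + self.gamma * injected   # residual injection

# Only the extractor and injector would be trained; ViT and CNN stay frozen.
```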

TOPIQ: A Top-down Approach from Semantics to Distortions for Image Quality Assessment

Aug 06, 2023
Chaofeng Chen, Jiadi Mo, Jingwen Hou, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, Weisi Lin

Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks. Inspired by the characteristics of the human visual system, existing methods typically use a combination of global and local representations (i.e., multi-scale features) to achieve superior performance. However, most of them adopt a simple linear fusion of multi-scale features and neglect their possibly complex relationships and interactions. In contrast, humans typically first form a global impression to locate important regions and then focus on local details in those regions. We therefore propose a top-down approach, named TOPIQ, that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions. Our approach involves a heuristic coarse-to-fine network (CFANet) that leverages multi-scale features and progressively propagates multi-level semantic information to low-level representations in a top-down manner. A key component is the proposed cross-scale attention mechanism, which calculates attention maps for lower-level features guided by higher-level features. This mechanism emphasizes active semantic regions for low-level distortions, thereby improving performance. CFANet can be used for both Full-Reference (FR) and No-Reference (NR) IQA. Using ResNet50 as its backbone, CFANet achieves better or competitive performance on most public FR and NR benchmarks compared with state-of-the-art methods based on vision transformers, while being much more efficient (only ~13% of the FLOPS of the current best FR method). Code is released at https://github.com/chaofengc/IQA-PyTorch.

* 13 pages, 8 figures, 10 tables. In submission 
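
A minimal sketch of the cross-scale attention idea above: higher-level semantic features act as queries over lower-level features, so semantics decide which low-level locations matter. The class name and dimensions are illustrative and simplify the paper's CFANet design.

```python
# Illustrative cross-scale attention: semantics (queries) select low-level detail.
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, high_feat, low_feat):
        # high_feat: (B, N_high, D) semantic tokens; low_feat: (B, N_low, D)
        q = self.norm_q(high_feat)
        kv = self.norm_kv(low_feat)
        attended, attn_map = self.attn(q, kv, kv)   # attn_map: (B, N_high, N_low)
        return attended, attn_map                    # low-level detail, semantically weighted

x_high = torch.randn(1, 49, 256)    # e.g. pooled deep-stage tokens
x_low = torch.randn(1, 784, 256)    # e.g. projected shallow-stage tokens
out, amap = CrossScaleAttention(256)(x_high, x_low)
```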

Color Image Recovery Using Generalized Matrix Completion over Higher-Order Finite Dimensional Algebra

Aug 04, 2023
Liang Liao, Zhuang Guo, Qi Gao, Yan Wang, Fajun Yu, Qifeng Zhao, Stephen John Maybank

To improve the accuracy of color image completion with missing entries, we present a recovery method based on generalized higher-order scalars. We extend the traditional second-order matrix model to a more comprehensive higher-order equivalent, called the "t-matrix" model, which incorporates a pixel-neighborhood expansion strategy to characterize local pixel constraints. The "t-matrix" model is then used to extend several commonly used matrix and tensor completion algorithms to their higher-order versions. We perform extensive experiments with these algorithms on simulated data and publicly available images and compare their performance. The results show that our generalized matrix completion model and the corresponding algorithm compare favorably with their lower-order tensor and conventional matrix counterparts.

* 24 pages; 9 figures 
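
The sketch below illustrates only the pixel-neighborhood expansion behind the "t-matrix" model: each pixel is replaced by its k x k neighborhood so that every scalar of the usual matrix becomes a small higher-order array. The function name, padding choice, and shapes are assumptions; the paper's algebra and completion algorithms are far more general.

```python
# Toy pixel-neighborhood expansion: (H, W, C) image -> (H, W, k, k, C) t-matrix.
import numpy as np

def neighborhood_expand(image: np.ndarray, k: int = 3) -> np.ndarray:
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    H, W, C = image.shape
    out = np.empty((H, W, k, k, C), dtype=image.dtype)
    for di in range(k):
        for dj in range(k):
            out[:, :, di, dj, :] = padded[di:di + H, dj:dj + W, :]
    return out

img = np.random.rand(64, 64, 3)
t_matrix = neighborhood_expand(img, k=3)   # shape (64, 64, 3, 3, 3)
# Completion would then operate on these t-scalars instead of raw pixels,
# e.g. masking entries and minimizing a generalized low-rank objective.
```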

TransRef: Multi-Scale Reference Embedding Transformer for Reference-Guided Image Inpainting

Jun 21, 2023
Liang Liao, Taorong Liu, Delin Chen, Jing Xiao, Zheng Wang, Chia-Wen Lin, Shin'ichi Satoh

Image inpainting for completing complicated semantic environments and diverse hole patterns of corrupted images is challenging even for state-of-the-art learning-based inpainting methods trained on large-scale data. A reference image capturing the same scene as a corrupted image offers informative guidance for completing it, since the reference shares texture and structure priors similar to those of the corrupted regions. In this work, we propose a transformer-based encoder-decoder network, named TransRef, for reference-guided image inpainting. Specifically, guidance is applied progressively through a reference embedding procedure in which the reference features are aligned and fused with the features of the corrupted image. For precise use of the reference features, a reference-patch alignment (Ref-PA) module is proposed to align the patch features of the reference and corrupted images and harmonize their style differences, while a reference-patch transformer (Ref-PT) module is proposed to refine the embedded reference features. Moreover, to facilitate research on reference-guided image restoration, we construct a publicly accessible benchmark dataset containing 50K pairs of input and reference images. Both quantitative and qualitative evaluations demonstrate the efficacy of the reference information and the superiority of the proposed method over state-of-the-art methods in completing complex holes. Code and dataset can be accessed at https://github.com/Cameltr/TransRef.

* Under review 
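
As a hypothetical simplification of the Ref-PA idea, the sketch below matches the channel statistics of reference tokens to those of the corrupted image (an AdaIN-like harmonization) and aligns them with cross-attention before fusion. It is not the paper's implementation; the module name and fusion choice are assumptions.

```python
# Toy reference-patch alignment: harmonize style, align with cross-attention, fuse.
import torch
import torch.nn as nn

class RefPatchAlign(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(dim * 2, dim)

    @staticmethod
    def harmonize(ref, tgt, eps: float = 1e-5):
        # Match reference token statistics to the target's (style harmonization).
        r_mu, r_std = ref.mean(1, keepdim=True), ref.std(1, keepdim=True) + eps
        t_mu, t_std = tgt.mean(1, keepdim=True), tgt.std(1, keepdim=True) + eps
        return (ref - r_mu) / r_std * t_std + t_mu

    def forward(self, corrupted_tokens, ref_tokens):
        ref_tokens = self.harmonize(ref_tokens, corrupted_tokens)
        aligned, _ = self.attn(corrupted_tokens, ref_tokens, ref_tokens)
        return self.fuse(torch.cat([corrupted_tokens, aligned], dim=-1))

fused = RefPatchAlign(256)(torch.randn(1, 196, 256), torch.randn(1, 196, 256))
```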

Towards Explainable In-the-Wild Video Quality Assessment: a Database and a Language-Prompted Approach

May 22, 2023
Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, Weisi Lin

The proliferation of in-the-wild videos has greatly expanded the scope of the Video Quality Assessment (VQA) problem. Unlike early definitions that usually focus on limited distortion types, VQA for in-the-wild videos is especially challenging, as quality can be affected by complicated factors including various distortions and diverse content. Although subjective studies have collected overall quality scores for these videos, how these abstract scores relate to specific factors remains obscure, hindering VQA methods from providing more concrete quality evaluations (e.g., the sharpness of a video). To address this, we collect over two million opinions on 4,543 in-the-wild videos across 13 dimensions of quality-related factors, including in-capture authentic distortions (e.g., motion blur, noise, flicker), errors introduced by compression and transmission, and higher-level experiences of semantic content and aesthetic issues (e.g., composition, camera trajectory), to establish the multi-dimensional Maxwell database. Specifically, we ask subjects to choose among a positive, a negative, and a neutral option for each dimension. These explanation-level opinions allow us to measure the relationships between specific quality factors and abstract subjective quality ratings, and to benchmark different categories of VQA algorithms on each dimension, so as to analyze their strengths and weaknesses more comprehensively. Furthermore, we propose MaxVQA, a language-prompted VQA approach that modifies the vision-language foundation model CLIP to better capture the important quality issues observed in our analyses. MaxVQA jointly evaluates specific quality factors and final quality scores with state-of-the-art accuracy on all dimensions and superb generalization ability on existing datasets. Code and data are available at https://github.com/VQAssessment/MaxVQA.

* 12 pages (with appendix). Under review, non-finalised version 
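
A toy illustration of language-prompted multi-dimensional scoring with off-the-shelf CLIP: each quality dimension is probed with an antonym prompt pair and scored by the softmax over image-text similarities. The prompt pairs and the file name are assumptions for illustration; MaxVQA itself modifies CLIP rather than using it unchanged.

```python
# Per-dimension antonym prompts scored with off-the-shelf CLIP.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

dimensions = {
    "sharpness": ("a sharp photo", "a blurry photo"),
    "noise": ("a clean photo", "a noisy photo"),
    "exposure": ("a well-exposed photo", "a badly exposed photo"),
}

image = preprocess(Image.open("frame.png")).unsqueeze(0).to(device)
with torch.no_grad():
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    for name, (pos, neg) in dimensions.items():
        txt_feat = model.encode_text(clip.tokenize([pos, neg]).to(device))
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        score = (100 * img_feat @ txt_feat.T).softmax(dim=-1)[0, 0].item()
        print(f"{name}: {score:.3f}")  # closer to 1 = closer to the positive prompt
```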

GCFAgg: Global and Cross-view Feature Aggregation for Multi-view Clustering

May 11, 2023
Weiqing Yan, Yuanyang Zhang, Chenlei Lv, Chang Tang, Guanghui Yue, Liang Liao, Weisi Lin

Multi-view clustering partitions data samples into their categories by learning a consensus representation in an unsupervised way, and it has received increasing attention in recent years. However, most existing deep clustering methods learn consensus or view-specific representations from multiple views via view-wise aggregation, ignoring the structural relationships among all samples. In this paper, we propose a novel multi-view clustering network to address these problems, called Global and Cross-view Feature Aggregation for Multi-View Clustering (GCFAggMVC). Specifically, the consensus representation of multiple views is obtained via cross-sample and cross-view feature aggregation, which fully exploits the complementarity of similar samples. Moreover, we align the consensus representation and the view-specific representations with a structure-guided contrastive learning module, which makes the view-specific representations of samples with strong structural relationships similar. The proposed module is a flexible multi-view data representation module that can also be applied to incomplete multi-view clustering by plugging it into other frameworks. Extensive experiments show that the proposed method achieves excellent performance on both complete and incomplete multi-view data clustering tasks.
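
A loose sketch of the global and cross-view aggregation idea: per-view embeddings are concatenated, a batch-wide sample-to-sample similarity is computed, and each sample's consensus representation aggregates features from structurally related samples. All names and details are illustrative assumptions, not the GCFAggMVC implementation.

```python
# Toy cross-sample, cross-view aggregation into a consensus representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalCrossViewAgg(nn.Module):
    def __init__(self, view_dims, out_dim):
        super().__init__()
        self.proj = nn.Linear(sum(view_dims), out_dim)

    def forward(self, views):                            # list of (N, d_v) tensors
        z = self.proj(torch.cat(views, dim=1))           # (N, out_dim)
        zn = F.normalize(z, dim=1)
        sim = F.softmax(zn @ zn.T, dim=1)                # sample-to-sample structure
        consensus = sim @ z                              # aggregate related samples
        return consensus, sim                            # sim could also guide a contrastive loss

views = [torch.randn(128, 64), torch.randn(128, 32)]    # two views, 128 samples
consensus, structure = GlobalCrossViewAgg([64, 32], 128)(views)
```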

Towards Robust Text-Prompted Semantic Criterion for In-the-Wild Video Quality Assessment

Apr 28, 2023
Haoning Wu, Liang Liao, Annan Wang, Chaofeng Chen, Jingwen Hou, Wenxiu Sun, Qiong Yan, Weisi Lin

The proliferation of videos captured in in-the-wild natural settings has pushed the development of effective Video Quality Assessment (VQA) methodologies. Contemporary supervised, opinion-driven VQA strategies predominantly hinge on training with expensive human quality annotations, which limits the scale and distribution of VQA datasets and consequently leads to unsatisfactory generalization of the methods trained on them. On the other hand, although several handcrafted zero-shot quality indices do not require training from human opinions, they cannot account for the semantics of videos, rendering them ineffective at comprehending complex authentic distortions (e.g., white balance, exposure) and at assessing the quality of semantic content within videos. To address these challenges, we introduce the text-prompted Semantic Affinity Quality Index (SAQI) and its localized version (SAQI-Local), which use Contrastive Language-Image Pre-training (CLIP) to measure the affinity between textual prompts and visual features, enabling a comprehensive examination of semantic quality concerns without relying on human quality annotations. By combining SAQI with existing low-level metrics, we propose the unified Blind Video Quality Index (BVQI) and its improved version, BVQI-Local, which demonstrate unprecedented performance, surpassing existing zero-shot indices by at least 24% on all datasets. Moreover, we devise an efficient fine-tuning scheme for BVQI-Local that jointly optimizes the text prompts and the final fusion weights, resulting in state-of-the-art performance and superior generalization compared with prevalent opinion-driven VQA methods. We conduct comprehensive analyses of the distinct quality concerns captured by the different indices, demonstrating the effectiveness and rationality of our design.

* 13 pages, 10 figures, under review 
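
A minimal, hypothetical sketch of the efficient fine-tuning idea: per-video index scores (semantic affinity plus low-level metrics) are kept fixed and only the final fusion weights are learned against mean opinion scores. Joint tuning of the text prompts, as in the paper, is omitted here, and the data are placeholders.

```python
# Learn only the fusion weights over fixed quality indices (placeholder data).
import torch
import torch.nn as nn

scores = torch.randn(200, 3)      # placeholder: [SAQI, spatial index, temporal index] per video
mos = torch.randn(200)            # placeholder mean opinion scores

fusion = nn.Linear(3, 1)          # the only trainable part in this sketch
opt = torch.optim.Adam(fusion.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(200):
    opt.zero_grad()
    pred = fusion(scores).squeeze(1)
    loss = loss_fn(pred, mos)
    loss.backward()
    opt.step()

print(fusion.weight.data)         # learned contribution of each quality index
```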

Exploring Opinion-unaware Video Quality Assessment with Semantic Affinity Criterion

Feb 26, 2023
Haoning Wu, Liang Liao, Jingwen Hou, Chaofeng Chen, Erli Zhang, Annan Wang, Wenxiu Sun, Qiong Yan, Weisi Lin

Recent learning-based video quality assessment (VQA) algorithms are expensive to implement due to the cost of collecting human quality opinions, and are less robust across various scenarios due to the biases of these opinions. This motivates our exploration of opinion-unaware (a.k.a. zero-shot) VQA approaches. Existing approaches consider only low-level naturalness in the spatial or temporal domain, without considering the impact of high-level semantics. In this work, we introduce an explicit semantic affinity index for opinion-unaware VQA using text prompts in the contrastive language-image pre-training (CLIP) model. We also aggregate it with different traditional low-level naturalness indices through Gaussian normalization and sigmoid rescaling strategies. Composed of the aggregated semantic and technical metrics, the proposed Blind Unified Opinion-Unaware Video Quality Index via Semantic and Technical Metric Aggregation (BUONA-VISTA) outperforms existing opinion-unaware VQA methods by at least 20%, and is more robust than opinion-aware approaches.
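
A small sketch of the aggregation recipe named above: each index is z-scored across the test set (Gaussian normalization), passed through a sigmoid onto a common range, and the rescaled indices are averaged. The equal weighting is an assumption for illustration and may differ from BUONA-VISTA's exact aggregation.

```python
# Gaussian normalization + sigmoid rescaling, then averaging of quality indices.
import numpy as np

def aggregate_indices(index_matrix: np.ndarray) -> np.ndarray:
    """index_matrix: (num_videos, num_indices) raw quality indices."""
    z = (index_matrix - index_matrix.mean(axis=0)) / (index_matrix.std(axis=0) + 1e-8)
    rescaled = 1.0 / (1.0 + np.exp(-z))     # sigmoid rescaling to (0, 1)
    return rescaled.mean(axis=1)            # unified opinion-unaware score per video

raw = np.random.randn(100, 3)               # e.g. [semantic affinity, spatial, temporal]
unified = aggregate_indices(raw)
```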

Disentangling Aesthetic and Technical Effects for Video Quality Assessment of User Generated Content

Nov 16, 2022
Haoning Wu, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, Weisi Lin

User-generated-content (UGC) videos have dominated the Internet in recent years. While it is well recognized that the perceptual quality of these videos can be affected by diverse factors, few existing methods explicitly explore the effects of different factors in video quality assessment (VQA) for UGC videos, i.e., the UGC-VQA problem. In this work, we make the first attempt to disentangle the effects of aesthetic quality issues and technical quality issues arising from the complicated video generation processes in the UGC-VQA problem. To overcome the absence of separate supervision for each type of issue during disentanglement, we propose the Limited View Biased Supervisions (LVBS) scheme, in which two separate evaluators are trained with decomposed views specifically designed for each issue. Composed of an Aesthetic Quality Evaluator (AQE) and a Technical Quality Evaluator (TQE) under the LVBS scheme, the proposed Disentangled Objective Video Quality Evaluator (DOVER) reaches excellent performance (0.91 SRCC on KoNViD-1k, 0.89 SRCC on LSVQ, 0.88 SRCC on YouTube-UGC) on the UGC-VQA problem. More importantly, our blind subjective studies show that the separate evaluators in DOVER can effectively match human perception of the respective disentangled quality issues. Code and demos are released at https://github.com/teowu/dover.

* 19 pages, 18 figures, 20 equations. Version with appendix 
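
The sketch below conveys the "decomposed views" intuition behind LVBS: the aesthetic evaluator sees a heavily downscaled frame (composition and semantics survive, technical detail vanishes), while the technical evaluator sees a mosaic of small native-resolution patches (local distortions survive, global composition is broken). Sizes and the sampling scheme are illustrative and follow the paper only loosely.

```python
# Toy view decomposition for a single frame: aesthetic view vs. technical view.
import torch
import torch.nn.functional as F

def aesthetic_view(frame: torch.Tensor, size: int = 224) -> torch.Tensor:
    # frame: (C, H, W) -> downscaled whole frame (keeps composition/semantics).
    return F.interpolate(frame.unsqueeze(0), size=(size, size),
                         mode="bilinear", align_corners=False).squeeze(0)

def technical_view(frame: torch.Tensor, grid: int = 7, patch: int = 32) -> torch.Tensor:
    # Sample one native-resolution patch from each cell of a grid x grid layout
    # (assumes each cell is larger than the patch) and stitch them into a mosaic.
    C, H, W = frame.shape
    rows, cols = H // grid, W // grid
    patches = []
    for i in range(grid):
        for j in range(grid):
            y = i * rows + torch.randint(0, max(rows - patch, 1), (1,)).item()
            x = j * cols + torch.randint(0, max(cols - patch, 1), (1,)).item()
            patches.append(frame[:, y:y + patch, x:x + patch])
    mosaic_rows = [torch.cat(patches[i * grid:(i + 1) * grid], dim=2) for i in range(grid)]
    return torch.cat(mosaic_rows, dim=1)     # (C, grid*patch, grid*patch)

frame = torch.rand(3, 720, 1280)
a_view, t_view = aesthetic_view(frame), technical_view(frame)
```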