Existing learning-based frame interpolation algorithms extract consecutive frames from high-speed natural videos to train the model. Compared to natural videos, cartoon videos usually have a low frame rate. Besides, the motion between consecutive cartoon frames is typically nonlinear, which breaks the linear motion assumption of interpolation algorithms. Thus, it is unsuitable to generate a training set directly from cartoon videos. To better adapt frame interpolation algorithms from natural video to animation video, we present AutoFI, a simple and effective method that automatically renders training data for deep animation video interpolation. AutoFI adopts a layered architecture to render synthetic data, which guarantees the linear motion assumption. Experimental results show that AutoFI performs favorably in training both DAIN and ANIN. However, most frame interpolation algorithms still fail in error-prone areas, such as fast motion or large occlusion. Besides AutoFI, we also propose a plug-and-play sketch-based post-processing module, named SktFI, to manually refine the final results using user-provided sketches. With AutoFI and SktFI, the interpolated animation frames show high perceptual quality.
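A minimal sketch of the layered-rendering idea, assuming a single foreground layer composited onto a background with purely linear displacement so that the middle frame is an exact interpolation target; the actual AutoFI pipeline may use multiple layers and richer sprite content, and the function and parameter names below are illustrative only.

```python
# Toy layered rendering: a foreground patch moves linearly over a background,
# producing (frame0, frame1, frame2) where frame1 is the temporal midpoint.
import numpy as np

def render_triplet(background, foreground, mask, start_xy, displacement):
    """Return three frames with linear foreground motion; frame1 is the
    ground-truth middle frame for training an interpolation network."""
    frames = []
    for t in (0.0, 0.5, 1.0):
        x = int(start_xy[0] + t * displacement[0])
        y = int(start_xy[1] + t * displacement[1])
        canvas = background.copy()
        h, w = foreground.shape[:2]
        region = canvas[y:y + h, x:x + w]
        canvas[y:y + h, x:x + w] = mask * foreground + (1 - mask) * region
        frames.append(canvas)
    return frames  # (input frame 0, target frame, input frame 1)

# toy usage
bg = np.zeros((256, 256, 3), dtype=np.float32)
fg = np.ones((32, 32, 3), dtype=np.float32)
m = np.ones((32, 32, 1), dtype=np.float32)
f0, gt, f1 = render_triplet(bg, fg, m, start_xy=(20, 20), displacement=(60, 10))
```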
With the development of rendering techniques, computer graphics generated images (CGIs) have been widely used in practical application scenarios such as architecture design, video games, simulators, and movies. Different from natural scene images (NSIs), the distortions of CGIs are usually caused by poor rendering settings and limited computational resources. Moreover, some CGIs may also suffer from compression distortions in transmission systems such as cloud gaming and streaming media. However, limited work has been devoted to the quality assessment of computer graphics generated images (CG-IQA). Therefore, in this paper, we establish a large-scale subjective CG-IQA database to deal with the challenge of CG-IQA tasks. We collect 25,454 in-the-wild CGIs from previous databases and personal collections. After data cleaning, we carefully select 1,200 CGIs to conduct the subjective experiment. Several popular no-reference image quality assessment (NR-IQA) methods are tested on our database. The experimental results show that handcrafted feature-based methods achieve low correlation with subjective judgments while deep learning-based methods obtain relatively better performance, which demonstrates that current NR-IQA models are not suitable for CG-IQA tasks and more effective models are urgently needed.
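For context, the correlation-based evaluation used to benchmark NR-IQA methods on such a database can be sketched as below. This is a simplified illustration (PLCC is often reported after a nonlinear logistic fitting step, which is omitted here), and the toy data are not from the actual database.

```python
# Standard IQA benchmark statistics: SROCC, KROCC, PLCC, and RMSE between
# subjective mean opinion scores (MOS) and model predictions.
import numpy as np
from scipy.stats import spearmanr, kendalltau, pearsonr

def evaluate_predictions(mos, pred):
    mos, pred = np.asarray(mos, dtype=float), np.asarray(pred, dtype=float)
    srocc = spearmanr(mos, pred)[0]
    krocc = kendalltau(mos, pred)[0]
    plcc = pearsonr(mos, pred)[0]
    rmse = float(np.sqrt(np.mean((mos - pred) ** 2)))
    return {"SROCC": srocc, "KROCC": krocc, "PLCC": plcc, "RMSE": rmse}

# toy usage with synthetic scores
mos = np.random.uniform(1, 5, 100)
pred = mos + np.random.normal(0, 0.3, 100)
print(evaluate_predictions(mos, pred))
```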
4K content can deliver a more immersive visual experience to consumers due to its much higher spatial resolution. However, existing blind image quality assessment (BIQA) methods are not suitable for original and upscaled 4K content because of the expanded resolution and specific distortions. In this paper, we propose a deep learning-based BIQA model for 4K content, which on the one hand can recognize true and pseudo 4K content and on the other hand can evaluate their perceptual visual quality. Considering that high spatial resolution can carry more abundant high-frequency information, we first propose a Grey-level Co-occurrence Matrix (GLCM) based texture complexity measure to select three representative image patches from a 4K image, which reduces the computational complexity and is proven through experiments to be very effective for overall quality prediction. Then we extract different kinds of visual features from the intermediate layers of a convolutional neural network (CNN) and integrate them into a quality-aware feature representation. Finally, two multilayer perceptron (MLP) networks are used to map the quality-aware features into the class probability and the quality score for each patch, respectively. The overall quality index is obtained through average pooling of the patch results. The proposed model is trained in a multi-task learning manner, and we introduce an uncertainty-based weighting to balance the losses of the classification and regression tasks. The experimental results show that the proposed model outperforms all compared BIQA metrics on four 4K content quality assessment databases.
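A rough sketch of GLCM-based patch selection, assuming the entropy of the grey-level co-occurrence matrix as the texture-complexity measure and non-overlapping patches; the paper's exact complexity measure, patch size, and scanning strategy may differ.

```python
# Select the top-k most texture-complex patches of a (uint8, grayscale) image
# using GLCM entropy as the complexity score. Requires scikit-image.
import numpy as np
from skimage.feature import graycomatrix

def glcm_complexity(patch_gray):
    glcm = graycomatrix(patch_gray, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    p = glcm[..., 0, 0]
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())  # entropy of the co-occurrence matrix

def select_patches(image_gray, patch_size=240, top_k=3):
    h, w = image_gray.shape
    patches, scores = [], []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patch = image_gray[y:y + patch_size, x:x + patch_size]
            patches.append(patch)
            scores.append(glcm_complexity(patch))
    order = np.argsort(scores)[::-1][:top_k]
    return [patches[i] for i in order]
```

Only the selected patches are passed to the CNN, which is why the selection step cuts the cost of processing a full 4K image.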
The intelligent video surveillance system (IVSS) can automatically analyze the content of surveillance images (SIs) and reduce the burden of manual labour. However, SIs may suffer quality degradations during acquisition, compression, and transmission, which makes it hard for the IVSS to understand their content. In this paper, we first conduct an example experiment (i.e., a face detection task) to demonstrate that the quality of SIs has a crucial impact on the performance of the IVSS, and then propose a saliency-based deep neural network for the blind quality assessment of SIs, which helps the IVSS filter out low-quality SIs and improve detection and recognition performance. Specifically, we first compute the saliency map of the SI to select the most salient local region, since salient regions usually contain rich semantic information for machine vision and thus have a great impact on the overall quality of SIs. Next, a convolutional neural network (CNN) is adopted to extract quality-aware features for the whole image and the local region, which are then mapped into global and local quality scores through fully connected (FC) networks, respectively. Finally, the overall quality score is computed as the weighted sum of the global and local quality scores. Experimental results on the SI quality database (SIQD) show that the proposed method outperforms all compared state-of-the-art blind image quality assessment (BIQA) methods.
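A minimal PyTorch sketch of the global-plus-local quality branches with weighted-sum fusion. The backbone (ResNet-18), shared weights between branches, and the fusion weight alpha are assumptions for illustration, not the paper's exact configuration; saliency-based cropping is assumed to happen outside the model.

```python
import torch
import torch.nn as nn
from torchvision import models

class GlobalLocalIQA(nn.Module):
    def __init__(self, alpha=0.5):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the 512-d pooled features
        self.backbone = backbone
        self.global_head = nn.Linear(512, 1)  # global quality score
        self.local_head = nn.Linear(512, 1)   # local (salient-region) score
        self.alpha = alpha                    # fusion weight (assumed)

    def forward(self, full_image, salient_crop):
        q_global = self.global_head(self.backbone(full_image))
        q_local = self.local_head(self.backbone(salient_crop))
        return self.alpha * q_global + (1 - self.alpha) * q_local
```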
The point cloud is one of the most widely used digital formats for 3D models, whose visual quality is quite sensitive to distortions such as downsampling, noise, and compression. To tackle the challenge of point cloud quality assessment (PCQA) in scenarios where no reference is available, we propose a no-reference quality assessment metric for colored point clouds based on captured video sequences. Specifically, three video sequences are obtained by rotating the camera around the point cloud along three specific orbits. The video sequences not only contain static views but also include multi-frame temporal information, which greatly helps in understanding human perception of point clouds. Then we modify ResNet3D as the feature extraction model to learn the correlation between the captured videos and the corresponding subjective quality scores. The experimental results show that our method outperforms most state-of-the-art full-reference and no-reference PCQA metrics, which validates the effectiveness of the proposed method.
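A rough sketch of regressing a quality score from the captured orbit videos with a 3D CNN. Here the off-the-shelf torchvision `r3d_18` stands in for the modified ResNet3D, and averaging the three orbit scores is an assumed pooling choice, not necessarily the paper's.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class VideoPCQA(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = r3d_18(weights=None)
        # replace the classification head with a single quality output
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, clips):
        # clips: (batch, 3, frames, H, W), one rotation orbit per clip
        return self.backbone(clips)

# toy usage: three orbit clips of one point cloud, pooled into one score
model = VideoPCQA()
clips = torch.randn(3, 3, 16, 112, 112)
score = model(clips).mean()
```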
To support application scenarios where high-resolution (HR) images are urgently needed, various single image super-resolution (SISR) algorithms have been developed. However, SISR is an ill-posed inverse problem and may introduce artifacts such as texture shift and blur into the reconstructed images; thus it is necessary to evaluate the quality of super-resolution images (SRIs). Note that most existing image quality assessment (IQA) methods were developed for synthetically distorted images and may not work for SRIs, whose distortions are more diverse and complicated. Therefore, in this paper, we propose a no-reference deep learning-based image quality assessment method built on frequency maps, because the artifacts caused by SISR algorithms are quite sensitive to frequency information. Specifically, we first obtain the high-frequency map (HM) and low-frequency map (LM) of an SRI using the Sobel operator and piecewise smooth image approximation, respectively. Then, a two-stream network is employed to extract the quality-aware features of both frequency maps. Finally, the features are regressed into a single quality value using fully connected layers. The experimental results show that our method outperforms all compared IQA models on three selected super-resolution quality assessment (SRQA) databases.
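A simplified sketch of how the two frequency maps could be derived: the Sobel gradient magnitude gives the high-frequency map, while a Gaussian-smoothed image is used here as a crude stand-in for the piecewise smooth image approximation described in the abstract.

```python
# Compute (HM, LM) for a grayscale image; the LM computation is a
# simplification of piecewise smooth approximation, used for illustration.
import numpy as np
from scipy.ndimage import sobel, gaussian_filter

def frequency_maps(gray):
    gx = sobel(gray, axis=1)
    gy = sobel(gray, axis=0)
    high_freq = np.hypot(gx, gy)               # high-frequency map (HM)
    low_freq = gaussian_filter(gray, sigma=3)  # stand-in low-frequency map (LM)
    return high_freq, low_freq
```

Each map is then fed into one stream of the two-stream network, and the resulting features are concatenated before regression.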
Recent years have witnessed the rapid development of image storage and transmission systems, in which image compression plays an important role. Generally speaking, image compression algorithms are developed to ensure good visual quality at limited bit rates. However, due to different compression optimization methods, the compressed images may have different levels of quality, which need to be evaluated quantitatively. Nowadays, mainstream full-reference (FR) metrics are effective in predicting the quality of compressed images at coarse-grained levels (where the bit rate differences between compressed images are obvious); however, they may perform poorly for fine-grained compressed images whose bit rate differences are quite subtle. Therefore, to better improve the Quality of Experience (QoE) and provide useful guidance for compression algorithms, we propose a full-reference image quality assessment (FR-IQA) method for compressed images at fine-grained levels. Specifically, the reference images and compressed images are first converted to the $YCbCr$ color space. Gradient features are extracted from regions that are sensitive to compression artifacts. Then we employ the Log-Gabor transformation to further analyze the texture difference. Finally, the obtained features are fused into a quality score. The proposed method is validated on the fine-grained compression image quality assessment (FGIQA) database, which is specifically constructed for assessing the quality of compressed images with close bit rates. The experimental results show that our metric outperforms mainstream FR-IQA metrics on the FGIQA database. We also test our method on other commonly used compression IQA databases and the results show that our method obtains competitive performance on coarse-grained compression IQA databases as well.
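A schematic sketch of the gradient-feature part of such a full-reference pipeline: convert both images to luma, compare gradient magnitudes, and pool the similarity map into a score. The Log-Gabor stage and the artifact-sensitive region selection are omitted, and the stability constant is an arbitrary placeholder.

```python
# Gradient-similarity comparison between a reference and a compressed image.
import numpy as np
from scipy.ndimage import sobel

def rgb_to_y(img):
    # luma channel of YCbCr (BT.601 weights), img in float RGB
    return 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]

def gradient_magnitude(y):
    return np.hypot(sobel(y, axis=0), sobel(y, axis=1))

def gradient_similarity_score(reference_rgb, compressed_rgb, eps=1e-4):
    g_ref = gradient_magnitude(rgb_to_y(reference_rgb))
    g_cmp = gradient_magnitude(rgb_to_y(compressed_rgb))
    sim = (2 * g_ref * g_cmp + eps) / (g_ref ** 2 + g_cmp ** 2 + eps)
    return float(sim.mean())  # higher means smaller gradient difference
```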
We present a novel vision Transformer, named TUTOR, which learns tubelet tokens that serve as highly abstracted spatiotemporal representations for video-based human-object interaction (V-HOI) detection. The tubelet tokens structurize videos by agglomerating and linking semantically related patch tokens along the spatial and temporal domains, which offers two benefits: 1) Compactness: each tubelet token is learned by a selective attention mechanism to reduce redundant spatial dependencies from others; 2) Expressiveness: each tubelet token is able to align with a semantic instance, i.e., an object or a human, across frames, thanks to agglomeration and linking. The effectiveness and efficiency of TUTOR are verified by extensive experiments. Results show that our method outperforms existing works by large margins, with a relative mAP gain of $16.14\%$ on VidHOI and a gain of 2 points on CAD-120, as well as a $4 \times$ speedup.
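A toy sketch of the linking step only: matching tokens across consecutive frames by cosine similarity so that semantically related patch tokens can be chained into tubelets. This illustrates the general idea of temporal linking and is not the TUTOR architecture or its selective attention mechanism.

```python
import torch
import torch.nn.functional as F

def link_tokens(tokens_t, tokens_t1):
    """tokens_t, tokens_t1: (num_tokens, dim) token embeddings of two
    consecutive frames. Returns, for each token at frame t, the index of its
    most similar token at frame t+1."""
    sim = F.normalize(tokens_t, dim=-1) @ F.normalize(tokens_t1, dim=-1).T
    return sim.argmax(dim=-1)

# chaining these matches over all frame pairs yields tubelets of linked tokens
```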
Blind image quality assessment (BIQA), which aims to accurately predict image quality without any pristine reference information, has attracted considerable attention over the past decades. In particular, with the help of deep neural networks, great progress has been achieved. However, BIQA for night-time images (NTIs), which usually suffer from complicated authentic distortions such as reduced visibility, low contrast, additive noise, and color distortions, remains less investigated. These diverse authentic degradations particularly challenge the design of effective deep neural networks for blind NTI quality evaluation (NTIQE). In this paper, we propose a novel deep decomposition and bilinear pooling network (DDB-Net) to better address this issue. The DDB-Net contains three modules: an image decomposition module, a feature encoding module, and a bilinear pooling module. The image decomposition module is inspired by the Retinex theory and decouples the input NTI into an illumination layer component responsible for illumination information and a reflectance layer component responsible for content information. Then, the feature encoding module learns multi-scale feature representations of the degradations rooted in the two decoupled components separately. Finally, by modeling illumination-related and content-related degradations as two-factor variations, the two multi-scale feature sets are bilinearly pooled and concatenated to form a unified representation for quality prediction. The superiority of the proposed DDB-Net is well validated by extensive experiments on two publicly available night-time image databases.
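A minimal sketch of the bilinear pooling step only, assuming the illumination and reflectance branches each produce a single feature vector: their outer product models the two-factor variations, followed by signed square-root and L2 normalization (a common bilinear-pooling recipe, assumed rather than taken from the paper) and a linear regressor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearQualityHead(nn.Module):
    def __init__(self, dim_illum=256, dim_reflect=256):
        super().__init__()
        self.regressor = nn.Linear(dim_illum * dim_reflect, 1)

    def forward(self, feat_illum, feat_reflect):
        # outer product per sample: (B, D1, 1) x (B, 1, D2) -> (B, D1 * D2)
        bilinear = torch.bmm(feat_illum.unsqueeze(2), feat_reflect.unsqueeze(1))
        bilinear = bilinear.flatten(1)
        bilinear = torch.sign(bilinear) * torch.sqrt(bilinear.abs() + 1e-8)
        bilinear = F.normalize(bilinear, dim=1)
        return self.regressor(bilinear)
```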
Quality assessment of User Generated Content (UGC) videos plays an important role in ensuring the viewing experience of end-users. Previous UGC video quality assessment (VQA) studies either use image recognition models or image quality assessment (IQA) models to extract frame-level features of UGC videos for quality regression, which are sub-optimal solutions because of the domain shift between those tasks and the UGC VQA task. In this paper, we propose a very simple but effective UGC VQA model, which addresses this problem by training an end-to-end spatial feature extraction network to directly learn the quality-aware spatial feature representation from the raw pixels of video frames. We also extract motion features to measure temporal-related distortions that the spatial features cannot model. The proposed model utilizes very sparse frames to extract spatial features and dense frames (i.e., a video chunk) at a very low spatial resolution to extract motion features, and thereby has low computational complexity. With these better quality-aware features, we use only a simple multilayer perceptron (MLP) network to regress them into chunk-level quality scores, and then adopt a temporal average pooling strategy to obtain the video-level quality score. We further introduce a multi-scale quality fusion strategy to solve the problem of VQA across different spatial resolutions, where the multi-scale weights are obtained from the contrast sensitivity function of the human visual system. The experimental results show that the proposed model achieves the best performance on five popular UGC VQA databases, which demonstrates the effectiveness of the proposed model. The code will be publicly available.
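A schematic sketch of the two-branch idea at a single spatial scale: spatial features from one sparse full-resolution frame per chunk, motion features from the dense low-resolution chunk, an MLP for chunk-level scores, and temporal average pooling over chunks. ResNet-18 and r3d_18 are stand-in backbones, and the multi-scale quality fusion is omitted.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from torchvision.models.video import r3d_18

class SimpleUGCVQA(nn.Module):
    def __init__(self):
        super().__init__()
        spatial = resnet18(weights=None)
        spatial.fc = nn.Identity()     # 512-d spatial features per frame
        motion = r3d_18(weights=None)
        motion.fc = nn.Identity()      # 512-d motion features per chunk
        self.spatial, self.motion = spatial, motion
        self.mlp = nn.Sequential(nn.Linear(512 + 512, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, sparse_frames, dense_chunks):
        # sparse_frames: (chunks, 3, H, W) full-resolution key frames
        # dense_chunks:  (chunks, 3, T, h, w) low-resolution dense frames
        feat = torch.cat([self.spatial(sparse_frames),
                          self.motion(dense_chunks)], dim=1)
        chunk_scores = self.mlp(feat)
        return chunk_scores.mean()     # temporal average pooling -> video score
```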