Wen Gao

Textural-Structural Joint Learning for No-Reference Super-Resolution Image Quality Assessment

May 27, 2022

Yuqing Liu, Qi Jia, Shanshe Wang, Siwei Ma, Wen Gao

Abstract: Image super-resolution (SR) has been widely investigated in recent years. However, it is challenging to fairly estimate the performance of various SR methods, owing to the lack of reliable and accurate criteria for perceptual quality. Existing SR image quality assessment (IQA) metrics usually concentrate on a specific kind of degradation without distinguishing the visually sensitive areas, and so cannot adapt to the diverse SR degradation situations. In this paper, we focus on the textural and structural degradation of image SR, which plays a critical role in visual perception, and design a dual-stream network, dubbed TSNet, to jointly explore the textural and structural information for quality prediction. By mimicking the human vision system (HVS), which pays more attention to the significant areas of an image, we develop a spatial attention mechanism to make the visually sensitive areas more distinguishable, which improves the prediction accuracy. Feature normalization (F-Norm) is also developed to investigate the inherent spatial correlation of SR features and boost the network's representation capacity. Experimental results show that the proposed TSNet predicts visual quality more accurately than state-of-the-art IQA methods and demonstrates better consistency with human perception. The source code will be made available at http://github.com/yuqing-liu-dut/NRIQA_SR.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Learning Weighting Map for Bit-Depth Expansion within a Rational Range

Apr 26, 2022

Yuqing Liu, Qi Jia, Jian Zhang, Xin Fan, Shanshe Wang, Siwei Ma, Wen Gao

Abstract: Bit-depth expansion (BDE) is one of the emerging technologies for displaying high bit-depth (HBD) images from low bit-depth (LBD) sources. Existing BDE methods have no unified solution for various BDE situations, and directly learn a mapping for each pixel from the LBD image to the desired value in the HBD image, which may change the given high-order bits and lead to a huge deviation from the ground truth. In this paper, we design a bit restoration network (BRNet) to learn a weight for each pixel, which indicates the ratio of the replenished value within a rational range, yielding an accurate solution without modifying the given high-order bit information. To make the network adaptive to any bit-depth degradation, we investigate the issue from an optimization perspective and train the network under a progressive training strategy for better performance. Moreover, we employ the Wasserstein distance as a visual quality indicator to evaluate the difference in color distribution between the restored image and the ground truth. Experimental results show that our method can restore colorful images with fewer artifacts and false contours, and outperforms state-of-the-art methods with higher PSNR/SSIM results and lower Wasserstein distance. The source code will be made available at https://github.com/yuqing-liu-dut/bit-depth-expansion

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
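
The idea of predicting a per-pixel weight within a rational range, instead of the HBD value itself, can be illustrated with a minimal sketch. It assumes that the rational range is the set of values reachable by filling in the missing low-order bits and that the weight lies in [0, 1]. The function name `restore_pixel` and this exact formulation are illustrative assumptions, not the BRNet implementation.

```python
def restore_pixel(lbd_value: int, weight: float, lbd_bits: int, hbd_bits: int) -> int:
    """Expand one low bit-depth pixel to a high bit-depth value.

    Hypothetical illustration: the known bits become the high-order bits of
    the result, and `weight` (assumed to lie in [0, 1]) selects a value for
    the missing low-order bits.
    """
    if hbd_bits <= lbd_bits:
        raise ValueError("hbd_bits must exceed lbd_bits")
    missing_bits = hbd_bits - lbd_bits
    base = lbd_value << missing_bits        # given high-order bits, left untouched
    span = (1 << missing_bits) - 1          # largest value the lost bits can hold
    weight = min(max(weight, 0.0), 1.0)     # clamp so the result stays in range
    return base + int(weight * span + 0.5)  # round to the nearest integer


restored = restore_pixel(lbd_value=5, weight=0.5, lbd_bits=4, hbd_bits=8)
# Shifting back recovers the original 4-bit value, whatever the weight is.
assert restored >> 4 == 5
```

Because the weight only selects the low-order bits, no choice of weight can alter the high-order bits supplied by the LBD source, which is the property the abstract emphasizes.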

STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond

Apr 20, 2022

Zheng Chang, Xinfeng Zhang, Shanshe Wang, Siwei Ma, Wen Gao

Abstract: Video prediction aims to predict future frames by modeling the complex spatiotemporal dynamics in videos. However, most existing methods model the temporal and spatial information of videos independently and have not fully explored the correlations between the two. In this paper, we propose a SpatioTemporal-Aware Unit (STAU) for video prediction and beyond by exploring the significant spatiotemporal correlations in videos. On the one hand, motion-aware attention weights are learned from the spatial states to help aggregate the temporal states in the temporal domain. On the other hand, appearance-aware attention weights are learned from the temporal states to help aggregate the spatial states in the spatial domain. In this way, the temporal and spatial information become aware of each other in both domains, and the spatiotemporal receptive field is also greatly broadened for more reliable spatiotemporal modeling. Experiments are conducted not only on traditional video prediction tasks but also on tasks beyond video prediction, including early action recognition and object detection. Experimental results show that STAU outperforms other methods on all tasks in terms of both performance and computational efficiency.

* This work has been submitted to TPAMI

STRPM: A Spatiotemporal Residual Predictive Model for High-Resolution Video Prediction

Mar 30, 2022

Zheng Chang, Xinfeng Zhang, Shanshe Wang, Siwei Ma, Wen Gao

Abstract: Although many video prediction methods have obtained good performance on low-resolution (64$\sim$128) videos, predictive models for high-resolution (512$\sim$4K) videos, which are more meaningful given the increasing demand for high-quality videos, have not been fully explored yet. Compared with low-resolution videos, high-resolution videos contain richer appearance (spatial) information and more complex motion (temporal) information. In this paper, we propose a Spatiotemporal Residual Predictive Model (STRPM) for high-resolution video prediction. On the one hand, we propose a Spatiotemporal Encoding-Decoding Scheme to preserve more spatiotemporal information for high-resolution videos. In this way, the appearance details of each frame can be largely preserved. On the other hand, we design a Residual Predictive Memory (RPM) that focuses on modeling the spatiotemporal residual features (STRF) between previous and future frames instead of the whole frame, which greatly helps capture the complex motion information in high-resolution videos. In addition, the proposed RPM can supervise the spatial encoder and the temporal encoder to extract different features in the spatial domain and the temporal domain, respectively. Moreover, the proposed model is trained using generative adversarial networks (GANs) with a learned perceptual loss (LP-loss) to improve the perceptual quality of the predictions. Experimental results show that STRPM generates more satisfactory results than various existing methods.

* This work has been accepted by CVPR 2022

Gradient Correction beyond Gradient Descent

Mar 16, 2022

Zefan Li, Bingbing Ni, Teng Li, WenJun Zhang, Wen Gao

Abstract: The great success that neural networks have achieved is inseparable from the application of gradient-descent (GD) algorithms. Based on GD, many variant algorithms have emerged to improve the GD optimization process. The gradient for back-propagation is apparently the most crucial aspect of training a neural network. The quality of the calculated gradient can be affected by multiple factors, e.g., noisy data, calculation error, and algorithm limitations. To reveal gradient information beyond gradient descent, we introduce a framework (GCGD) to perform gradient correction. GCGD consists of two plug-in modules: 1) inspired by the idea of gradient prediction, we propose a GC-W module for weight gradient correction; 2) based on Neural ODE, we propose a GC-ODE module for hidden-state gradient correction. Experimental results show that our gradient correction framework can effectively improve the gradient quality, reducing training epochs by about 20% while also improving network performance.

P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation

Mar 15, 2022

Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, Wen Gao

Abstract: This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for the 2D-to-3D human pose estimation task. To reduce the difficulty of capturing spatial and temporal information, we divide this task into two stages: pre-training (Stage I) and fine-tuning (Stage II). In Stage I, a self-supervised pre-training sub-task, termed masked pose modeling, is proposed. The human joints in the input sequence are randomly masked in both the spatial and temporal domains. A general form of denoising auto-encoder is exploited to recover the original 2D poses, and in this way the encoder becomes capable of capturing spatial and temporal dependencies. In Stage II, the pre-trained encoder is loaded into the STMO model and fine-tuned. The encoder is followed by a many-to-one frame aggregator to predict the 3D pose in the current frame. In particular, an MLP block is utilized as the spatial feature extractor in STMO, which yields better performance than other methods. In addition, a temporal downsampling strategy is proposed to diminish data redundancy. Extensive experiments on two benchmarks show that our method outperforms state-of-the-art methods with fewer parameters and less computational overhead. For example, our P-STMO model achieves 42.1mm MPJPE on the Human3.6M dataset when using 2D poses from CPN as inputs. Meanwhile, it brings a 1.5-7.1 times speedup over state-of-the-art methods. Code is available at https://github.com/paTRICK-swk/P-STMO.

* 25 pages
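
The MPJPE figure quoted in the abstract refers to the mean per-joint position error, i.e. the average Euclidean distance between predicted and ground-truth 3D joints. A minimal sketch of that metric for a single pose follows. The function name `mpjpe` is illustrative, and the sketch omits any root alignment or averaging over frames that a benchmark protocol may apply, so it should not be read as the paper's evaluation code.

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]


def mpjpe(predicted: Sequence[Joint], target: Sequence[Joint]) -> float:
    """Mean Euclidean distance between corresponding 3D joints of one pose."""
    if not predicted or len(predicted) != len(target):
        raise ValueError("poses must be non-empty and have the same number of joints")
    # math.dist computes the Euclidean distance between two points.
    total = sum(math.dist(p, t) for p, t in zip(predicted, target))
    return total / len(predicted)


# Two joints: the first is off by 5 units, the second matches exactly.
predicted_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target_pose = [(0.0, 3.0, 4.0), (1.0, 0.0, 0.0)]
error = mpjpe(predicted_pose, target_pose)  # (5 + 0) / 2
```

The result is in the same unit as the joint coordinates, so coordinates given in millimetres yield an error in millimetres.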

Cross-SRN: Structure-Preserving Super-Resolution Network with Cross Convolution

Jan 07, 2022

Yuqing Liu, Qi Jia, Xin Fan, Shanshe Wang, Siwei Ma, Wen Gao

Abstract: It is challenging to restore low-resolution (LR) images to super-resolution (SR) images with correct and clear details. Existing deep learning works largely neglect the inherent structural information of images, which plays an important role in the visual perception of SR results. In this paper, we design a hierarchical feature exploitation network to probe and preserve structural information in a multi-scale feature fusion manner. First, we propose a cross convolution built upon traditional edge detectors to localize and represent edge features. Then, cross convolution blocks (CCBs) are designed with feature normalization and channel attention to consider the inherent correlations of features. Finally, we leverage a multi-scale feature fusion group (MFFG) to embed the cross convolution blocks and develop the relations of structural features at different scales hierarchically, yielding a lightweight structure-preserving network named Cross-SRN. Experimental results demonstrate that Cross-SRN achieves competitive or superior restoration performance against state-of-the-art methods, with accurate and clear structural details. Moreover, we set a criterion to select images with rich structural textures. The proposed Cross-SRN outperforms state-of-the-art methods on the selected benchmark, which demonstrates that our network has a significant advantage in preserving edges.

ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation

Dec 23, 2021

Shuohuan Wang, Yu Sun, Yang Xiang, Zhihua Wu, Siyu Ding, Weibao Gong, Shikun Feng, Junyuan Shang, Yanbin Zhao, Chao Pang (+19 more)

Abstract: Pre-trained language models have achieved state-of-the-art results in various Natural Language Processing (NLP) tasks. GPT-3 has shown that scaling up pre-trained language models can further exploit their enormous potential. A unified framework named ERNIE 3.0 was recently proposed for pre-training large-scale knowledge-enhanced models, and a model with 10 billion parameters was trained with it. ERNIE 3.0 outperformed the state-of-the-art models on various NLP tasks. In order to explore the performance of scaling up ERNIE 3.0, we train a hundred-billion-parameter model called ERNIE 3.0 Titan with up to 260 billion parameters on the PaddlePaddle platform. Furthermore, we design a self-supervised adversarial loss and a controllable language modeling loss to make ERNIE 3.0 Titan generate credible and controllable texts. To reduce the computation overhead and carbon emissions, we propose an online distillation framework for ERNIE 3.0 Titan, in which the teacher model teaches students and trains itself simultaneously. ERNIE 3.0 Titan is the largest Chinese dense pre-trained model so far. Empirical results show that ERNIE 3.0 Titan outperforms the state-of-the-art models on 68 NLP datasets.

* arXiv admin note: text overlap with arXiv:2107.02137

Towards End-to-End Image Compression and Analysis with Transformers

Dec 17, 2021

Yuanchao Bai, Xu Yang, Xianming Liu, Junjun Jiang, Yaowei Wang, Xiangyang Ji, Wen Gao

Abstract: We propose an end-to-end image compression and analysis model with Transformers, targeting cloud-based image classification applications. Instead of placing an existing Transformer-based image classification model directly after an image codec, we aim to redesign the Vision Transformer (ViT) model to perform image classification from the compressed features and to facilitate image compression with the long-term information from the Transformer. Specifically, we first replace the patchify stem (i.e., image splitting and embedding) of the ViT model with a lightweight image encoder modelled by a convolutional neural network. The compressed features generated by the image encoder carry a convolutional inductive bias and are fed to the Transformer for image classification, bypassing image reconstruction. Meanwhile, we propose a feature aggregation module to fuse the compressed features with selected intermediate features of the Transformer, and feed the aggregated features to a deconvolutional neural network for image reconstruction. The aggregated features can obtain the long-term information from the self-attention mechanism of the Transformer and improve the compression performance. The rate-distortion-accuracy optimization problem is finally solved by a two-step training strategy. Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the classification tasks.

* Accepted by AAAI 2022; Code: https://github.com/BYchao100/Towards-Image-Compression-and-Analysis-with-Transformers

Improving Robustness and Accuracy via Relative Information Encoding in 3D Human Pose Estimation

Jul 29, 2021

Wenkang Shan, Haopeng Lu, Shanshe Wang, Xinfeng Zhang, Wen Gao

Abstract: Most existing 3D human pose estimation approaches focus mainly on predicting 3D positional relationships between the root joint and other human joints (local motion) instead of the overall trajectory of the human body (global motion). Despite the great progress achieved by these approaches, they are not robust to global motion, and lack the ability to accurately predict local motion with a small movement range. To alleviate these two problems, we propose a relative information encoding method that yields positional and temporal enhanced representations. First, we encode positional information by utilizing relative coordinates of 2D poses to enhance the consistency between the input and output distributions. The same posture with different absolute 2D positions can be mapped to a common representation, which helps resist the interference of global motion on the prediction results. Second, we encode temporal information by establishing the connection between the current pose and other poses of the same person within a period of time. More attention is paid to the movement changes before and after the current pose, resulting in better prediction performance on local motion with a small movement range. The ablation studies validate the effectiveness of the proposed relative information encoding method. In addition, we introduce a multi-stage optimization method to the whole framework to further exploit the positional and temporal enhanced representations. Our method outperforms state-of-the-art methods on two public datasets. Code is available at https://github.com/paTRICK-swk/Pose3D-RIE.

* In Proceedings of the 29th ACM International Conference on Multimedia (MM '21)
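
The claim that the same posture at different absolute 2D positions maps to a common representation can be illustrated with a minimal sketch. It assumes the relative coordinates are obtained by subtracting a root joint from every joint; the function name `to_root_relative` and the choice of reference joint are illustrative assumptions, and the paper's actual encoding may differ.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def to_root_relative(pose: Sequence[Point], root_index: int = 0) -> List[Point]:
    """Express every 2D joint relative to the chosen root joint.

    Hypothetical illustration: translating the whole pose in the image
    leaves this representation unchanged.
    """
    root_x, root_y = pose[root_index]
    return [(x - root_x, y - root_y) for x, y in pose]


pose_a = [(10, 20), (12, 25), (8, 30)]
pose_b = [(110, 70), (112, 75), (108, 80)]  # pose_a shifted by (100, 50)

# Both poses describe the same posture, so their encodings coincide.
assert to_root_relative(pose_a) == to_root_relative(pose_b)
```

Removing the absolute position in this way is one simple route to making the input invariant to where the person stands in the frame, which is the robustness to global motion that the abstract describes.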