Shanxin Yuan

Exploring Effective Mask Sampling Modeling for Neural Image Compression

Jun 09, 2023
Lin Liu, Mingming Zhao, Shanxin Yuan, Wenlong Lyu, Wengang Zhou, Houqiang Li, Yanfeng Wang, Qi Tian

Image compression aims to reduce the information redundancy in images. Most existing neural image compression methods rely on side information from hyperprior or context models to eliminate spatial redundancy, but rarely address channel redundancy. Inspired by the mask sampling modeling in recent self-supervised learning methods for natural language processing and high-level vision, we propose a novel pretraining strategy for neural image compression. Specifically, a Cube Mask Sampling Module (CMSM) is proposed to apply both spatial and channel mask sampling modeling to image compression in the pre-training stage. Moreover, to further reduce channel redundancy, we propose the Learnable Channel Mask Module (LCMM) and the Learnable Channel Completion Module (LCCM). Our plug-and-play CMSM, LCMM, and LCCM modules can be applied to both CNN-based and Transformer-based architectures, significantly reduce the computational cost, and improve image quality. Experiments on the public Kodak and Tecnick datasets demonstrate that our method achieves competitive performance with lower computational complexity compared to state-of-the-art image compression methods.
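As an illustration of the cube-style masking idea, here is a minimal sketch that drops both spatial positions and channels of a latent tensor at random; the masking ratios, the latent shape, and the `cube_mask` helper are illustrative assumptions, not the paper's implementation.

```python
import torch

def cube_mask(latent, spatial_ratio=0.5, channel_ratio=0.25):
    """Mask a latent tensor (B, C, H, W) along both spatial and channel axes.

    Spatial positions and channels are dropped independently at random,
    mimicking the cube (space + channel) mask sampling idea; ratios are
    illustrative assumptions.
    """
    b, c, h, w = latent.shape
    # Random spatial mask: 1 keeps a position, 0 drops it.
    spatial_keep = (torch.rand(b, 1, h, w, device=latent.device) > spatial_ratio).float()
    # Random channel mask: 1 keeps a channel, 0 drops it.
    channel_keep = (torch.rand(b, c, 1, 1, device=latent.device) > channel_ratio).float()
    return latent * spatial_keep * channel_keep

# Example: mask a latent produced by an analysis transform during pre-training.
latent = torch.randn(4, 192, 16, 16)
masked = cube_mask(latent)
```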

* 10 pages 

NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds

Apr 13, 2023
Chen Yang, Peihao Li, Zanwei Zhou, Shanxin Yuan, Bingbing Liu, Xiaokang Yang, Weichao Qiu, Wei Shen

We present NeRFVS, a novel neural radiance fields (NeRF) based method that enables free navigation in a room. NeRF achieves impressive performance when rendering novel views similar to the input views, but struggles with novel views that differ significantly from the training views. To address this issue, we utilize holistic priors, including pseudo depth maps and view coverage information, from neural reconstruction to guide the learning of implicit neural representations of 3D indoor scenes. Concretely, an off-the-shelf neural reconstruction method is leveraged to generate a geometry scaffold. Then, two loss functions based on the holistic priors are proposed to improve the learning of NeRF: 1) a robust depth loss that can tolerate errors in the pseudo depth map to guide the geometry learning of NeRF; 2) a variance loss that regularizes the variance of implicit neural representations to reduce geometry and color ambiguity during learning. These two loss functions are modulated during NeRF optimization according to the view coverage information to reduce the negative influence of view coverage imbalance. Extensive results demonstrate that our NeRFVS outperforms state-of-the-art view synthesis methods quantitatively and qualitatively on indoor scenes, achieving high-fidelity free navigation results.
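A minimal sketch of how the two priors-based losses could look, assuming per-ray tensors, a Huber-style robust penalty, and a precomputed coverage-derived weight; the paper's exact formulations and weighting scheme may differ.

```python
import torch
import torch.nn.functional as F

def robust_depth_loss(pred_depth, pseudo_depth, coverage_weight, delta=0.2):
    """Huber-style depth loss that tolerates errors in the pseudo depth map.

    `coverage_weight` is a per-ray weight derived from view coverage; the
    Huber threshold `delta` is an illustrative choice.
    """
    err = F.smooth_l1_loss(pred_depth, pseudo_depth, beta=delta, reduction="none")
    return (coverage_weight * err).mean()

def variance_loss(features, coverage_weight):
    """Penalize the variance of per-ray implicit features to reduce ambiguity."""
    var = features.var(dim=-1)  # variance over the feature dimension
    return (coverage_weight * var).mean()

# Example with dummy per-ray quantities.
n_rays = 1024
loss = (robust_depth_loss(torch.rand(n_rays), torch.rand(n_rays), torch.rand(n_rays))
        + 0.01 * variance_loss(torch.randn(n_rays, 64), torch.rand(n_rays)))
```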

* 10 pages, 7 figures 

Graph Neural Networks in Vision-Language Image Understanding: A Survey

Mar 07, 2023
Henry Senior, Gregory Slabaugh, Shanxin Yuan, Luca Rossi

2D image understanding is a complex problem within Computer Vision, but it holds the key to providing human-level scene comprehension. It goes beyond identifying the objects in an image and instead attempts to understand the scene. Solutions to this problem form the underpinning of a range of tasks, including image captioning, Visual Question Answering (VQA), and image retrieval. Graphs provide a natural way to represent the relational arrangement between objects in an image, and thus in recent years Graph Neural Networks (GNNs) have become a standard component of many 2D image understanding pipelines, especially in the VQA group of tasks. In this survey, we review this rapidly evolving field and provide a taxonomy of the graph types used in 2D image understanding approaches, a comprehensive list of the GNN models used in this domain, and a roadmap of potential future developments. To the best of our knowledge, this is the first comprehensive survey covering image captioning, visual question answering, and image retrieval techniques that use GNNs as the main part of their architecture.

* 19 pages, 5 figures, 6 tables 

Low-Light Video Enhancement with Synthetic Event Guidance

Aug 23, 2022
Lin Liu, Junfeng An, Jianzhuang Liu, Shanxin Yuan, Xiangyu Chen, Wengang Zhou, Houqiang Li, Yanfeng Wang, Qi Tian

Low-light video enhancement (LLVE) is an important yet challenging task with many applications, such as photography and autonomous driving. Unlike single-image low-light enhancement, most LLVE methods utilize temporal information from adjacent frames to restore the color and remove the noise of the target frame. However, these algorithms, based on the framework of multi-frame alignment and enhancement, may produce multi-frame fusion artifacts when encountering extreme low light or fast motion. In this paper, inspired by the low latency and high dynamic range of events, we use synthetic events from multiple frames to guide the enhancement and restoration of low-light videos. Our method contains three stages: 1) event synthesis and enhancement, 2) event and image fusion, and 3) low-light enhancement. In this framework, we design two novel modules (event-image fusion transform and event-guided dual branch) for the second and third stages, respectively. Extensive experiments show that our method outperforms existing low-light video and single-image enhancement approaches on both synthetic and real LLVE datasets.
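To illustrate the event-synthesis stage, here is a rough sketch that derives event polarities from log-intensity differences between adjacent frames; the threshold and this simple event model are assumptions, not the method's actual synthesis procedure.

```python
import torch

def synthesize_events(frames, threshold=0.1, eps=1e-3):
    """Approximate an event map from consecutive video frames.

    An event fires where the log-intensity change between adjacent frames
    exceeds `threshold`; polarity is the sign of the change. Threshold and
    grayscale assumption are illustrative.
    """
    # frames: (T, 1, H, W) grayscale intensities in [0, 1]
    log_i = torch.log(frames + eps)
    diff = log_i[1:] - log_i[:-1]                      # (T-1, 1, H, W)
    polarity = torch.sign(diff) * (diff.abs() > threshold).float()
    return polarity                                    # -1, 0, or +1 per pixel

events = synthesize_events(torch.rand(8, 1, 64, 64))
```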


Disentangling 3D Attributes from a Single 2D Image: Human Pose, Shape and Garment

Aug 05, 2022
Xue Hu, Xinghui Li, Benjamin Busam, Yiren Zhou, Ales Leonardis, Shanxin Yuan

For visual manipulation tasks, we aim to represent image content with semantically meaningful features. However, learning implicit representations from images often lacks interpretability, especially when attributes are intertwined. We focus on the challenging task of extracting disentangled 3D attributes only from 2D image data. Specifically, we focus on human appearance and learn implicit pose, shape and garment representations of dressed humans from RGB images. Our method learns an embedding with disentangled latent representations of these three image properties and enables meaningful re-assembling of features and property control through a 2D-to-3D encoder-decoder structure. The 3D model is inferred solely from the feature map in the learned embedding space. To the best of our knowledge, our method is the first to achieve cross-domain disentanglement for this highly under-constrained problem. We qualitatively and quantitatively demonstrate our framework's ability to transfer pose, shape, and garments in 3D reconstruction on virtual data and show how an implicit shape loss can benefit the model's ability to recover fine-grained reconstruction details.
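A toy sketch of attribute transfer in a disentangled embedding, assuming the latent space is partitioned into named pose, shape, and garment codes; the partitioning, key names, and dimensions are hypothetical.

```python
import torch

def swap_attribute(latents_a, latents_b, attribute):
    """Re-assemble disentangled latents by transferring one attribute.

    `latents_*` are dicts with keys "pose", "shape", "garment"; the key names
    are hypothetical stand-ins for how the embedding might be partitioned.
    """
    mixed = dict(latents_a)
    mixed[attribute] = latents_b[attribute]
    return mixed

# Example: take the garment code from image B while keeping A's pose and shape.
a = {"pose": torch.randn(1, 32), "shape": torch.randn(1, 32), "garment": torch.randn(1, 64)}
b = {"pose": torch.randn(1, 32), "shape": torch.randn(1, 32), "garment": torch.randn(1, 64)}
mixed = swap_attribute(a, b, "garment")
```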


SJ-HD^2R: Selective Joint High Dynamic Range and Denoising Imaging for Dynamic Scenes

Jun 20, 2022
Wei Li, Shuai Xiao, Tianhong Dai, Shanxin Yuan, Tao Wang, Cheng Li, Fenglong Song

Ghosting artifacts, motion blur, and low fidelity in highlights are the main challenges in High Dynamic Range (HDR) imaging from multiple Low Dynamic Range (LDR) images. These issues stem from using the medium-exposed image as the reference frame in previous methods. Instead, we propose to use the under-exposed image as the reference to avoid them. However, the heavy noise in the dark regions of the under-exposed image becomes a new problem. Therefore, we propose a joint HDR and denoising pipeline containing two sub-networks: (i) a pre-denoising network (PreDNNet) that adaptively denoises input LDRs by exploiting exposure priors; (ii) a pyramid cascading fusion network (PCFNet) that introduces an attention mechanism and a cascading structure in a multi-scale manner. To further leverage these two paradigms, we propose a selective and joint HDR and denoising (SJ-HD$^2$R) imaging framework, utilizing scenario-specific priors to conduct the path selection with an accuracy of more than 93.3$\%$. We create the first joint HDR and denoising benchmark dataset, which contains a variety of challenging HDR and denoising scenes and supports switching of the reference image. Extensive experimental results show that our method achieves superior performance to previous methods.
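A schematic sketch of the path-selection idea, assuming a crude noise estimate on the under-exposed reference drives the choice between the two branches; the actual selection relies on scenario-specific priors and is not reproduced here.

```python
import torch

def select_path(noise_level, noise_threshold=0.05):
    """Choose between the denoise-then-fuse path and the direct fusion path.

    The noise estimator and threshold below are illustrative assumptions used
    only to show the branching logic.
    """
    return "denoise_then_fuse" if noise_level > noise_threshold else "fuse_only"

# Example: use the std of a high-pass residual as a rough noise proxy.
ref = torch.rand(1, 3, 128, 128)
residual = ref - torch.nn.functional.avg_pool2d(ref, 3, stride=1, padding=1)
path = select_path(residual.std().item())
```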


TAPE: Task-Agnostic Prior Embedding for Image Restoration

Mar 11, 2022
Lin Liu, Lingxi Xie, Xiaopeng Zhang, Shanxin Yuan, Xiangyu Chen, Wengang Zhou, Houqiang Li, Qi Tian

Learning a generalized prior for natural image restoration is an important yet challenging task. Early methods mostly relied on handcrafted priors, including normalized sparsity, L0 gradients, dark channel priors, etc. Recently, deep neural networks have been used to learn various image priors but are not guaranteed to generalize. In this paper, we propose a novel approach that embeds a task-agnostic prior into a transformer. Our approach, named Task-Agnostic Prior Embedding (TAPE), consists of three stages, namely task-agnostic pre-training, task-agnostic fine-tuning, and task-specific fine-tuning, where the first stage embeds prior knowledge about natural images into the transformer and the latter two extract this knowledge to assist downstream image restoration. Experiments on various types of degradation validate the effectiveness of TAPE. The image restoration performance in terms of PSNR is improved by as much as 1.45 dB and even outperforms task-specific algorithms. More importantly, TAPE shows the ability to disentangle generalized image priors from degraded images, which enjoys favorable transferability to unknown downstream tasks.
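A skeleton of the three-stage schedule, written as a plain training loop; the loader names, loss choice, and step counts are illustrative assumptions rather than the paper's recipe.

```python
import torch

def train_three_stages(model, prior_head, restore_head, loaders, steps=(1000, 500, 500)):
    """Sketch of a three-stage schedule: task-agnostic pre-training,
    task-agnostic fine-tuning, task-specific fine-tuning."""
    opt = torch.optim.Adam(
        list(model.parameters()) + list(prior_head.parameters())
        + list(restore_head.parameters()), lr=1e-4)
    l1 = torch.nn.L1Loss()

    # Stage 1: task-agnostic pre-training on clean natural images.
    for _, clean in zip(range(steps[0]), loaders["clean"]):
        loss = l1(prior_head(model(clean)), clean)
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: task-agnostic fine-tuning on a mixture of degradations.
    for _, (degraded, clean) in zip(range(steps[1]), loaders["mixed"]):
        loss = l1(restore_head(model(degraded)), clean)
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 3: task-specific fine-tuning on the target degradation.
    for _, (degraded, clean) in zip(range(steps[2]), loaders["target"]):
        loss = l1(restore_head(model(degraded)), clean)
        opt.zero_grad(); loss.backward(); opt.step()
```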


SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers

Dec 17, 2021
Lin Liu, Shanxin Yuan, Jianzhuang Liu, Xin Guo, Youliang Yan, Qi Tian

We propose a novel zero-shot multi-frame image restoration method for removing unwanted obstruction elements (such as rain, snow, and moire patterns) that vary across successive frames. It has three stages: transformer pre-training, zero-shot restoration, and hard patch refinement. Using the pre-trained transformers, our model is able to tell the motion difference between the true image information and the obstructing elements. For zero-shot image restoration, we design a novel model, termed SiamTrans, which is constructed from Siamese transformers, encoders, and decoders. Each transformer has a temporal attention layer and several self-attention layers, to capture both temporal and spatial information of multiple frames. Pre-trained (self-supervised) only on the denoising task, SiamTrans is tested on three different low-level vision tasks (deraining, demoireing, and desnowing). Compared with related methods, ours achieves the best performance, even outperforming those using supervised learning.
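A minimal sketch of a temporal attention layer operating across frames, as one plausible realization of the temporal/spatial split described above; the real SiamTrans layer layout is not reproduced.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Attention across the frame (time) axis for features of shape (B, T, N, C).

    A minimal sketch of the temporal-attention idea; head count and shapes
    are illustrative assumptions.
    """
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, t, n, c = x.shape
        # Fold spatial tokens into the batch so attention runs over frames only.
        x = x.permute(0, 2, 1, 3).reshape(b * n, t, c)
        out, _ = self.attn(x, x, x)
        return out.reshape(b, n, t, c).permute(0, 2, 1, 3)

feat = torch.randn(2, 3, 64, 128)   # 2 clips, 3 frames, 64 tokens, 128 channels
out = TemporalAttention(128)(feat)
```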

* AAAI 2022  

Wavelet-Based Network For High Dynamic Range Imaging

Aug 03, 2021
Tianhong Dai, Wei Li, Xilei Cao, Jianzhuang Liu, Xu Jia, Ales Leonardis, Youliang Yan, Shanxin Yuan

High dynamic range (HDR) imaging from multiple low dynamic range (LDR) images suffers from ghosting artifacts caused by scene and object motion. Existing methods, such as optical flow based and end-to-end deep learning based solutions, are error-prone in either detail restoration or ghosting artifact removal. Comprehensive empirical evidence shows that ghosting artifacts caused by large foreground motion are mainly low-frequency signals, while details are mainly high-frequency signals. In this work, we propose a novel frequency-guided end-to-end deep neural network (FHDRNet) to conduct HDR fusion in the frequency domain, where the Discrete Wavelet Transform (DWT) is used to decompose the inputs into different frequency bands. The low-frequency signals are used to avoid ghosting artifacts, while the high-frequency signals are used to preserve details. Using a U-Net as the backbone, we propose two novel modules: a merging module and a frequency-guided upsampling module. The merging module applies an attention mechanism to the low-frequency components to deal with the ghosting caused by large foreground motion. The frequency-guided upsampling module reconstructs details from multiple frequency-specific components with rich details. In addition, a new RAW dataset is created for training and evaluating multi-frame HDR imaging algorithms in the RAW domain. Extensive experiments conducted on public datasets and our RAW dataset show that the proposed FHDRNet achieves state-of-the-art performance.
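For reference, a one-level Haar DWT split into low- and high-frequency bands could look like the following sketch; the actual wavelet choice and decomposition depth in FHDRNet may differ.

```python
import torch
import torch.nn.functional as F

def haar_dwt(x):
    """One-level Haar DWT of an image batch (B, C, H, W).

    Returns the low-frequency band (LL) and the stacked high-frequency bands
    (LH, HL, HH); a minimal stand-in for splitting ghost-prone low frequencies
    from detail-carrying high frequencies.
    """
    b, c, h, w = x.shape
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)   # (4, 1, 2, 2)
    kernels = kernels.repeat(c, 1, 1, 1).to(x)             # (4C, 1, 2, 2)
    out = F.conv2d(x, kernels, stride=2, groups=c)         # (B, 4C, H/2, W/2)
    out = out.view(b, c, 4, h // 2, w // 2)
    return out[:, :, 0], out[:, :, 1:]                     # LL, (LH, HL, HH)

low, high = haar_dwt(torch.rand(1, 3, 64, 64))
```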


Self-Adaptively Learning to Demoire from Focused and Defocused Image Pairs

Nov 05, 2020
Lin Liu, Shanxin Yuan, Jianzhuang Liu, Liping Bao, Gregory Slabaugh, Qi Tian

Moire artifacts are common in digital photography, resulting from interference between high-frequency scene content and the color filter array of the camera. Existing deep learning-based demoireing methods trained on large-scale datasets are limited in handling various complex moire patterns, and mainly focus on demoireing photos of digital displays. Moreover, obtaining moire-free ground truth in natural scenes is difficult but needed for training. In this paper, we propose a self-adaptive learning method for demoireing a high-frequency image with the help of an additional defocused, moire-free blur image. Given an image degraded with moire artifacts and a moire-free blur image, our network predicts a moire-free clean image and a blur kernel with a self-adaptive strategy that does not require an explicit training stage, instead performing test-time adaptation. Our model has two sub-networks and works iteratively. During each iteration, one sub-network takes the moire image as input, removing moire patterns and restoring image details, while the other sub-network estimates the blur kernel from the blur image. The two sub-networks are jointly optimized. Extensive experiments demonstrate that our method outperforms state-of-the-art methods and can produce high-quality demoired results. It generalizes well to the task of removing moire artifacts caused by display screens. In addition, we build a new moire dataset that includes images with screen and texture moire artifacts. As far as we know, this is the first dataset with real texture moire patterns.
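A rough sketch of the test-time adaptation loop, assuming the blur image can be reproduced by convolving the predicted clean image with the predicted kernel; the network definitions, loss, kernel shape, and iteration count are placeholders rather than the paper's setup.

```python
import torch
import torch.nn.functional as F

def self_adaptive_demoire(demoire_net, kernel_net, moire_img, blur_img, iters=200, lr=1e-4):
    """Test-time adaptation in the spirit of the two-sub-network scheme above.

    One network predicts the clean image from the moire input, the other
    predicts a blur kernel from the defocused image; both are optimized so
    that blurring the predicted clean image reproduces the moire-free blur
    image. Assumes batch size 1 and an odd kernel size.
    """
    params = list(demoire_net.parameters()) + list(kernel_net.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    l1 = torch.nn.L1Loss()
    for _ in range(iters):
        clean = demoire_net(moire_img)            # predicted moire-free image
        kernel = kernel_net(blur_img)             # predicted blur kernel, (1, 1, k, k)
        pad = kernel.shape[-1] // 2
        weight = kernel.repeat(clean.shape[1], 1, 1, 1)   # one kernel per color channel
        reblurred = F.conv2d(clean, weight, padding=pad, groups=clean.shape[1])
        loss = l1(reblurred, blur_img)
        opt.zero_grad(); loss.backward(); opt.step()
    return demoire_net(moire_img).detach()
```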

* Accepted to NeurIPS 2020. Project page: "http://home.ustc.edu.cn/~ll0825/project_FDNet.html" 