Siwei Ma

Point Cloud-Assisted Neural Image Compression

Dec 16, 2024

Ziqun Li, Qi Zhang, Xiaofeng Huang, Zhao Wang, Siwei Ma, Wei Yan

Abstract: Highly efficient image compression is a critical requirement. In scenarios where different sensors capture multiple modalities of data, existing image-only codecs do not fully leverage the auxiliary information from the other modalities, leading to suboptimal compression efficiency. In this paper, we improve image compression performance with the assistance of point clouds, which are widely adopted in autonomous driving. We first unify the data representation for both modalities to facilitate data processing. Then, we propose the point cloud-assisted neural image codec (PCA-NIC) to enhance the preservation of image texture and structure by utilizing the high-dimensional point cloud information. We further introduce a multi-modal feature fusion transform module (MMFFT) to capture more representative image features and to remove redundant information between channels and modalities that is not relevant to the image content. Our work is the first to improve image compression performance using point clouds and achieves state-of-the-art performance.

Advanced Learning-Based Inter Prediction for Future Video Coding

Nov 24, 2024

Yanchen Zhao, Wenhong Duan, Chuanmin Jia, Shanshe Wang, Siwei Ma

Abstract: In the fourth-generation Audio Video coding Standard (AVS4), the Inter Prediction Filter (INTERPF) reduces discontinuities between prediction and adjacent reconstructed pixels in inter prediction. The paper proposes a low-complexity learning-based inter prediction (LLIP) method to replace the traditional INTERPF. LLIP enhances the filtering process by leveraging a lightweight neural network model whose parameters can be exported for efficient inference. Specifically, we extract the pixels and coordinates utilized by the traditional INTERPF to form the training dataset. Subsequently, we export the weights and biases of the trained neural network model and implement the inference process without any third-party dependency, enabling seamless integration into a video codec without relying on Libtorch and thus achieving faster inference speed. Ultimately, we replace the traditional handcrafted filtering parameters in INTERPF with the learned optimal filtering parameters. This practical solution makes the combination of deep learning encoding tools with traditional video encoding schemes more efficient. Experimental results show that our approach achieves 0.01%, 0.31%, and 0.25% coding gain for the Y, U, and V components under the random access (RA) configuration on average.
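The idea of exporting weights and biases and running inference without a deep learning runtime can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the 2-4-1 layer sizes, the ReLU activation, the parameter values, and the names `dense`, `relu`, and `predict` are hypothetical and are not taken from the paper.

```python
# Hypothetical exported parameters of a tiny 2-4-1 network (values are made up).
W1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8], [0.2, 0.2]]  # 4 hidden units x 2 inputs
B1 = [0.0, 0.1, 0.0, -0.1]
W2 = [[0.25, 0.25, 0.25, 0.25]]                          # 1 output x 4 hidden units
B2 = [0.0]


def dense(weights, biases, x):
    """Fully connected layer: one dot product plus bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def predict(x):
    """Forward pass using only built-in arithmetic, no external library."""
    return dense(W2, B2, relu(dense(W1, B1, x)))[0]


print(predict([1.0, 1.0]))
```

Because the forward pass is plain arithmetic over constant arrays, the same logic could be ported into a codec's own source code with no additional runtime to link against.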

Compact Visual Data Representation for Green Multimedia -- A Human Visual System Perspective

Nov 21, 2024

Peilin Chen, Xiaohan Fang, Meng Wang, Shiqi Wang, Siwei Ma

Abstract: The Human Visual System (HVS), with its intricate sophistication, is capable of achieving ultra-compact information compression for visual signals. This remarkable ability is coupled with high generalization capability and energy efficiency. By contrast, the state-of-the-art Versatile Video Coding (VVC) standard achieves a compression ratio of around 1,000 times for raw visual data. This notable disparity motivates the research community to draw inspiration from the HVS to handle the immense volume of visual data effectively and in a green way. Therefore, this paper provides a survey of how visual data can be efficiently represented for green multimedia, in particular when the ultimate task is knowledge extraction instead of visual signal reconstruction. We introduce recent research efforts that promote green, sustainable, and efficient multimedia in this field. Moreover, we discuss how a deep understanding of the HVS can benefit the research community, and envision the development of future green multimedia technologies.

Frequency Decomposition-Driven Unsupervised Domain Adaptation for Remote Sensing Image Semantic Segmentation

Apr 06, 2024

Xianping Ma, Xiaokang Zhang, Xingchen Ding, Man-On Pun, Siwei Ma

Abstract: Cross-domain semantic segmentation of remote sensing (RS) imagery based on unsupervised domain adaptation (UDA) techniques has significantly advanced deep-learning applications in the geosciences. Recently, with its ingenious and versatile architecture, the Transformer model has been successfully applied in RS-UDA tasks. However, existing UDA methods mainly focus on domain alignment in the high-level feature space. It is still challenging to retain cross-domain local spatial details and global contextual semantics simultaneously, which is crucial for the RS image semantic segmentation task. To address these problems, we propose novel high/low-frequency decomposition (HLFD) techniques to guide representation alignment in cross-domain semantic segmentation. Specifically, HLFD decomposes the feature maps into high- and low-frequency components before performing the domain alignment in the corresponding subspaces. In addition, to further facilitate the alignment of decomposed features, we propose a fully global-local generative adversarial network, namely GLGAN, to learn domain-invariant detailed and semantic features across domains by leveraging global-local transformer blocks (GLTBs). By integrating HLFD techniques and the GLGAN, a novel UDA framework called FD-GLGAN is developed to improve the cross-domain transferability and generalization capability of semantic segmentation models. Extensive experiments on two fine-resolution benchmark datasets, namely ISPRS Potsdam and ISPRS Vaihingen, highlight the effectiveness and superiority of the proposed approach compared to the state-of-the-art UDA methods. The source code for this work will be accessible at https://github.com/sstary/SSRS.

* 28 pages, 13 figures
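As a rough illustration of splitting a signal into low- and high-frequency parts, the sketch below smooths a 1D signal with a moving average and takes the remainder as the high-frequency component. This is an assumption-laden stand-in: the paper operates on feature maps and its actual decomposition may differ; the box filter, the edge handling, and the names `low_pass` and `decompose` are hypothetical.

```python
def low_pass(signal, radius=1):
    """Moving-average smoothing with edge samples replicated at the borders."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def decompose(signal, radius=1):
    """Split a signal into a smooth (low) part and a detail (high) part."""
    low = low_pass(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


signal = [0.0, 0.0, 3.0, 0.0, 0.0]
low, high = decompose(signal)
print(low)   # smooth component
print(high)  # detail component; low + high gives back the input
```

The two components sum back to the input, so each can be processed separately without losing information.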

Mirror-3DGS: Incorporating Mirror Reflections into 3D Gaussian Splatting

Apr 01, 2024

Jiarui Meng, Haijie Li, Yanmin Wu, Qiankun Gao, Shuzhou Yang, Jian Zhang, Siwei Ma

Abstract: 3D Gaussian Splatting (3DGS) has marked a significant breakthrough in the realm of 3D scene reconstruction and novel view synthesis. However, 3DGS, much like its predecessor Neural Radiance Fields (NeRF), struggles to accurately model physical reflections, particularly in mirrors, which are ubiquitous in real-world scenes. As a result, reflections are mistakenly treated as separate entities that physically exist, leading to inaccurate reconstructions and inconsistent reflective properties across varied viewpoints. To address this pivotal challenge, we introduce Mirror-3DGS, an innovative rendering framework devised to master the intricacies of mirror geometries and reflections, paving the way for the generation of realistically depicted mirror reflections. By ingeniously incorporating mirror attributes into 3DGS and leveraging the principle of plane mirror imaging, Mirror-3DGS crafts a mirrored viewpoint to observe from behind the mirror, enriching the realism of scene renderings. Extensive assessments, spanning both synthetic and real-world scenes, showcase our method's ability to render novel views with enhanced fidelity in real time, surpassing the state-of-the-art Mirror-NeRF specifically within the challenging mirror regions. Our code will be made publicly available for reproducible research.

* 22 pages, 7 figures
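The principle of plane mirror imaging mentioned in the abstract amounts to reflecting the camera across the mirror plane. The sketch below shows only that geometric step, assuming the plane is given as n·x + d = 0 with a unit normal n; the function names are hypothetical and nothing here reproduces the paper's rendering pipeline.

```python
def reflect_point(p, n, d):
    """Reflect point p across the plane n.x + d = 0 (n must be unit length)."""
    signed_dist = sum(pi * ni for pi, ni in zip(p, n)) + d
    return tuple(pi - 2.0 * signed_dist * ni for pi, ni in zip(p, n))


def reflect_direction(v, n):
    """Reflect a direction vector v about a plane with unit normal n."""
    dot = sum(vi * ni for vi, ni in zip(v, n))
    return tuple(vi - 2.0 * dot * ni for vi, ni in zip(v, n))


# Mirror lying in the plane z = 1, camera at height 3 looking straight down.
normal, offset = (0.0, 0.0, 1.0), -1.0
mirrored_position = reflect_point((0.0, 0.0, 3.0), normal, offset)
mirrored_view_dir = reflect_direction((0.0, 0.0, -1.0), normal)
print(mirrored_position, mirrored_view_dir)
```

The mirrored camera sits at the same distance behind the mirror as the real camera sits in front of it, with its viewing direction flipped about the plane.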

Unifying Generation and Compression: Ultra-low bitrate Image Coding Via Multi-stage Transformer

Mar 06, 2024

Naifu Xue, Qi Mao, Zijian Wang, Yuan Zhang, Siwei Ma

Abstract: Recent progress in generative compression technology has significantly improved the perceptual quality of compressed data. However, these advancements primarily focus on producing high-frequency details, often overlooking the ability of generative models to capture the prior distribution of image content, thus impeding further bitrate reduction in extreme compression scenarios (<0.05 bpp). Motivated by the capabilities of predictive language models for lossless compression, this paper introduces a novel Unified Image Generation-Compression (UIGC) paradigm, merging the processes of generation and compression. A key feature of the UIGC framework is the adoption of vector-quantized (VQ) image models for tokenization, alongside a multi-stage transformer designed to exploit spatial contextual information for modeling the prior distribution. As such, the dual-purpose framework effectively utilizes the learned prior for entropy estimation and assists in the regeneration of lost tokens. Extensive experiments demonstrate the superiority of the proposed UIGC framework over existing codecs in perceptual quality and human perception, particularly in ultra-low bitrate scenarios (<=0.03 bpp), pioneering a new direction in generative compression.
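Using a prior for entropy estimation rests on a standard fact: a token with probability p costs about -log2(p) bits under an ideal entropy coder. The sketch below applies this with a fixed, made-up token distribution; in the paper the prior is produced by a transformer conditioned on context, which is not modelled here, and the name `ideal_code_length_bits` is hypothetical.

```python
import math


def ideal_code_length_bits(tokens, prior):
    """Sum of -log2 p(token): the bit cost under an ideal entropy coder."""
    return sum(-math.log2(prior[t]) for t in tokens)


# Hypothetical static prior over a three-token vocabulary.
prior = {"a": 0.5, "b": 0.25, "c": 0.25}
tokens = ["a", "a", "b", "c"]
print(ideal_code_length_bits(tokens, prior))
```

The better the prior predicts the actual tokens, the higher their probabilities and the fewer bits are needed.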

SPC-NeRF: Spatial Predictive Compression for Voxel Based Radiance Field

Feb 26, 2024

Zetian Song, Wenhong Duan, Yuhuai Zhang, Shiqi Wang, Siwei Ma, Wen Gao

Abstract: Representing the Neural Radiance Field (NeRF) with the explicit voxel grid (EVG) is a promising direction for improving NeRFs. However, the EVG representation is not efficient for storage and transmission because of its enormous memory cost. Current methods for compressing EVG mainly inherit the methods designed for neural network compression, such as pruning and quantization, which do not take full advantage of the spatial correlation of voxels. Inspired by mature digital image compression techniques, this paper proposes SPC-NeRF, a novel framework applying spatial predictive coding in EVG compression. The proposed framework can remove spatial redundancy efficiently for better compression performance. Moreover, we model the bitrate and design a novel form of the loss function, where we can jointly optimize compression ratio and distortion to achieve higher coding efficiency. Extensive experiments demonstrate that our method can achieve 32% bit saving compared to the state-of-the-art method VQRF on multiple representative test datasets, with comparable training time.
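Spatial predictive coding, in its simplest form, predicts each sample from an already coded neighbour and stores only the prediction error. The sketch below uses a previous-sample predictor on a 1D run of integer values as an illustration; the predictor, the 1D layout, and the function names are assumptions and do not reflect the predictor used in SPC-NeRF.

```python
def encode_residuals(values):
    """Predict each value from its predecessor and keep only the error."""
    residuals = []
    prev = 0  # the first sample is predicted from zero
    for v in values:
        residuals.append(v - prev)
        prev = v
    return residuals


def decode_residuals(residuals):
    """Invert the prediction by accumulating the stored errors."""
    values = []
    prev = 0
    for r in residuals:
        prev = prev + r
        values.append(prev)
    return values


voxels = [100, 102, 101, 105]
residuals = encode_residuals(voxels)
print(residuals)
print(decode_residuals(residuals) == voxels)
```

When neighbouring values are correlated, the residuals cluster near zero and are cheaper to entropy-code than the raw values.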

A Neural-network Enhanced Video Coding Framework beyond ECM

Feb 21, 2024

Yanchen Zhao, Wenxuan He, Chuanmin Jia, Qizhe Wang, Junru Li, Yue Li, Chaoyi Lin, Kai Zhang, Li Zhang, Siwei Ma

Abstract: In this paper, a hybrid video compression framework is proposed that serves as a demonstrative showcase of deep learning-based approaches extending beyond the confines of traditional coding methodologies. The proposed hybrid framework is founded upon the Enhanced Compression Model (ECM), which is a further enhancement of the Versatile Video Coding (VVC) standard. We have augmented the latest ECM reference software with well-designed coding techniques, including block partitioning, a deep learning-based loop filter, and the activation of block importance mapping (BIM), which was integrated but previously inactive within ECM, further enhancing coding performance. Compared with ECM-10.0, our method achieves 6.26%, 13.33%, and 12.33% BD-rate savings for the Y, U, and V components under the random access (RA) configuration, respectively.

Scalable Face Image Coding via StyleGAN Prior: Towards Compression for Human-Machine Collaborative Vision

Dec 25, 2023

Qi Mao, Chongyu Wang, Meng Wang, Shiqi Wang, Ruijie Chen, Libiao Jin, Siwei Ma

Abstract: The accelerated proliferation of visual content and the rapid development of machine vision technologies bring significant challenges in delivering visual data on a gigantic scale, which must be effectively represented to satisfy both human and machine requirements. In this work, we investigate how hierarchical representations derived from the advanced generative prior facilitate constructing an efficient scalable coding paradigm for human-machine collaborative vision. Our key insight is that by exploiting the StyleGAN prior, we can learn three-layered representations encoding hierarchical semantics, which are elaborately designed into the basic, middle, and enhanced layers, supporting machine intelligence and human visual perception in a progressive fashion. With the aim of achieving efficient compression, we propose the layer-wise scalable entropy transformer to reduce the redundancy between layers. Based on the multi-task scalable rate-distortion objective, the proposed scheme is jointly optimized to achieve optimal machine analysis performance, human perception experience, and compression ratio. We validate the proposed paradigm's feasibility in face image compression. Extensive qualitative and quantitative experimental results demonstrate the superiority of the proposed paradigm over the latest compression standard Versatile Video Coding (VVC) in terms of both machine analysis and human perception at extremely low bitrates (<0.01 bpp), offering new insights for human-machine collaborative compression.

* Accepted by IEEE TIP

Spatial-Temporal Transformer based Video Compression Framework

Sep 21, 2023

Yanbo Gao, Wenjia Huang, Shuai Li, Hui Yuan, Mao Ye, Siwei Ma

Abstract: Learned video compression (LVC) has witnessed remarkable advancements in recent years. Similar to traditional video coding, LVC inherits motion estimation/compensation, residual coding, and other modules, all of which are implemented with neural networks (NNs). However, within the framework of NNs and their training mechanism using gradient backpropagation, most existing works struggle to consistently generate stable motion information, which is in the form of geometric features, from the input color features. Moreover, modules such as inter prediction and residual coding are independent of each other, making it difficult to fully reduce the spatial-temporal redundancy. To address the above problems, in this paper, we propose a novel Spatial-Temporal Transformer based Video Compression (STT-VC) framework. It contains a Relaxed Deformable Transformer (RDT) with Uformer-based offset estimation for motion estimation and compensation, a Multi-Granularity Prediction (MGP) module based on multi-reference frames for prediction refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T) for efficient temporal-spatial joint residual compression. Specifically, RDT is developed to stably estimate the motion information between frames by thoroughly investigating the relationship between the similarity-based geometric motion feature extraction and self-attention. MGP is designed to fuse the multi-reference frame information by effectively exploring the coarse-grained prediction feature generated with the coded motion information. SFD-T compresses the residual information by jointly exploring the spatial feature distributions in both the residual and the temporal prediction to further reduce the spatial-temporal redundancy. Experimental results demonstrate that our method achieves the best result with 13.5% BD-Rate saving over VTM.
