Abstract: This paper introduces Locally Adaptive Neural Context Estimation (LANCE), a novel extension for overfitted image compression (OIC) frameworks like Cool-Chic. While traditional OIC methods rely on lightweight autoregressive networks with globally signaled parameters, they struggle with non-stationary image statistics. LANCE addresses this by incorporating a forward-signaled spatial hyperprior that enables regional adaptation of the entropy model. To minimize overhead, we employ a predictive coding scheme that combines a static Median Edge Detector (MED) with a lightweight learned context model. Experiments demonstrate that LANCE achieves BD-rate reductions of 1.40% on the Kodak dataset and 1.97% on CLIC 2020 over Cool-Chic 4.0 at the high end of our decoder complexity range of 606 to 1481 MAC/pixel. At the low end of the complexity range, we outperform Cool-Chic 4.0 by 2.41% and 2.99% on Kodak and CLIC, respectively. Qualitative analysis reveals that the learned spatial hyperprior effectively segments image regions into areas of similar image statistics, providing an automated, content-aware adaptation layer.
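
The static MED predictor named in the abstract is the classic predictor from LOCO-I / JPEG-LS, and a minimal sketch of it may help make the predictive coding step concrete. The function names `med_predict` and `med_residuals`, the zero padding at the image border, and the use of plain integer samples are assumptions made for illustration; the abstract does not say how LANCE applies MED to its hyperprior or how the learned context model is combined with it.

```python
def med_predict(left, above, above_left):
    """Median Edge Detector: predict a sample from three causal neighbours."""
    if above_left >= max(left, above):
        # Neighbours suggest an edge; fall back to the smaller neighbour.
        return min(left, above)
    if above_left <= min(left, above):
        # Edge in the other direction; fall back to the larger neighbour.
        return max(left, above)
    # Smooth region: planar prediction.
    return left + above - above_left


def med_residuals(image):
    """Return MED prediction residuals for a 2-D list of ints (raster scan).

    Assumption: neighbours outside the image are treated as 0.
    """
    height, width = len(image), len(image[0])
    residuals = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            left = image[y][x - 1] if x > 0 else 0
            above = image[y - 1][x] if y > 0 else 0
            above_left = image[y - 1][x - 1] if x > 0 and y > 0 else 0
            residuals[y][x] = image[y][x] - med_predict(left, above, above_left)
    return residuals
```

Because the predictor is fixed, it adds no signaled parameters; only the residuals (and whatever the learned model contributes) need to be entropy coded.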




Abstract: This paper examines the rate-distortion-complexity trade-offs of modern neural video coding. Much recent research has explored the full potential of neural video coding, and conditional autoencoders have emerged as the mainstream approach. Their central idea is to leverage both spatial and temporal information for better conditional coding. However, a recent study indicates that conditional coding may suffer from information bottlenecks, potentially performing worse than traditional residual coding. To address this issue, recent conditional coding methods incorporate a large number of high-resolution features as the condition signal, which considerably increases the number of multiply-accumulate operations, the memory footprint, and the model size. Taking DCVC as the common code base, we investigate how the recently proposed conditional residual coding and its variants may strike a better balance among rate, distortion, and complexity.
Abstract: Conditional coding has lately emerged as the mainstream approach to learned video compression. However, a recent study shows that it may perform worse than residual coding when an information bottleneck arises. Conditional residual coding was thus proposed, creating a new school of thought to improve on conditional coding. Notably, conditional residual coding relies heavily on the assumption that the residual frame has a lower entropy rate than the intra frame. Recognizing that this assumption does not always hold because of dis-occlusion or unreliable motion estimates, we propose a masked conditional residual coding scheme. It learns a soft mask to form a hybrid of conditional coding and conditional residual coding in a pixel-adaptive manner. We also introduce a Transformer-based conditional autoencoder and investigate several strategies for conditioning it for inter-frame coding, a topic that is largely under-explored. Additionally, we propose a channel transform module (CTM) to decorrelate the image latents along the channel dimension, so that a simple hyperprior can approach the compression performance of the channel-wise autoregressive model. Experimental results confirm the superiority of our masked conditional residual transformer (termed MaskCRT) to both conditional coding and conditional residual coding. On commonly used datasets, MaskCRT shows comparable BD-rate results to VTM-17.0 under the low-delay P configuration in terms of PSNR-RGB. It also opens up a new research direction for advancing learned video compression.
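
One way to read the soft-mask hybrid described above is that the signal being coded is the frame minus a mask-weighted prediction. The sketch below illustrates that reading only. The function names, the nested-list frame representation, and the exact form `frame - mask * prediction` are assumptions for illustration, and the abstract does not specify how the mask is learned or signaled.

```python
def masked_residual(frame, prediction, mask):
    """Form the signal to be coded: frame - mask * prediction, per pixel.

    Assumed interpretation: mask = 0 leaves the frame untouched (pure
    conditional coding), mask = 1 subtracts the full prediction
    (conditional residual coding), and values in between blend the two.
    """
    return [
        [x - m * p for x, p, m in zip(frame_row, pred_row, mask_row)]
        for frame_row, pred_row, mask_row in zip(frame, prediction, mask)
    ]


def restore_frame(coded, prediction, mask):
    """Invert masked_residual at the decoder: coded + mask * prediction."""
    return [
        [c + m * p for c, p, m in zip(coded_row, pred_row, mask_row)]
        for coded_row, pred_row, mask_row in zip(coded, prediction, mask)
    ]


# Toy one-row example with a mask of 0, 1, and 0.5.
frame = [[4.0, 6.0, 8.0]]
prediction = [[3.0, 5.0, 6.0]]
mask = [[0.0, 1.0, 0.5]]
coded = masked_residual(frame, prediction, mask)
restored = restore_frame(coded, prediction, mask)
```

In this toy case the first pixel is coded as-is, the second as a plain residual, and the third as a half-weighted residual, which is the pixel-adaptive behaviour the abstract attributes to the soft mask.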