Friedrich-Alexander Universität Erlangen-Nürnberg
Abstract:Retrieving the reflectance spectrum from objects is an essential task for many classification and detection problems, since many materials and processes have a unique spectral behaviour. In many cases, it is highly desirable to capture hyperspectral images due to the high spectral flexibility. Often, it is even necessary to capture hyperspectral videos or at least to be able to record a hyperspectral image at once, also called snapshot hyperspectral imaging, to avoid spectral smearing. For this task, a high-resolution snapshot hyperspectral camera array using a hexagonal shape is introduced.The hexagonal array for hyperspectral imaging uses off-the-shelf hardware, which enables high flexibility regarding employed cameras, lenses and filters. Hence, the spectral range can be easily varied by mounting a different set of filters. Moreover, the concept of using off-the-shelf hardware enables low prices in comparison to other approaches with highly specialized hardware. Since classical industrial cameras are used in this hyperspectral camera array, the spatial and temporal resolution is very high, while recording 37 hyperspectral channels in the range from 400 nm to 760 nm in 10 nm steps. A registration process is required for near-field imaging, which maps the peripheral camera views to the center view. It is shown that this combination using a hyperspectral camera array and the corresponding image registration pipeline is superior in comparison to other popular snapshot approaches. For this evaluation, a synthetic hyperspectral database is rendered. On the synthetic data, the novel approach outperforms its best competitor by more than 3 dB in reconstruction quality. This synthetic data is also used to show the superiority of the hexagonal shape in comparison to an orthogonal-spaced one. Moreover, a real-world high resolution hyperspectral video database is provided.
Abstract:Enabling high compression efficiency while keeping encoding energy consumption at a low level, requires prioritization of which videos need more sophisticated encoding techniques. However, the effects vary highly based on the content, and information on how good a video can be compressed is required. This can be measured by estimating the encoded bitstream size prior to encoding. We identified the errors between estimated motion vectors from Motion Search, an algorithm that predicts temporal changes in videos, correlates well to the encoded bitstream size. Combining Motion Search with Random Forests, the encoding bitrate can be estimated with a Pearson correlation of above 0.96.
Abstract:In this work, we present a comparison between color spaces namely YUV, LAB, RGB and their effect on learned image compression. For this we use the structure and color based learned image codec (SLIC) from our prior work, which consists of two branches - one for the luminance component (Y or L) and another for chrominance components (UV or AB). However, for the RGB variant we input all 3 channels in a single branch, similar to most learned image codecs operating in RGB. The models are trained for multiple bitrate configurations in each color space. We report the findings from our experiments by evaluating them on various datasets and compare the results to state-of-the-art image codecs. The YUV model performs better than the LAB variant in terms of MS-SSIM with a Bj{\o}ntegaard delta bitrate (BD-BR) gain of 7.5\% using VTM intra-coding mode as the baseline. Whereas the LAB variant has a better performance than YUV model in terms of CIEDE2000 having a BD-BR gain of 8\%. Overall, the RGB variant of SLIC achieves the best performance with a BD-BR gain of 13.14\% in terms of MS-SSIM and a gain of 17.96\% in CIEDE2000 at the cost of a higher model complexity.
Abstract:Multispectral imaging aims at recording images in different spectral bands. This is extremely beneficial in diverse discrimination applications, for example in agriculture, recycling or healthcare. One approach for snapshot multispectral imaging, which is capable of recording multispectral videos, is by using camera arrays, where each camera records a different spectral band. Since the cameras are at different spatial positions, a registration procedure is necessary to map every camera to the same view. In this paper, we present a multispectral snapshot image registration with three novel components. First, a cross spectral disparity estimation network is introduced, which is trained on a popular stereo database using pseudo spectral data augmentation. Subsequently, this disparity estimation is used to accurately detect occlusions by warping the disparity map in a layer-wise manner. Finally, these detected occlusions are reconstructed by a learned deep guided neural network, which leverages the structure from other spectral components. It is shown that each element of this registration process as well as the final result is superior to the current state of the art. In terms of PSNR, our registration achieves an improvement of over 3 dB. At the same time, the runtime is decreased by a factor of over 3 on a CPU. Additionally, the registration is executable on a GPU, where the runtime can be decreased by a factor of 111. The source code and the data is available at https://github.com/FAU-LMS/MSIR.
Abstract:Today, image and video data is not only viewed by humans, but also automatically analyzed by computer vision algorithms. However, current coding standards are optimized for human perception. Emerging from this, research on video coding for machines tries to develop coding methods designed for machines as information sink. Since many of these algorithms are based on neural networks, most proposals for video coding for machines build upon neural compression. So far, optimizing the compression by applying the task loss of the analysis network, for which ground truth data is needed, is achieving the best coding performance. But ground truth data is difficult to obtain and thus an optimization without ground truth is preferred. In this paper, we present an annotation-free optimization strategy for video coding for machines. We measure the distortion by calculating the task loss of the analysis network. Therefore, the predictions on the compressed image are compared with the predictions on the original image, instead of the ground truth data. Our results show that this strategy can even outperform training with ground truth data with rate savings of up to 7.5 %. By using the non-annotated training data, the rate gains can be further increased up to 8.2 %.
Abstract:The Bj{\o}ntegaard Delta rate (BD-rate) objectively assesses the coding efficiency of video codecs using the rate-distortion (R-D) performance but overlooks encoding energy, which is crucial in practical applications, especially for those on handheld devices. Although R-D analysis can be extended to incorporate encoding energy as energy-distortion (E-D), it fails to integrate all three parameters seamlessly. This work proposes a novel approach to address this limitation by introducing a 3D representation of rate, encoding energy, and distortion through surface fitting. In addition, we evaluate various surface fitting techniques based on their accuracy and investigate the proposed 3D representation and its projections. The overlapping areas in projections help in encoder selection and recommend avoiding the slow presets of the older encoders (x264, x265), as the recent encoders (x265, VVenC) offer higher quality for the same bitrate-energy performance and provide a lower rate for the same energy-distortion performance.
Abstract:Learned wavelet image and video coding approaches provide an explainable framework with a latent space corresponding to a wavelet decomposition. The wavelet image coder iWave++ achieves state-of-the-art performance and has been employed for various compression tasks, including lossy as well as lossless image, video, and medical data compression. However, the approaches suffer from slow decoding speed due to the autoregressive context model used in iWave++. In this paper, we show how a parallelized context model can be integrated into the iWave++ framework. Our experimental results demonstrate a speedup factor of over 350 and 240 for image and video compression, respectively. At the same time, the rate-distortion performance in terms of Bj{\o}ntegaard delta bitrate is slightly worse by 1.5\% for image coding and 1\% for video coding. In addition, we analyze the learned wavelet decomposition by visualizing its subband impulse responses.
Abstract:With the increasing efforts of bringing high-quality virtual reality technologies into the market, efficient 360-degree video compression gains in importance. As such, the state-of-the-art H.266/VVC video coding standard integrates dedicated tools for 360-degree video, and considerable efforts have been put into designing 360-degree projection formats with improved compression efficiency. For the fast-evolving field of neural video compression networks (NVCs), the effects of different 360-degree projection formats on the overall compression performance have not yet been investigated. It is thus unclear, whether a resampling from the conventional equirectangular projection (ERP) to other projection formats yields similar gains for NVCs as for hybrid video codecs, and which formats perform best. In this paper, we analyze several generations of NVCs and an extensive set of 360-degree projection formats with respect to their compression performance for 360-degree video. Based on our analysis, we find that projection format resampling yields significant improvements in compression performance also for NVCs. The adjusted cubemap projection (ACP) and equatorial cylindrical projection (ECP) show to perform best and achieve rate savings of more than 55% compared to ERP based on WS-PSNR for the most recent NVC. Remarkably, the observed rate savings are higher than for H.266/VVC, emphasizing the importance of projection format resampling for NVCs.
Abstract:Energy efficiency for video communications and video-on-demand streaming is essential for mobile devices with a limited battery capacity. Therefore, hardware (HW) decoder implementations are commonly used to significantly reduce the energetic load of video playback. The energy consumption of such a HW implementation largely depends on a previously finalized standardization of a video codec that specifies which coding tools and methods are included in the new video codec. However, during the standardization, the true complexity of a HW implementation is unknown, and the adoption of coding tools relies solely on the expertise of experts in the industry. By using software (SW) decoder profiling, we are able to estimate the SW decoding energy demand with an average error of 1.25%. We propose a method that accurately models the energy demand of existing HW decoders with an average error of 1.79% by exploiting information from software (SW) decoder profiling. Motivated by the low estimation error, we propose a HW decoding energy metric that can predict and estimate the complexity of an unknown HW implementation using information from existing HW decoder implementations and available SW implementations of the future video decoder. By using multiple video codecs for model training, we can predict the complexity of a HW decoder with an error of less than 8% and a minimum error of 4.54% without using the corresponding HW decoder for training.
Abstract:Energy and compression efficiency are two essential parts of modern video decoder implementations that have to be considered. This work comprehensively studies the following six video coding formats regarding compression and decoding energy efficiency: AVC, VP9, HEVC, AV1, VVC, and AVM. We first evaluate the energy demand of reference and optimized software decoder implementations. Furthermore, we consider the influence of the usage of SIMD instructions on those decoder implementations. We find that AV1 is a sweet spot for optimized software decoder implementations with an additional energy demand of 16.55% and bitrate savings of -43.95% compared to VP9. We furthermore evaluate the hardware decoding energy demand of four video coding formats. Thereby, we show that AV1 has energy demand increases by 117.50% compared to VP9. For HEVC, we found a sweet spot in terms of energy demand with an increase of 6.06% with respect to VP9. Relative to their optimized software counterparts, hardware video decoders reduce the energy consumption to less than 9% compared to software decoders.