Rain removal aims to remove rain streaks from images/videos and reduce the disruptive effects caused by rain. It not only enhances image/video visibility but also allows many computer vision algorithms to function properly. This paper makes the first attempt to conduct a comprehensive study on the robustness of deep learning-based rain removal methods against adversarial attacks. Our study shows that, when the image/video is highly degraded, rain removal methods are more vulnerable to adversarial attacks, as small distortions/perturbations become less noticeable or detectable. In this paper, we first present a comprehensive empirical evaluation of various methods under different levels of attacks and with various losses/targets for generating the perturbations, from the perspectives of human perception and machine analysis tasks. A systematic evaluation of key modules in existing methods is performed in terms of their robustness against adversarial attacks. Based on the insights from our analysis, we construct a more robust deraining method by integrating these effective modules. Finally, we examine various types of adversarial attacks that are specific to deraining problems and their effects on both human and machine vision tasks, including 1) rain region attacks, which add perturbations only in the rain regions to make the perturbations in the attacked rain images less visible; and 2) object-sensitive attacks, which add perturbations only in regions near the given objects. Code is available at https://github.com/yuyi-sd/Robust_Rain_Removal.
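As a rough illustration of the rain-region attack described above, the sketch below confines a PGD perturbation to a binary rain-streak mask; the deraining model, the mask source, the untargeted objective, and the hyperparameters are illustrative assumptions rather than the paper's exact settings. An object-sensitive attack follows the same pattern with the rain mask replaced by a mask around the given objects.

```python
import torch
import torch.nn.functional as F

def rain_region_pgd(model, rain_img, rain_mask, eps=8/255, alpha=2/255, steps=10):
    """rain_img: (B,3,H,W) in [0,1]; rain_mask: (B,1,H,W) binary rain-streak mask."""
    with torch.no_grad():
        target = model(rain_img)               # the model's own unattacked output
    delta = torch.zeros_like(rain_img, requires_grad=True)
    for _ in range(steps):
        derained = model(rain_img + delta * rain_mask)
        # Untargeted objective: push the derained result away from the
        # unattacked prediction (gradient *ascent* on the MSE surrogate).
        loss = F.mse_loss(derained, target)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)            # L_inf budget
        delta.grad = None
    # The perturbation only touches pixels inside the rain regions.
    return (rain_img + delta.detach() * rain_mask).clamp(0, 1)
```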
Learned image compression has achieved great success due to its excellent modeling capacity, but it seldom further considers the Rate-Distortion Optimization (RDO) of each input image. To explore this potential in the learned codec, we make the first attempt to build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image. Specifically, apart from the image content stream, we employ an additional model stream to generate the transform parameters at the decoder side. The presence of a model stream enables our model to learn more abstract neural-syntax, which helps cluster the latent representations of images more compactly. Beyond the transform stage, we also adopt neural-syntax-based post-processing for scenarios that require higher-quality reconstructions regardless of the extra decoding overhead. Moreover, the involvement of the model stream makes it possible to optimize both the representation and the decoder in an online way, i.e., to perform RDO at testing time. This is equivalent to a continuous online mode decision, like the coding modes in traditional codecs, which improves the coding efficiency based on the individual input image. The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism, demonstrating the superiority of our method in coding efficiency compared to the latest conventional standard, Versatile Video Coding (VVC), and other state-of-the-art learning-based methods.
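To make the continuous online mode decision concrete, the following sketch refines a per-image latent by gradient descent on a joint rate-distortion objective at encoding time. The `codec.encode`/`codec.decode`/`codec.rate` interface and all hyperparameters are hypothetical placeholders, not the paper's actual API.

```python
import torch

def online_rdo(codec, image, lam=0.01, steps=100, lr=1e-3):
    # Start from the encoder's latent and refine it for this image only.
    y = codec.encode(image).detach().requires_grad_(True)
    opt = torch.optim.Adam([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rate = codec.rate(y)                  # estimated bits from the entropy model
        dist = torch.mean((codec.decode(y) - image) ** 2)
        (dist + lam * rate).backward()        # J = D + lambda * R, per image
        opt.step()
    return y.detach()                         # refined latent to be entropy-coded
```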
There is an increasing consensus that the design and optimization of low-light image enhancement methods need to be fully driven by perceptual quality. While numerous approaches have been proposed to enhance low-light images, much less work has been dedicated to the quality assessment and quality optimization of low-light enhancement. In this paper, to close the gap between enhancement and assessment, we propose a loop enhancement framework that produces a clear picture of how the enhancement of low-light images can be optimized towards better visual quality. In particular, we create a large-scale database for QUality assessment Of The Enhanced LOw-Light image (QUOTE-LOL), which serves as the foundation for studying and developing objective quality assessment measures. The objective quality assessment measure plays a critical bridging role between visual quality and enhancement, and is further incorporated into the optimization when learning the enhancement model towards perceptual optimality. Finally, we iteratively perform the enhancement and optimization tasks, enhancing the low-light images continuously. The superiority of the proposed scheme is validated on various low-light scenes. The database as well as the code will be made available.
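One round of the enhance-then-optimize loop can be sketched as follows: a learned quality measure (e.g., one trained on a database like QUOTE-LOL) scores the enhanced image, and its score is back-propagated to update the enhancer. Both `enhancer` and `quality_net` are placeholder modules, and the loop structure is a simplification of the proposed scheme.

```python
import torch

def loop_enhance(enhancer, quality_net, low_light_batch, rounds=3, lr=1e-4):
    quality_net.eval()                        # the quality measure stays frozen
    for p in quality_net.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(enhancer.parameters(), lr=lr)
    for _ in range(rounds):
        enhanced = enhancer(low_light_batch)
        # quality_net outputs a perceptual-quality score (higher is better),
        # so minimizing its negation pushes the enhancer toward better quality.
        loss = -quality_net(enhanced).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return enhancer(low_light_batch).detach()
```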
Images captured in low-light conditions suffer from low visibility and various imaging artifacts, e.g., real noise. Existing supervised enhancement algorithms require a large set of pixel-aligned training image pairs, which are hard to prepare in practice. Though weakly supervised or unsupervised methods can alleviate such challenges without using paired training images, some real-world artifacts inevitably get falsely amplified because of the lack of corresponding supervision. In this paper, instead of using perfectly aligned images for training, we employ misaligned real-world images as the guidance, which are considerably easier to collect. Specifically, we propose a Cross-Image Disentanglement Network (CIDN) to separately extract cross-image brightness and image-specific content features from low/normal-light images. Based on these features, CIDN can simultaneously correct the brightness and suppress image artifacts in the feature domain, which largely increases the robustness to pixel shifts. Furthermore, we collect a new low-light image enhancement dataset consisting of misaligned training images with real-world corruptions. Experimental results show that our model achieves state-of-the-art performance on both the newly proposed dataset and other popular low-light datasets.
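The sketch below captures the disentangle-and-recombine idea in its simplest form: the low-light image contributes the content code, the misaligned normal-light reference contributes the brightness code, and the decoder fuses them in feature space. All modules here are hypothetical stand-ins, not the actual CIDN architecture.

```python
import torch.nn as nn

class CrossImageDisentangle(nn.Module):
    def __init__(self, content_enc, brightness_enc, decoder):
        super().__init__()
        self.content_enc = content_enc         # image-specific content features
        self.brightness_enc = brightness_enc   # cross-image brightness features
        self.decoder = decoder

    def forward(self, low_img, normal_img):
        content = self.content_enc(low_img)            # keep the scene content
        brightness = self.brightness_enc(normal_img)   # borrow target brightness
        # Recombining in feature space sidesteps the pixel-level misalignment
        # between the low-light input and its normal-light reference.
        return self.decoder(content, brightness)
```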
In this paper, we make the first benchmark effort to elaborate on the superiority of using RAW images for low-light enhancement and develop a novel alternative route to utilize RAW images in a more flexible and practical way. Inspired by a full consideration of the typical image processing pipeline, we develop a new evaluation framework, the Factorized Enhancement Model (FEM), which decomposes the properties of RAW images into measurable factors and provides a tool for empirically exploring how the properties of RAW images affect enhancement performance. The empirical benchmark results show that the Linearity of the data and the Exposure Time recorded in the metadata play the most critical roles, bringing distinct performance gains in various measures over approaches that take sRGB images as input. With the insights obtained from the benchmark results in mind, a RAW-guiding Exposure Enhancement Network (REENet) is developed, which trades off between the advantages and the inaccessibility of RAW images in real applications by using RAW images only in the training phase. REENet projects sRGB images into the linear RAW domain to apply constraints with the corresponding RAW images, thereby reducing the difficulty of model training. In the testing phase, REENet does not rely on RAW images. Experimental results demonstrate not only the superiority of REENet over state-of-the-art sRGB-based methods but also the effectiveness of the RAW guidance and of all components.
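A rough sketch of the training-time RAW guidance is given below, assuming a two-stage split into an sRGB-to-linear projection and an enhancement stage; the module names, loss choices, and weighting are illustrative assumptions rather than the REENet specification.

```python
import torch.nn.functional as F

def raw_guided_loss(derender, enhancer, srgb, raw_linear, target, w_raw=1.0):
    pseudo_linear = derender(srgb)            # sRGB -> estimated linear (RAW) domain
    enhanced = enhancer(pseudo_linear)        # linear domain -> enhanced result
    raw_loss = F.l1_loss(pseudo_linear, raw_linear)   # RAW constraint (train only)
    rec_loss = F.l1_loss(enhanced, target)            # enhancement objective
    return rec_loss + w_raw * raw_loss

# At test time no RAW image is needed: enhanced = enhancer(derender(srgb))
```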
Low-light image enhancement (LLE) remains challenging due to the prevailing low-contrast and weak-visibility problems of single RGB images. In this paper, we respond to an intriguing learning-related question: can leveraging both accessible unpaired over/underexposed images and high-level semantic guidance improve the performance of cutting-edge LLE models? Here, we propose an effective semantically contrastive learning paradigm for LLE (namely SCL-LLE). Beyond existing LLE wisdom, it casts the image enhancement task as multi-task joint learning, where LLE is converted into three constraints of contrastive learning, semantic brightness consistency, and feature preservation, which simultaneously ensure exposure, texture, and color consistency. SCL-LLE allows the LLE model to learn from unpaired positives (normal-light) and negatives (over/underexposed), and enables it to interact with scene semantics to regularize the image enhancement network; this interaction between high-level semantic knowledge and the low-level signal prior has seldom been investigated in previous methods. Trained on readily available open data, our method is shown by extensive experiments to surpass state-of-the-art LLE models on six independent cross-scene datasets. Moreover, SCL-LLE's potential to benefit downstream semantic segmentation under extremely dark conditions is discussed. Source code: https://github.com/LingLIx/SCL-LLE.
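As a minimal illustration of the contrastive constraint, the sketch below pulls the enhanced image toward unpaired normal-light positives and pushes it away from over/underexposed negatives in a fixed feature space (e.g., a pretrained backbone). The distance measure and ratio form are assumptions, not necessarily the exact SCL-LLE loss.

```python
import torch.nn.functional as F

def contrastive_lle_loss(feat_net, enhanced, positives, negatives, eps=1e-7):
    """feat_net: a frozen, pretrained feature extractor; positives/negatives
    are unpaired normal-light and over/underexposed images."""
    f_a = feat_net(enhanced)
    d_pos = F.l1_loss(f_a, feat_net(positives).detach())  # toward normal-light
    d_neg = F.l1_loss(f_a, feat_net(negatives).detach())  # away from bad exposure
    # Minimizing this ratio shrinks the positive distance relative to the
    # negative one, steering the enhancer toward well-exposed results.
    return d_pos / (d_neg + eps)
```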
Video Coding for Machines (VCM) is committed to bridging the largely separate research tracks of video/image compression and feature compression, and attempts to jointly optimize compactness and efficiency from a unified perspective of high-accuracy machine vision and full-fidelity human vision. In this paper, we summarize the VCM methodology and philosophy based on existing academic and industrial efforts. The development of VCM follows a general rate-distortion optimization, and a categorization of the key modules and techniques is established. Previous works demonstrate that, although existing efforts attempt to reveal the nature of scalable representation in bits when dealing with machine and human vision tasks, the generality of low-bit-rate representations, and accordingly how to support a variety of visual analytics tasks, remains rarely studied. Therefore, we investigate a novel visual information compression problem for the analytics taxonomy, to strengthen the capability of compact visual representations extracted from multiple tasks for visual analytics. A new perspective on task relationships versus compression is revisited. Keeping in mind the transferability among different machine vision tasks (e.g., high-level semantic and mid-level geometry-related tasks), we aim to support multiple tasks jointly at low bit rates. In particular, to narrow the dimensionality gap between neural network-generated features extracted from pixels and a variety of machine vision features/labels (e.g., scene class, segmentation labels), a codebook hyperprior is designed to compress the neural network-generated features. As demonstrated in our experiments, this new hyperprior model improves feature compression efficiency by estimating the signal entropy more accurately, which enables further investigation of the granularity of abstracting compact features among different tasks.
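The joint objective behind supporting multiple tasks at low bit rates can be sketched as one shared compressed feature feeding several task heads under a single rate term. The interface below is hypothetical and only meant to show the shape of the optimization.

```python
def multi_task_rd_loss(rate_bits, tasks, shared_feat, lam=0.01):
    """tasks: list of (head, criterion, label) triples, one per analytics
    task (e.g., scene classification, segmentation, depth)."""
    task_loss = sum(criterion(head(shared_feat), label)
                    for head, criterion, label in tasks)
    # A single rate term is amortized across every supported task, so the
    # shared feature is pushed to stay compact yet serve all of them.
    return task_loss + lam * rate_bits
```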
Enhancing low-light images to normally exposed ones is highly ill-posed: the mapping between them is one-to-many. Previous works based on pixel-wise reconstruction losses and deterministic processes fail to capture the complex conditional distribution of normally exposed images, which results in improper brightness, residual noise, and artifacts. In this paper, we propose to model this one-to-many relationship via a normalizing flow model: an invertible network that takes the low-light images/features as the condition and learns to map the distribution of normally exposed images into a Gaussian distribution. In this way, the conditional distribution of the normally exposed images can be well modeled, and the enhancement process, i.e., the other inference direction of the invertible network, is equivalent to being constrained by a loss function that better describes the manifold structure of natural images during training. Experimental results on existing benchmark datasets show that our method achieves better quantitative and qualitative results, obtaining better-exposed illumination, less noise, fewer artifacts, and richer colors.
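A minimal sketch of the conditional flow objective follows: the invertible network maps a normally exposed image, conditioned on the low-light input, to a latent z, and training minimizes the exact negative log-likelihood. The `flow` interface returning (z, log|det J|) is an assumed convention, not the paper's code.

```python
import math
import torch

def flow_nll(flow, normal_img, low_light_cond):
    z, logdet = flow(normal_img, cond=low_light_cond)   # forward direction
    # log p(x | cond) = log N(z; 0, I) + log |det dz/dx|
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).flatten(1).sum(dim=1)
    return -(log_pz + logdet).mean()

# Enhancement runs the inverse direction: draw z ~ N(0, I) (or take z = 0)
# and compute flow.inverse(z, cond=low_light_cond) to obtain an enhanced image.
```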
Visual analytics has played an increasingly critical role in the Internet of Things, where massive visual signals have to be compressed and fed into machines. Facing such big data and constrained bandwidth, existing image/video compression methods lead to very low-quality representations, while existing feature compression techniques fail to support diversified visual analytics applications/tasks with low-bit-rate representations. In this paper, we raise and study the novel problem of supporting multiple machine vision analytics tasks with a compressed visual representation, namely, the information compression problem in analytics taxonomy. By utilizing the intrinsic transferability among different tasks, our framework constructs compact and expressive representations at low bit rates that support a diversified set of machine vision tasks, including both high-level semantic-related tasks and mid-level geometry analytics tasks. In order to impose compactness on the representations, we propose a codebook-based hyperprior, which helps map the representation onto a low-dimensional manifold. As it fits the signal structure of deep visual features well, it facilitates more accurate entropy estimation and results in higher compression efficiency. With the proposed framework and the codebook-based hyperprior, we further investigate the relationship among task features with different levels of abstraction granularity. Experimental results demonstrate that, with the proposed scheme, a diversified set of tasks can be supported at a significantly lower bit rate than with existing compression schemes.
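The codebook-based hyperprior can be sketched as snapping each latent vector to its nearest codeword, coding the codeword index as cheap side information, and letting the selected codeword parameterize the entropy model of the latent. The shapes, the straight-through gradient trick, and the module names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CodebookHyperprior(nn.Module):
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        self.to_params = nn.Linear(dim, 2 * dim)    # codeword -> (mu, log_scale)

    def forward(self, latent):                      # latent: (N, dim)
        dists = torch.cdist(latent, self.codebook)  # distance to every codeword
        idx = dists.argmin(dim=1)                   # index = side information
        code = self.codebook[idx]
        # Straight-through estimator keeps gradients flowing to the encoder.
        code = latent + (code - latent).detach()
        mu, log_scale = self.to_params(code).chunk(2, dim=1)
        return idx, mu, log_scale                   # parameters for entropy coding
```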