Human gaze is essential for various appealing applications. Aiming at more accurate gaze estimation, a series of recent works propose to utilize face and eye images simultaneously. Nevertheless, face and eye images only serve as independent or parallel feature sources in those works, the intrinsic correlation between their features is overlooked. In this paper we make the following contributions: 1) We propose a coarse-to-fine strategy which estimates a basic gaze direction from face image and refines it with corresponding residual predicted from eye images. 2) Guided by the proposed strategy, we design a framework which introduces a bi-gram model to bridge gaze residual and basic gaze direction, and an attention component to adaptively acquire suitable fine-grained feature. 3) Integrating the above innovations, we construct a coarse-to-fine adaptive network named CA-Net and achieve state-of-the-art performances on MPIIGaze and EyeDiap.
Typical methods for text-to-image synthesis seek to design effective generative architecture to model the text-to-image mapping directly. It is fairly arduous due to the cross-modality translation involved in the task of text-to-image synthesis. In this paper we circumvent this problem by focusing on parsing the content of both the input text and the synthesized image thoroughly to model the text-to-image consistency in the semantic level. In particular, we design a memory structure to parse the textual content by exploring semantic correspondence between each word in the vocabulary to its various visual contexts across relevant images in training data during text encoding. On the other hand, the synthesized image is parsed to learn its semantics in an object-aware manner. Moreover, we customize a conditional discriminator, which models the fine-grained correlations between words and image sub-regions to push for the cross-modality semantic alignment between the input text and the synthesized image. Thus, a full-spectrum content-oriented parsing in the deep semantic level is performed by our model, which is referred to as Content-Parsing Generative Adversarial Networks (CPGAN). Extensive experiments on COCO dataset manifest that CPGAN advances the state-of-the-art performance significantly.
Thanks to the latest progress in image sensor manufacturing technology, the emergence of the single-chip polarized color sensor is likely to bring advantages to computer vision tasks. Despite the importance of the sensor, joint chromatic and polarimetric demosaicing is the key to obtaining the high-quality RGB-Polarization image for the sensor. Since the polarized color sensor is equipped with a new type of chip, the demosaicing problem cannot be currently well-addressed by former methods. In this paper, we propose a joint chromatic and polarimetric demosaicing model to address this challenging problem. To solve this non-convex problem, we further present a sparse representation-based optimization strategy that utilizes chromatic information and polarimetric information to jointly optimize the model. In addition, we build an optical data acquisition system to collect an RGB-Polarization dataset. Results of both qualitative and quantitative experiments have shown that our method is capable of faithfully recovering full 12-channel chromatic and polarimetric information for each pixel from a single mosaic input image. Moreover, we show that the proposed method can perform well not only on the synthetic data but the real captured data.
Intrinsic image decomposition, which is an essential task in computer vision, aims to infer the reflectance and shading of the scene. It is challenging since it needs to separate one image into two components. To tackle this, conventional methods introduce various priors to constrain the solution, yet with limited performance. Meanwhile, the problem is typically solved by supervised learning methods, which is actually not an ideal solution since obtaining ground truth reflectance and shading for massive general natural scenes is challenging and even impossible. In this paper, we propose a novel unsupervised intrinsic image decomposition framework, which relies on neither labeled training data nor hand-crafted priors. Instead, it directly learns the latent feature of reflectance and shading from unsupervised and uncorrelated data. To enable this, we explore the independence between reflectance and shading, the domain invariant content constraint and the physical constraint. Extensive experiments on both synthetic and real image datasets demonstrate consistently superior performance of the proposed method.
Real-time semantic segmentation plays a significant role in industry applications, such as autonomous driving, robotics and so on. It is a challenging task as both efficiency and performance need to be considered simultaneously. To address such a complex task, this paper proposes an efficient CNN called Multiply Spatial Fusion Network (MSFNet) to achieve fast and accurate perception. The proposed MSFNet uses Class Boundary Supervision to process the relevant boundary information based on our proposed Multi-features Fusion Module which can obtain spatial information and enlarge receptive field. Therefore, the final upsampling of the feature maps of 1/8 original image size can achieve impressive results while maintaining a high speed. Experiments on Cityscapes and Camvid datasets show an obvious advantage of the proposed approach compared with the existing approaches. Specifically, it achieves 77.1% Mean IOU on the Cityscapes test dataset with the speed of 41 FPS for a 1024*2048 input, and 75.4% Mean IOU with the speed of 91 FPS on the Camvid test dataset.
This document introduces the background and the usage of the Hyperspectral City Dataset and the benchmark. The documentation first starts with the background and motivation of the dataset. Follow it, we briefly describe the method of collecting the dataset and the processing method from raw dataset to the final release dataset, specifically, the version 1.0. We also provide the detailed usage of the dataset and the evaluation metric for submitted the result for the 2019 Hyperspectral City Challenge.
Low-light image enhancement is a challenging task since various factors, including brightness, contrast, artifacts and noise, should be handled simultaneously and effectively. To address such a difficult problem, this paper proposes a novel attention-guided enhancement solution and delivers the corresponding end-to-end multi-branch CNNs. The key of our method is the computation of two attention maps to guide the exposure enhancement and denoising respectively. In particular, the first attention map distinguishes underexposed regions from normally exposed regions, while the second attention map distinguishes noises from real-world textures. Under their guidance, the proposed multi-branch enhancement network can work in an adaptive way. Other contributions of this paper include the "decomposition/multi-branch-enhancement/fusion" design of the enhancement network, the reinforcement-net for contrast enhancement, and the proposed large-scale low-light enhancement dataset. We evaluate the proposed method through extensive experiments, and the results demonstrate that our solution outperforms state-of-the-art methods by a large margin. We additionally show that our method is flexible and effective for other image processing tasks.
Reflection is common in images capturing scenes behind a glass window, which is not only a disturbance visually but also influence the performance of other computer vision algorithms. Single image reflection removal is an ill-posed problem because the color at each pixel needs to be separated into two values, i.e., the desired clear background and the reflection. To solve it, existing methods propose priors such as smoothness, color consistency. However, the low-level priors are not reliable in complex scenes, for instance, when capturing a real outdoor scene through a window, both the foreground and background contain both smooth and sharp area and a variety of color. In this paper, inspired by the fact that human can separate the two layers easily by recognizing the objects, we use the object semantic as guidance to force the same semantic object belong to the same layer. Extensive experiments on different datasets show that adding the semantic information offers a significant improvement to reflection separation. We also demonstrate the applications of the proposed method to other computer vision tasks.
Deep neural networks (DNNs) have become popular for medical image analysis tasks like cancer diagnosis and lesion detection. However, a recent study demonstrates that medical deep learning systems can be compromised by carefully-engineered adversarial examples/attacks, i.e., small imperceptible perturbations can fool DNNs to predict incorrectly. This raises safety concerns about the deployment of deep learning systems in clinical settings. In this paper, we provide a deeper understanding of adversarial examples in the context of medical images. We find that medical DNN models can be more vulnerable to adversarial attacks compared to natural ones from three different viewpoints: 1) medical image DNNs that have only a few classes are generally easier to be attacked; 2) the complex biological textures of medical images may lead to more vulnerable regions; and most importantly, 3) state-of-the-art deep networks designed for large-scale natural image processing can be overparameterized for medical imaging tasks and result in high vulnerability to adversarial attacks. Surprisingly, we also find that medical adversarial attacks can be easily detected, i.e., simple detectors can achieve over 98% detection AUCs against state-of-the-art attacks, due to their fundamental feature difference from normal examples. We show this is because adversarial attacks tend to attack a wide spread area outside the pathological regions, which results in deep features that are fundamentally different and easily separable from normal features. We believe these findings may be a useful basis to approach the design of secure medical deep learning systems.
Unsupervised single image layer separation aims at extracting two layers from an input image where these layers follow different distributions. This problem arises most notably in reflection inference removal and intrinsic image decomposition. Since there exist an infinite set of combinations that can construct the given input image, one could infer nothing about the solutions without additional assumptions. To address the problem, we make the shared information consistency assumption and separated layer independence assumption to constrain the solutions. In this end, we propose an unsupervised single image separation framework based on cycle GANs and self-supervised learning. The proposed framework is applied for the reflection removal and intrinsic image problems. Numerical and visual results show that the proposed method achieves the state-of-the-art performance among unsupervised methods which require single image as input. Based on the slightly modified version of the presented framework, we also demonstrate the promising results of decomposing an image into three layer.