Several coded exposure techniques have been proposed for acquiring high frame rate videos at low bandwidth. Most recently, a Coded-2-Bucket camera has been proposed that can acquire two compressed measurements in a single exposure, unlike previously proposed coded exposure techniques, which can acquire only a single measurement. Although two measurements are better than one for an effective video recovery, we are yet unaware of the clear advantage of two measurements, either quantitatively or qualitatively. Here, we propose a unified learning-based framework to make such a qualitative and quantitative comparison between those which capture only a single coded image (Flutter Shutter, Pixel-wise coded exposure) and those that capture two measurements per exposure (C2B). Our learning-based framework consists of a shift-variant convolutional layer followed by a fully convolutional deep neural network. Our proposed unified framework achieves the state of the art reconstructions in all three sensing techniques. Further analysis shows that when most scene points are static, the C2B sensor has a significant advantage over acquiring a single pixel-wise coded measurement. However, when most scene points undergo motion, the C2B sensor has only a marginal benefit over the single pixel-wise coded exposure measurement.
Lensless imaging has emerged as a potential solution towards realizing ultra-miniature cameras by eschewing the bulky lens in a traditional camera. Without a focusing lens, the lensless cameras rely on computational algorithms to recover the scenes from multiplexed measurements. However, the current iterative-optimization-based reconstruction algorithms produce noisier and perceptually poorer images. In this work, we propose a non-iterative deep learning based reconstruction approach that results in orders of magnitude improvement in image quality for lensless reconstructions. Our approach, called $\textit{FlatNet}$, lays down a framework for reconstructing high-quality photorealistic images from mask-based lensless cameras, where the camera's forward model formulation is known. FlatNet consists of two stages: (1) an inversion stage that maps the measurement into a space of intermediate reconstruction by learning parameters within the forward model formulation, and (2) a perceptual enhancement stage that improves the perceptual quality of this intermediate reconstruction. These stages are trained together in an end-to-end manner. We show high-quality reconstructions by performing extensive experiments on real and challenging scenes using two different types of lensless prototypes: one which uses a separable forward model and another, which uses a more general non-separable cropped-convolution model. Our end-to-end approach is fast, produces photorealistic reconstructions, and is easy to adopt for other mask-based lensless cameras.
Learning-based methods have enabled the recovery of a video sequence from a single motion-blurred image or a single coded exposure image. Recovering video from a single motion-blurred image is a very ill-posed problem and the recovered video usually has many artifacts. In addition to this, the direction of motion is lost and it results in motion ambiguity. However, it has the advantage of fully preserving the information in the static parts of the scene. The traditional coded exposure framework is better-posed but it only samples a fraction of the space-time volume, which is at best 50% of the space-time volume. Here, we propose to use the complementary information present in the fully-exposed (blurred) image along with the coded exposure image to recover a high fidelity video without any motion ambiguity. Our framework consists of a shared encoder followed by an attention module to selectively combine the spatial information from the fully-exposed image with the temporal information from the coded image, which is then super-resolved to recover a non-ambiguous high-quality video. The input to our algorithm is a fully-exposed and coded image pair. Such an acquisition system already exists in the form of a Coded-two-bucket (C2B) camera. We demonstrate that our proposed deep learning approach using blurred-coded image pair produces much better results than those from just a blurred image or just a coded image.
Under Display Cameras present a promising opportunity for phone manufacturers to achieve bezel-free displays by positioning the camera behind semi-transparent OLED screens. Unfortunately, such imaging systems suffer from severe image degradation due to light attenuation and diffraction effects. In this work, we present Deep Atrous Guided Filter (DAGF), a two-stage, end-to-end approach for image restoration in UDC systems. A Low-Resolution Network first restores image quality at low-resolution, which is subsequently used by the Guided Filter Network as a filtering input to produce a high-resolution output. Besides the initial downsampling, our low-resolution network uses multiple, parallel atrous convolutions to preserve spatial resolution and emulates multi-scale processing. Our approach's ability to directly train on megapixel images results in significant performance improvement. We additionally propose a simple simulation scheme to pre-train our model and boost performance. Our overall framework ranks 2nd and 5th in the RLQ-TOD'20 UDC Challenge for POLED and TOLED displays, respectively.
This paper is the report of the first Under-Display Camera (UDC) image restoration challenge in conjunction with the RLQ workshop at ECCV 2020. The challenge is based on a newly-collected database of Under-Display Camera. The challenge tracks correspond to two types of display: a 4k Transparent OLED (T-OLED) and a phone Pentile OLED (P-OLED). Along with about 150 teams registered the challenge, eight and nine teams submitted the results during the testing phase for each track. The results in the paper are state-of-the-art restoration performance of Under-Display Camera Restoration. Datasets and paper are available at https://yzhouas.github.io/projects/UDC/udc.html.
One of the important parameters for the assessment of glaucoma is optic nerve head (ONH) evaluation, which usually involves depth estimation and subsequent optic disc and cup boundary extraction. Depth is usually obtained explicitly from imaging modalities like optical coherence tomography (OCT) and is very challenging to estimate depth from a single RGB image. To this end, we propose a novel method using adversarial network to predict depth map from a single image. The proposed depth estimation technique is trained and evaluated using individual retinal images from INSPIRE-stereo dataset. We obtain a very high average correlation coefficient of 0.92 upon five fold cross validation outperforming the state of the art. We then use the depth estimation process as a proxy task for joint optic disc and cup segmentation.
Thermal imaging is a robust sensing technique but its consumer applicability is limited by the high cost of thermal sensors. Nevertheless, low-resolution thermal cameras are relatively affordable and are also usually accompanied by a high-resolution visible-range camera. This visible-range image can be used as a guide to reconstruct a high-resolution thermal image using guided super-resolution(GSR) techniques. However, the difference in wavelength-range of the input images makes this task challenging. Improper processing can introduce artifacts such as blur and ghosting, mainly due to texture and content mismatch. To this end, we propose a novel algorithm for guided super-resolution that explicitly tackles the issue of texture-mismatch caused due to multimodality. We propose a two-stage network that combines information from a low-resolution thermal and a high-resolution visible image with the help of multi-level edge-extraction and integration. The first stage of our network extracts edge-maps from the visual image at different pyramidal levels and the second stage integrates these edge-maps into our proposed super-resolution network at appropriate layers. Extraction and integration of edges belonging to different scales simplifies the task of GSR as it provides texture to object-level information in a progressive manner. Using multi-level edges also allows us to adjust the contribution of the visual image directly at the time of testing and thus provides controllability at test-time. We perform multiple experiments and show that our method performs better than existing state-of-the-art guided super-resolution methods both quantitatively and qualitatively.
Dual camera systems have assisted in the proliferation of various applications, such as optical zoom, low-light imaging and High Dynamic Range (HDR) imaging. In this work, we explore an optimal method for capturing the scene HDR and disparity map using dual camera setups. Hasinoff et al. (2010) have developed a noise optimal framework for HDR capture from a single camera. We generalize this to the dual camera set-up for estimating both HDR and disparity map. It may seem that dual camera systems can capture HDR in a shorter time. However, disparity estimation is a necessary step, which requires overlap among the images captured by the two cameras. This may lead to an increase in the capture time. To address this conflicting requirement, we propose a novel framework to find the optimal exposure and ISO sequence by minimizing the capture time under the constraints of an upper bound on the disparity error and a lower bound on the per-exposure SNR. We show that the resulting optimization problem is non-convex in general and propose an appropriate initialization technique. To obtain the HDR and disparity map from the optimal capture sequence, we propose a pipeline which alternates between estimating the camera ICRFs and the scene disparity map. We demonstrate that our optimal capture sequence leads to better results than other possible capture sequences. Our results are also close to those obtained by capturing the full stereo stack spanning the entire dynamic range. Finally, we present for the first time a stereo HDR dataset consisting of dense ISO and exposure stack captured from a smartphone dual camera. The dataset consists of 6 scenes, with an average of 142 exposure-ISO image sequence per scene.
Light Field (LF) offers unique advantages such as post-capture refocusing and depth estimation, but low-light conditions limit these capabilities. To restore low-light LFs we should harness the geometric cues present in different LF views, which is not possible using single-frame low-light enhancement techniques. We, therefore, propose a deep neural network for Low-Light Light Field (L3F) restoration, which we refer to as L3Fnet. The proposed L3Fnet not only performs the necessary visual enhancement of each LF view but also preserves the epipolar geometry across views. We achieve this by adopting a two-stage architecture for L3Fnet. Stage-I looks at all the LF views to encode the LF geometry. This encoded information is then used in Stage-II to reconstruct each LF view. To facilitate learning-based techniques for low-light LF imaging, we collected a comprehensive LF dataset of various scenes. For each scene, we captured four LFs, one with near-optimal exposure and ISO settings and the others at different levels of low-light conditions varying from low to extreme low-light settings. The effectiveness of the proposed L3Fnet is supported by both visual and numerical comparisons on this dataset. To further analyze the performance of low-light reconstruction methods, we also propose an L3F-wild dataset that contains LF captured late at night with almost zero lux values. No ground truth is available in this dataset. To perform well on the L3F-wild dataset, any method must adapt to the light level of the captured scene. To do this we propose a novel pre-processing block that makes L3Fnet robust to various degrees of low-light conditions. Lastly, we show that L3Fnet can also be used for low-light enhancement of single-frame images, despite it being engineered for LF data. We do so by converting the single-frame DSLR image into a form suitable to L3Fnet, which we call as pseudo-LF.