One of the key components for video deblurring is how to exploit neighboring frames. Recent state-of-the-art methods either used aligned adjacent frames to the center frame or propagated the information on past frames to the current frame recurrently. Here we propose multi-blur-to-deblur (MB2D), a novel concept to exploit neighboring frames for efficient video deblurring. Firstly, inspired by unsharp masking, we argue that using more blurred images with long exposures as additional inputs significantly improves performance. Secondly, we propose multi-blurring recurrent neural network (MBRNN) that can synthesize more blurred images from neighboring frames, yielding substantially improved performance with existing video deblurring methods. Lastly, we propose multi-scale deblurring with connecting recurrent feature map from MBRNN (MSDR) to achieve state-of-the-art performance on the popular GoPro and Su datasets in fast and memory efficient ways.
Deep learning has achieved remarkable performance in various tasks thanks to massive labeled datasets. However, there are often cases where labeling large amount of data is challenging or infeasible due to high labeling cost such as labeling by experts or long labeling time per large-scale data sample (e.g., video, very large image). Active learning is one of the ways to query the most informative samples to be annotated among massive unlabeled pool. Two promising directions for active learning that have been recently explored are data distribution-based approach to select data points that are far from current labeled pool and model uncertainty-based approach that relies on the perspective of task model. Unfortunately, the former does not exploit structures from tasks and the latter does not seem to well-utilize overall data distribution. Here, we propose the methods that simultaneously take advantage of both data distribution and model uncertainty approaches. Our proposed methods exploit variational adversarial active learning (VAAL), that considered data distribution of both label and unlabeled pools, by incorporating learning loss prediction module and RankCGAN concept into VAAL by modeling loss prediction as a ranker. We demonstrate that our proposed methods outperform recent state-of-the-art active learning methods on various balanced and imbalanced benchmark datasets.
Multi-scale (MS) approaches have been widely investigated for blind single image / video deblurring that sequentially recovers deblurred images in low spatial scale first and then in high spatial scale later with the output of lower scales. MS approaches have been effective especially for severe blurs induced by large motions in high spatial scale since those can be seen as small blurs in low spatial scale. In this work, we investigate alternative approach to MS, called multi-temporal (MT) approach, for non-uniform single image deblurring. We propose incremental temporal training with constructed MT level dataset from time-resolved dataset, develop novel MT-RNNs with recurrent feature maps, and investigate progressive single image deblurring over iterations. Our proposed MT methods outperform state-of-the-art MS methods on the GoPro dataset in PSNR with the smallest number of parameters.
Recently, robotic grasp detection (GD) and object detection (OD) with reasoning have been investigated using deep neural networks (DNNs). There have been works to combine these multi-tasks using separate networks so that robots can deal with situations of grasping specific target objects in the cluttered, stacked, complex piles of novel objects from a single RGB-D camera. We propose a single multi-task DNN that yields the information on GD, OD and relationship reasoning among objects with a simple post-processing. Our proposed methods yielded state-of-the-art performance with the accuracy of 98.6% and 74.2% and the computation speed of 33 and 62 frame per second on VMRD and Cornell datasets, respectively. Our methods also yielded 95.3% grasp success rate for single novel object grasping with a 4-axis robot arm and 86.7% grasp success rate in cluttered novel objects with a Baxter robot.
Multi-scale approach has been used for blind image / video deblurring problems to yield excellent performance for both conventional and recent deep-learning-based state-of-the-art methods. Bicubic down-sampling is a typical choice for multi-scale approach to reduce spatial dimension after filtering with a fixed kernel. However, this fixed kernel may be sub-optimal since it may destroy important information for reliable deblurring such as strong edges. We propose convolutional neural network (CNN)-based down-scale methods for multi-scale deep-learning-based non-uniform single image deblurring. We argue that our CNN-based down-scaling effectively reduces the spatial dimension of the original image, while learned kernels with multiple channels may well-preserve necessary details for deblurring tasks. For each scale, we adopt to use RCAN (Residual Channel Attention Networks) as a backbone network to further improve performance. Our proposed method yielded state-of-the-art performance on GoPro dataset by large margin. Our proposed method was able to achieve 2.59dB higher PSNR than the current state-of-the-art method by Tao. Our proposed CNN-based down-scaling was the key factor for this excellent performance since the performance of our network without it was decreased by 1.98dB. The same networks trained with GoPro set were also evaluated on large-scale Su dataset and our proposed method yielded 1.15dB better PSNR than the Tao's method. Qualitative comparisons on Lai dataset also confirmed the superior performance of our proposed method over other state-of-the-art methods.
Recently, Noise2Noise has been proposed for unsupervised training of deep neural networks in image restoration problems including denoising Gaussian noise. However, it does not work well for truncated noise with non-zero mean. Here, we perform theoretical analysis on Noise2Noise for the limited case of Gaussian noise removal using Stein's Unbiased Risk Estimator (SURE). We extend SURE to deal with a pair of noise realizations to directly compare with Noise2Noise. Then, we show that Noise2Noise with Gaussian noise is a special case of our newly extended SURE with a pair of uncorrelated noise realizations. Lastly, we propose a compensation method for clipped Gaussian noise to approximately follow Normal distribution and show how this compensation method can be used for SURE based unsupervised denoiser training. We also show that our theoretical analysis provides insights on how to use Noise2Noise for clipped Gaussian noise.
Conventional optimization based methods have utilized forward models with image priors to solve inverse problems in image processing. Recently, deep neural networks (DNN) have been investigated to significantly improve the image quality of the solution for inverse problems. Most DNN based inverse problems have focused on using data-driven image priors with massive amount of data. However, these methods often do not inherit nice properties of conventional approaches using theoretically well-grounded optimization algorithms such as monotone, global convergence. Here we investigate another possibility of using DNN for inverse problems in image processing. We propose methods to use DNNs to seamlessly speed up convergence rates of conventional optimization based methods. Our DNN-incorporated scaled gradient projection methods, without breaking theoretical properties, significantly improved convergence speed over state-of-the-art conventional optimization methods such as ISTA or FISTA in practice for inverse problems such as image inpainting, compressive image recovery with partial Fourier samples, image deblurring, and medical image reconstruction with sparse-view projections.
Deep learning has improved many computer vision tasks by utilizing data-driven features instead of using hand-crafted features. However, geometric transformations of input images often degrade the performance of deep learning based methods. In particular, rotation-invariant features are important in computer vision tasks such as face detection, biological feature detection of microscopy images, or robot grasp detection since the input image can be fed into the network with any rotation angle. In this paper, we propose rotation ensemble module (REM) to efficiently train and utilize rotation-invariant features in a deep neural network for computer vision tasks. We evaluated our proposed REM with face detection tasks on FDDB dataset, robotic grasp detection tasks on Cornell dataset, and real robotic grasp tasks with several novel objects. REM based face detection deep neural networks yielded up to 50.8% accuracy in face detection task on FDDB dataset at false rate 20 with IOU 75%, which is about 10.7% higher than the baseline. Robotic grasp detection deep neural networks with our REM also yielded up to 97.6% accuracy in robotic grasp detection on Cornell dataset that is higher than current state-of-the-art performance. In robotic grasp task using a real 4-axis robotic arm with several novel objects, our REM based robotic grasp achieved up to 93.8%, which is significantly higher than the baseline robotic grasps (11.0-56.3%).
Deep learning based single image super-resolution (SR) methods have been rapidly evolved over the past few years and have yielded state-of-the-art performances over conventional methods. Since these methods usually minimized l1 loss between the output SR image and the ground truth image, they yielded very high peak signal-to-noise ratio (PSNR) that is inversely proportional to these losses. Unfortunately, minimizing these losses inevitably lead to blurred edges due to averaging of plausible solutions. Recently, SRGAN was proposed to avoid this average effect by minimizing perceptual losses instead of l1 loss and it yielded perceptually better SR images (or images with sharp edges) at the price of lowering PSNR. In this paper, we propose SREdgeNet, edge enhanced single image SR network, that was inspired by conventional SR theories so that average effect could be avoided not by changing the loss, but by changing the SR network property with the same l1 loss. Our SREdgeNet consists of 3 sequential deep neural network modules: the first module is any state-of-the-art SR network and we selected a variant of EDSR. The second module is any edge detection network taking the output of the first SR module as an input and we propose DenseEdgeNet for this module. Lastly, the third module is merging the outputs of the first and second modules to yield edge enhanced SR image and we propose MergeNet for this module. Qualitatively, our proposed method yielded images with sharp edges compared to other state-of-the-art SR methods. Quantitatively, our SREdgeNet yielded state-of-the-art performance in terms of structural similarity (SSIM) while maintained comparable PSNR for x8 enlargement.
Robotic grasp detection for novel objects is a challenging task, but for the last few years, deep learning based approaches have achieved remarkable performance improvements, up to 96.1% accuracy, with RGB-D data. In this paper, we propose fully convolutional neural network (FCNN) based methods for robotic grasp detection. Our methods also achieved state-of-the-art detection accuracy (up to 96.6%) with state-of- the-art real-time computation time for high-resolution images (6-20ms per 360x360 image) on Cornell dataset. Due to FCNN, our proposed method can be applied to images with any size for detecting multigrasps on multiobjects. Proposed methods were evaluated using 4-axis robot arm with small parallel gripper and RGB-D camera for grasping challenging small, novel objects. With accurate vision-robot coordinate calibration through our proposed learning-based, fully automatic approach, our proposed method yielded 90% success rate.