In deep networks, the loss of data details significantly degrades image segmentation performance. In this paper, we propose applying the Discrete Wavelet Transform (DWT) to extract data details during feature map down-sampling, and the Inverse DWT (IDWT) with the extracted details during up-sampling to recover them. We first implement DWT/IDWT as general network layers that are applicable to 1D/2D/3D data and to various wavelets such as Haar, Cohen, and Daubechies. Then, we design wavelet integrated deep networks for image segmentation (WaveSNets) based on various architectures, including U-Net, SegNet, and DeepLabv3+. Owing to the effectiveness of DWT/IDWT in processing data details, experimental results on CamVid, Pascal VOC, and Cityscapes show that our WaveSNets achieve better segmentation performance than their vanilla versions.
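As a minimal sketch of the down-sampling/up-sampling idea in the abstract above, the following NumPy example applies a single-level orthonormal Haar DWT along one axis and inverts it with the IDWT. It is an illustration only, not the WaveSNet layers themselves: the function names `haar_dwt_1d` and `haar_idwt_1d` are hypothetical, only the Haar wavelet and 1D data are covered, and an even signal length is assumed.

```python
import numpy as np


def haar_dwt_1d(x):
    """Single-level orthonormal Haar DWT along the last axis.

    Hypothetical helper for illustration; assumes an even length.
    Returns (low, high): the half-length approximation and the details
    that plain down-sampling would throw away.
    """
    x = np.asarray(x, dtype=float)
    if x.shape[-1] % 2 != 0:
        raise ValueError("last axis must have even length")
    even, odd = x[..., 0::2], x[..., 1::2]
    low = (even + odd) / np.sqrt(2.0)   # down-sampled approximation
    high = (even - odd) / np.sqrt(2.0)  # extracted details
    return low, high


def haar_idwt_1d(low, high):
    """Inverse of haar_dwt_1d: up-sample using the stored details."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    out = np.empty(low.shape[:-1] + (2 * low.shape[-1],))
    out[..., 0::2] = (low + high) / np.sqrt(2.0)
    out[..., 1::2] = (low - high) / np.sqrt(2.0)
    return out


signal = np.array([4.0, 2.0, 5.0, 5.0])
low, high = haar_dwt_1d(signal)
restored = haar_idwt_1d(low, high)
print(np.allclose(restored, signal))
```

Because the transform is orthonormal, passing the detail coefficients to the inverse step reconstructs the input exactly, which is the property the abstract relies on when it pairs DWT down-sampling with IDWT up-sampling.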
Recent weakly-supervised semantic segmentation (WSSS) has made remarkable progress due to class-wise localization techniques using image-level labels. Meanwhile, weakly-supervised instance segmentation (WSIS) is a more challenging task because instance-wise localization using only image-level labels is quite difficult. Consequently, most WSIS approaches exploit an off-the-shelf proposal technique that requires pre-training with high-level labels, deviating from a fully image-level supervised setting. Moreover, we focus on the semantic drift problem, i.e., missing instances in pseudo instance labels are categorized as the background class, causing confusion between background and instance in training. To this end, we propose a novel approach that consists of two innovative components. First, we design a semantic knowledge transfer to obtain pseudo instance labels by transferring the knowledge of WSSS to WSIS while eliminating the need for off-the-shelf proposals. Second, we propose a self-refinement method that refines the pseudo instance labels in a self-supervised scheme and uses them for training in an online manner while resolving the semantic drift problem. Extensive experiments demonstrate the effectiveness of our approach, and we outperform existing works on PASCAL VOC2012 without any off-the-shelf proposal techniques. Furthermore, our approach can be easily applied to the point-supervised setting, boosting the performance with an economical annotation cost. The code will be available soon.
Score-based generative models (SGMs) have demonstrated remarkable synthesis quality. SGMs rely on a diffusion process that gradually perturbs the data towards a tractable distribution, while the generative model learns to denoise. The complexity of this denoising task is, apart from the data distribution itself, uniquely determined by the diffusion process. We argue that current SGMs employ overly simplistic diffusions, leading to unnecessarily complex denoising processes, which limit generative modeling performance. Based on connections to statistical mechanics, we propose a novel critically-damped Langevin diffusion (CLD) and show that CLD-based SGMs achieve superior performance. CLD can be interpreted as running a joint diffusion in an extended space, where the auxiliary variables can be considered "velocities" that are coupled to the data variables as in Hamiltonian dynamics. We derive a novel score matching objective for CLD and show that the model only needs to learn the score function of the conditional distribution of the velocity given data, an easier task than learning scores of the data directly. We also derive a new sampling scheme for efficient synthesis from CLD-based diffusion models. We find that CLD outperforms previous SGMs in synthesis quality for similar network architectures and sampling compute budgets. We show that our novel sampler for CLD significantly outperforms solvers such as Euler–Maruyama. Our framework provides new insights into score-based denoising diffusion models and can be readily used for high-resolution image synthesis. Project page and code: https://nv-tlabs.github.io/CLD-SGM.
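To make the "joint diffusion in an extended space" concrete, the sketch below simulates a forward damped Langevin diffusion on data and velocity variables with a plain Euler–Maruyama scheme, i.e. the kind of baseline solver the abstract mentions, not the new sampler it proposes. The abstract does not give the equations, so the drift and noise terms here are an assumption: they follow the standard underdamped Langevin form with a quadratic potential, and the constants (`mass=0.25`, `friction=1.0`, `beta=1.0`) are hypothetical values chosen so that friction squared equals four times the mass, the usual critical-damping condition.

```python
import numpy as np


def simulate_damped_langevin(x0, v0, mass=0.25, friction=1.0, beta=1.0,
                             t_end=5.0, dt=0.005, seed=0):
    """Euler-Maruyama simulation of an underdamped Langevin diffusion.

    Assumed dynamics (standard form, not taken from the abstract):
        dx = (v / mass) * beta * dt
        dv = (-x - friction * v / mass) * beta * dt
             + sqrt(2 * friction * beta) * dW
    Noise enters only through the velocity v; the data variable x is
    perturbed indirectly through its coupling to v.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    v = np.array(v0, dtype=float)
    n_steps = int(round(t_end / dt))
    noise_scale = np.sqrt(2.0 * friction * beta * dt)
    for _ in range(n_steps):
        dx = (v / mass) * beta * dt
        dv = (-x - friction * v / mass) * beta * dt
        dv = dv + noise_scale * rng.standard_normal(v.shape)
        x, v = x + dx, v + dv
    return x, v


# Start many particles at the same "data" value with zero velocity.
x_T, v_T = simulate_damped_langevin(np.full(20000, 2.0), np.zeros(20000))
print(x_T.mean(), x_T.var(), v_T.var())
```

Under these assumptions the particles relax towards a Gaussian with unit variance in the data variable and variance equal to the mass in the velocity, which is what makes the terminal distribution tractable.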
Many existing region-of-attraction (ROA) analysis tools have difficulty handling feedback systems with large-scale neural network (NN) policies and/or high-dimensional sensing modalities such as cameras. In this paper, we tailor the projected gradient descent (PGD) attack method developed in the adversarial learning community as a general-purpose ROA analysis tool for large-scale nonlinear systems and end-to-end perception-based control. We show that the ROA analysis can be approximated as a constrained maximization problem whose goal is to find the worst-case initial condition which shifts the terminal state the most. Then we present two PGD-based iterative methods which can be used to solve the resultant constrained maximization problem. Our analysis is not based on Lyapunov theory, and hence requires minimal information about the problem structure. In the model-based setting, we show that the PGD updates can be efficiently performed using back-propagation. In the model-free setting (which is more relevant to ROA analysis of perception-based control), we propose a finite-difference PGD estimate which is general and only requires a black-box simulator for generating the trajectories of the closed-loop system given any initial state. We demonstrate the scalability and generality of our analysis tool on several numerical examples with large-scale NN policies and high-dimensional image observations. We believe that our proposed analysis serves as a meaningful initial step toward further understanding of closed-loop stability of large-scale nonlinear systems and perception-based control.
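The model-free variant described above can be sketched as follows: a black-box simulator returns the terminal state for a given initial state, a central finite difference estimates the gradient of the terminal-state norm, and each ascent step is projected back onto the set of admissible initial conditions. Everything concrete here is an assumption made for illustration: the toy dynamics in `simulate`, the L2 ball as the constraint set, the normalised step, and all function names are hypothetical and are not the paper's implementation.

```python
import numpy as np


def simulate(x0, steps=50, dt=0.1):
    """Hypothetical black-box closed-loop simulator (toy dynamics).

    Each coordinate follows x' = -x + x**3, integrated with Euler steps;
    the origin is stable for |x| < 1.
    """
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x + dt * (-x + x ** 3)
    return x


def terminal_cost(x0):
    """Squared distance of the terminal state from the equilibrium."""
    x_final = simulate(x0)
    return float(np.dot(x_final, x_final))


def fd_gradient(f, x, eps=1e-4):
    """Central finite-difference gradient using only evaluations of f."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return grad


def project_l2_ball(x, radius):
    """Project x onto the L2 ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)


def fd_pgd(f, x_init, radius, step=0.05, iters=100):
    """Projected gradient ascent on f over the ball ||x0|| <= radius."""
    x = project_l2_ball(np.array(x_init, dtype=float), radius)
    for _ in range(iters):
        grad = fd_gradient(f, x)
        norm = np.linalg.norm(grad)
        if norm == 0.0:
            break
        x = project_l2_ball(x + step * grad / norm, radius)
    return x, f(x)


x_start = np.array([0.1, -0.05])
x_worst, cost_worst = fd_pgd(terminal_cost, x_start, radius=0.5)
print(x_worst, cost_worst)
```

If the worst-case terminal cost found for a given radius stays close to zero, the search found no initial condition in that ball that escapes the equilibrium. This is evidence rather than a certificate, since PGD is a local search.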
Medical Visual Question Answering (VQA) is a combination of medical artificial intelligence and popular VQA challenges. Given a medical image and a clinically relevant question in natural language, the medical VQA system is expected to predict a plausible and convincing answer. Although general-domain VQA has been extensively studied, medical VQA still needs specific investigation and exploration because of its particular task features. In the first part of this survey, we cover and discuss the medical VQA datasets publicly available to date in terms of data source, data quantity, and task features. In the second part, we review the approaches used in medical VQA tasks. In the last part, we analyze some medical-specific challenges for the field and discuss future research directions.
Recent deep learning-based methods have reconstructed a high dynamic range (HDR) image from a single low dynamic range (LDR) image by focusing on the exposure transfer task to reconstruct the multi-exposure stack. However, these methods often fail to fuse the multi-exposure stack into a perceptually pleasant HDR image because local inversion artifacts are formed in the HDR imaging (HDRI) process. The artifacts arise from the impossibility of learning the whole HDRI process, owing to the non-differentiable structure of the camera response recovery. Therefore, we tackle the major challenge in stack reconstruction-based methods by proposing a novel framework with a fully differentiable HDRI process. Our framework enables a neural network to learn HDR image generation end to end. Hence, a deep neural network can learn the precise correlations between multi-exposure images in the HDRI process using our differentiable HDR synthesis layer. In addition, our network uses image decomposition and a recursive process to facilitate the exposure transfer task and to adaptively respond to recursion frequency. The experimental results show that the proposed network outperforms state-of-the-art methods both quantitatively and qualitatively in terms of both the exposure transfer tasks and the whole HDRI process.
Joint representation of geometry, colour and semantics using a 3D neural field enables accurate dense labelling from ultra-sparse interactions as a user reconstructs a scene in real-time using a handheld RGB-D sensor. Our iLabel system requires no training data, yet can densely label scenes more accurately than standard methods trained on large, expensively labelled image datasets. Furthermore, it works in an 'open set' manner, with semantic classes defined on the fly by the user. iLabel's underlying model is a multilayer perceptron (MLP) trained from scratch in real-time to learn a joint neural scene representation. The scene model is updated and visualised in real-time, allowing the user to focus interactions to achieve efficient labelling. A room or similar scene can be accurately labelled into 10+ semantic categories with only a few tens of clicks. Quantitative labelling accuracy scales strongly with the number of clicks and rapidly surpasses standard pre-trained semantic segmentation methods. We also demonstrate a hierarchical labelling variant.
With deep learning image classification becoming more powerful each year, its introduction to disaster response is likely to increase the efficiency with which responders can work. Using several neural network models, including AlexNet, ResNet, MobileNet, DenseNets, and a 4-layer CNN, we classified flood disaster images from a large image data set with up to 79% accuracy. Our models and tutorials for working with the data set provide a foundation for others to classify the other types of disasters contained in the images.
Super-resolution is a classical problem in the field of image restoration. In recent years, deep learning methods have achieved significant success in super-resolution by concentrating on elaborate network designs that exploit image features more effectively. However, most networks increase depth or width to gain capacity, which requires a large number of parameters, incurs high computational complexity, and seldom considers the inherent correlation between different features. This paper proposes a progressive multi-scale residual network (PMRN) for the single image super-resolution problem that sequentially exploits features with a restricted number of parameters. Specifically, we design a progressive multi-scale residual block (PMRB) to progressively explore multi-scale features with different layer combinations, aiming to consider the correlations between different scales. The combinations for feature exploitation are defined recursively to introduce non-linearity and obtain a better feature representation with limited parameters. Furthermore, we investigate a joint channel-wise and pixel-wise attention mechanism, termed CPA, for comprehensive correlation exploration; it is used in PMRB and considers both scale and bias factors for features in parallel. Experimental results show that the proposed PMRN recovers structural textures more effectively, with better PSNR/SSIM results, than other lightweight works. The extended model PMRN+ with self-ensemble achieves competitive or better results than large networks with far fewer parameters and lower computational complexity.
Pose estimation is the task of locating keypoints for an object of interest in an image. Animal pose estimation is more challenging than human pose estimation because of the high inter- and intra-class variability in animals. Existing works solve this problem for a fixed set of predefined animal categories. Models trained on such sets usually do not work well with new animal categories. Retraining the model on new categories makes the model overfit and leads to catastrophic forgetting. Thus, in this work, we propose a novel problem of "Incremental Learning for Animal Pose Estimation". Our method uses an exemplar memory, sampled using Determinantal Point Processes (DPP), to continually adapt to new animal categories without forgetting the old ones. We further propose a new variant of k-DPP that uses an RBF kernel (termed "RBF k-DPP"), which gives a larger performance gain than the traditional k-DPP. Due to memory constraints, the limited number of exemplars along with new class data can lead to class imbalance. We mitigate it by performing image warping as an augmentation technique. This helps in crafting diverse poses, which reduces overfitting and yields further improvement in performance. The efficacy of our proposed approach is demonstrated via extensive experiments and ablations where we obtain significant improvements over state-of-the-art baseline methods.
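The exemplar-selection idea above can be illustrated with a small sketch: build an RBF kernel over per-sample feature vectors, then choose k exemplars whose kernel submatrix has a large determinant, which favours diverse samples. Note the substitution: this uses a greedy determinant-maximising approximation, not exact k-DPP sampling, and it is not the paper's "RBF k-DPP". The feature vectors, the `gamma` value, and the function names are all hypothetical.

```python
import numpy as np


def rbf_kernel(features, gamma=0.05):
    """RBF similarity matrix K[i, j] = exp(-gamma * ||f_i - f_j||^2)."""
    sq_norms = np.sum(features ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def greedy_dpp_select(kernel, k):
    """Greedily pick k items that maximise the log-determinant of the
    selected kernel submatrix (an approximation, not exact k-DPP sampling)."""
    n = kernel.shape[0]
    selected = []
    for _ in range(k):
        best_item, best_logdet = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:
            break  # no remaining item keeps the submatrix non-singular
        selected.append(best_item)
    return selected


# Hypothetical 2D features: two near-duplicate pairs and one isolated sample.
features = np.array([
    [0.0, 0.0], [0.01, 0.0],
    [5.0, 5.0], [5.01, 5.0],
    [10.0, 0.0],
])
exemplars = greedy_dpp_select(rbf_kernel(features), k=3)
print(exemplars)
```

Near-duplicate samples make the submatrix almost singular, so the selection avoids keeping both members of a near-duplicate pair, which is the diversity property that motivates DPP-based exemplar memories.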