Deep learning is a data-hungry approach that requires massive amounts of training data. However, it is time-consuming and labor-intensive to collect abundant fully-annotated training data for all categories. Assuming the existence of base categories with adequate fully-annotated training samples, different paradigms requiring fewer training samples or weaker annotations for novel categories have attracted growing research interest. Among them, zero-shot (resp., few-shot) learning uses zero (resp., a few) training samples for novel categories, which lowers the quantity requirement for novel categories. In contrast, weak-shot learning lowers the quality requirement for novel categories: sufficient training samples are collected for novel categories, but they only have weak annotations. In different tasks, weak annotations take different forms (e.g., noisy labels for image classification, image-level labels for object detection, bounding boxes for segmentation), similar to the definitions in weakly supervised learning. Therefore, weak-shot learning can also be treated as weakly supervised learning with auxiliary fully supervised categories. In this paper, we discuss existing weak-shot learning methodologies across different tasks and summarize the code at https://github.com/bcmi/Awesome-Weak-Shot-Learning.
We propose an enhanced multi-scale network, dubbed GridDehazeNet+, for single image dehazing. It consists of three modules: pre-processing, backbone, and post-processing. The trainable pre-processing module can generate learned inputs with better diversity and more pertinent features compared to the derived inputs produced by hand-selected pre-processing methods. The backbone module implements multi-scale estimation with two major enhancements: 1) a novel grid structure that effectively alleviates the bottleneck issue via dense connections across different scales; 2) a spatial-channel attention block that facilitates adaptive fusion by consolidating dehazing-relevant features. The post-processing module helps to reduce artifacts in the final output. To alleviate the domain shift between network training and testing, we convert synthetic data to so-called translated data whose distribution is shaped to match that of real data. Moreover, to further improve dehazing performance in real-world scenarios, we propose a novel intra-task knowledge transfer mechanism that leverages knowledge distilled from synthetic data to assist the learning process on translated data. Experimental results indicate that the proposed GridDehazeNet+ outperforms state-of-the-art methods on several dehazing benchmarks. The proposed dehazing method does not rely on the atmospheric scattering model, and we provide a possible explanation as to why it is not necessarily beneficial to take advantage of the dimension reduction offered by this model, even when only the dehazing results on synthetic images are considered.
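As a concrete illustration of the attention component described above, the following PyTorch sketch shows one plausible spatial-channel attention block for adaptive feature fusion. The layer sizes, reduction ratio, and gating order are our assumptions for illustration, not the authors' exact GridDehazeNet+ design.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Hypothetical spatial-channel attention block: channel gating followed
    by spatial gating. A sketch of the general idea, not the paper's module."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, produce per-channel weights.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: a single-channel map over spatial positions.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)   # reweight channels
        x = x * self.spatial_gate(x)   # reweight spatial positions
        return x
```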
Ultrasound (US) imaging often suffers from distinct image artifacts arising from various sources. Classic approaches for solving these problems are usually model-based iterative methods developed specifically for each type of artifact, which are often computationally intensive. Recently, deep learning approaches have been proposed as computationally efficient, high-performance alternatives. Unfortunately, in current deep learning approaches, a dedicated neural network must be trained with matched training data for each specific artifact type. This poses a fundamental limitation on the practical use of deep learning for US, since a large number of models must be stored to deal with various US image artifacts. Inspired by the recent success of multi-domain image transfer, here we propose a novel unsupervised deep learning approach in which a single neural network can deal with different types of US artifacts simply by changing a mask vector that switches between different target domains. Our algorithm is rigorously derived using optimal transport (OT) theory for cascaded probability measures. Experimental results using phantom and in vivo data demonstrate that the proposed method can generate high-quality images by removing distinct artifacts, with results comparable to those obtained by separately trained multiple neural networks.
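To make the mask-vector mechanism concrete, the sketch below shows one common way (in the style of multi-domain transfer networks such as StarGAN) to condition a single generator on a target-domain mask: the one-hot mask is broadcast spatially and concatenated with the input as extra channels. The network body and all names are placeholders, not the authors' architecture.

```python
import torch
import torch.nn as nn

class MaskConditionedGenerator(nn.Module):
    """Sketch: a single network switched between target domains by a mask
    vector concatenated to the input (assumed design, for illustration)."""
    def __init__(self, in_ch: int = 1, num_domains: int = 3, feat: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch + num_domains, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, in_ch, 3, padding=1),
        )

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Broadcast the one-hot mask (B, num_domains) over H x W and
        # concatenate it with the image as extra input channels.
        b, _, h, w = x.shape
        m = mask.view(b, -1, 1, 1).expand(-1, -1, h, w)
        return self.net(torch.cat([x, m], dim=1))

# Usage: selecting target domain 2 out of 3 for a batch of 4 images.
net = MaskConditionedGenerator()
out = net(torch.randn(4, 1, 64, 64), torch.eye(3)[torch.tensor([2, 2, 2, 2])])
```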
Self-attention (SA) networks have shown profound value in image captioning. In this paper, we improve SA in two respects to boost image captioning performance. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization was previously only applied outside SA, we introduce a novel normalization method and demonstrate that it is both possible and beneficial to perform it on the hidden activations inside SA. Second, to compensate for a major limitation of the Transformer, namely its inability to model the geometric structure of the input objects, we propose a class of Geometry-aware Self-Attention (GSA) that extends SA to explicitly and efficiently consider the relative geometric relations between the objects in the image. To construct our image captioning model, we combine the two modules and apply them to the vanilla self-attention network. We extensively evaluate our proposals on the MS-COCO image captioning dataset and achieve superior results compared to state-of-the-art approaches. Further experiments on three challenging tasks, i.e., video captioning, machine translation, and visual question answering, show the generality of our methods.
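The following single-head sketch illustrates the general idea of performing normalization on hidden activations inside self-attention; normalizing the queries with instance normalization is one concrete instantiation we assume here for illustration, and not necessarily the exact NSA formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormalizedSelfAttention(nn.Module):
    """Single-head sketch: normalization applied to the query activations
    *inside* the attention computation (assumed instantiation)."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.norm = nn.InstanceNorm1d(dim, affine=True)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D)
        q = self.to_q(x)
        # InstanceNorm1d expects (B, D, N), so transpose around the norm.
        q = self.norm(q.transpose(1, 2)).transpose(1, 2)
        k, v = self.to_k(x), self.to_v(x)
        attn = F.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)
        return attn @ v
```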
Semi-supervised video action recognition aims to enable deep neural networks to achieve remarkable performance even with very limited labeled data. However, existing methods are mainly adapted from image-based methods (e.g., FixMatch). Without specifically utilizing the temporal dynamics and inherent multimodal attributes of video, their results can be suboptimal. To better leverage the temporal information encoded in videos, in this paper we introduce the temporal gradient as an additional modality for more attentive feature extraction. Specifically, our method explicitly distills fine-grained motion representations from the temporal gradient (TG) and imposes consistency across the different modalities (i.e., RGB and TG). The performance of semi-supervised action recognition is significantly improved without additional computation or parameters during inference. Our method achieves state-of-the-art performance on three video action recognition benchmarks (i.e., Kinetics-400, UCF-101, and HMDB-51) under several typical semi-supervised settings (i.e., different ratios of labeled data).
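The temporal gradient modality itself is simple to construct: it can be computed as frame differences of the RGB clip, after which a consistency loss ties the RGB and TG feature spaces together. The sketch below uses plain frame differencing and a cosine consistency loss; both are generic choices we assume for illustration, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_gradient(clip: torch.Tensor) -> torch.Tensor:
    """Frame-difference 'temporal gradient' for a clip of shape (B, C, T, H, W).
    The last difference is repeated so the output keeps T frames."""
    tg = clip[:, :, 1:] - clip[:, :, :-1]
    return torch.cat([tg, tg[:, :, -1:]], dim=2)

def cross_modal_consistency(feat_rgb: torch.Tensor, feat_tg: torch.Tensor) -> torch.Tensor:
    """A simple cosine-similarity consistency loss between RGB and TG features
    (one common choice; the paper's loss may differ)."""
    return 1.0 - F.cosine_similarity(feat_rgb, feat_tg, dim=-1).mean()
```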
Medical image segmentation can provide a reliable basis for further clinical analysis and disease diagnosis. Its performance has been significantly advanced by convolutional neural networks (CNNs). However, most existing CNN-based methods often produce unsatisfactory segmentation masks without accurate object boundaries, owing to limited context information and inadequately discriminative feature maps after consecutive pooling and convolution operations. Because medical images are characterized by high intra-class variation, inter-class indistinction, and noise, extracting powerful context and aggregating discriminative features for fine-grained segmentation remain challenging. In this paper, we formulate a boundary-aware context neural network (BA-Net) for 2D medical image segmentation to capture richer context and preserve fine spatial information. BA-Net adopts an encoder-decoder architecture. In each stage of the encoder network, a pyramid edge extraction module first obtains edge information at multiple granularities. We then design a mini multi-task learning module that jointly learns to segment object masks and detect lesion boundaries. In particular, a new interactive attention mechanism is proposed to bridge the two tasks and achieve information complementarity, effectively leveraging boundary information to offer a strong cue for better segmentation prediction. Finally, a cross-feature fusion module selectively aggregates multi-level features from the whole encoder network. By cascading these three modules, richer context and fine-grained features of each stage are encoded. Extensive experiments on five datasets show that the proposed BA-Net outperforms state-of-the-art approaches.
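To illustrate how an attention bridge between the two tasks might look, the sketch below gates each branch's features with a map computed from the other branch, so that boundary evidence sharpens mask features and vice versa. The gating design is our assumption for illustration, not BA-Net's exact interactive attention.

```python
import torch
import torch.nn as nn

class InteractiveAttention(nn.Module):
    """Sketch of a two-way attention bridge between a segmentation branch and
    a boundary branch (assumed gating design)."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate_from_boundary = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_from_mask = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, mask_feat: torch.Tensor, boundary_feat: torch.Tensor):
        # Each branch is residually reweighted by a gate from the other branch.
        new_mask = mask_feat + mask_feat * self.gate_from_boundary(boundary_feat)
        new_boundary = boundary_feat + boundary_feat * self.gate_from_mask(mask_feat)
        return new_mask, new_boundary
```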
Image interpolation is of high importance in the medical domain, as most 3D biomedical volume images are sampled with a distance between consecutive slices significantly greater than the in-plane pixel size, owing to constraints on radiation dose or scanning time. Image interpolation creates a number of new slices between known slices in order to obtain an isotropic volume image, and the results can be used for higher-quality 3D reconstruction and visualization of human body structures. Semantic interpolation on the manifold has proved very useful for smooth image interpolation. Nevertheless, all previous methods focused on low-resolution image interpolation, and most of them perform poorly on high-resolution images. We propose a novel network, the High Resolution Interpolation Network (HRINet), aimed at producing high-resolution CT image interpolations. We combine the ideas of ACAI and GANs and propose a novel alternating supervision method that applies supervised and unsupervised training alternately, raising the accuracy of human organ structures in CT while keeping high image quality. We compare MSE-based and perceptual-loss-based optimization methods for high-quality interpolation and show the tradeoff between structural correctness and sharpness. Our experiments show great improvement on 256^2 and 512^2 images, both quantitatively and qualitatively.
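A minimal sketch of the alternating supervision idea is shown below: even steps take a supervised regression step on known slice triplets, while odd steps take an unsupervised adversarial step, in the spirit of ACAI/GAN training. The tiny networks, stand-in data, and the omission of the critic's own update are simplifications for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Placeholder interpolator and critic (assumed, for illustration only).
model = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 3, padding=1))
critic = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                       nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = nn.MSELoss()

for step in range(100):
    # Stand-in slice triplet; real training would load neighboring CT slices.
    left, mid, right = (torch.randn(4, 1, 64, 64) for _ in range(3))
    pred = model(torch.cat([left, right], dim=1))
    if step % 2 == 0:
        loss = mse(pred, mid)        # supervised step on known middle slices
    else:
        loss = -critic(pred).mean()  # unsupervised adversarial step
        # (the critic's own training step is omitted for brevity)
    opt.zero_grad(); loss.backward(); opt.step()
```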
In this work, we explore a multi-task learning (MTL) based network to perform document attribute classification, such as font type, font size, font emphasis, and scanning resolution classification, for a document image. To accomplish these tasks, we operate either on segmented word images or on uniformly sized patches randomly cropped from the document. Furthermore, a hybrid convolutional neural network (CNN) architecture, "MTL+MI", based on the combination of MTL and multi-instance (MI) learning over patches and words, is used to jointly learn the classification of the same document attributes. The contributions of this paper are threefold. First, based on segmented word images and patches, we present an MTL-based network for the classification of a full document image. Second, we propose a combined MTL and MI (using segmented words and patches) CNN architecture ("MTL+MI") for the classification of the same document attributes. Third, based on the multi-task classifications of the words and/or patches, we propose an intelligent voting system, based on the posterior probabilities of the words and/or patches, to classify the attributes of the complete document image.
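As an illustration of the voting step, the sketch below aggregates per-word/per-patch posteriors into a document-level decision by sum-rule voting (averaging the posteriors and taking the argmax). The averaging rule is a simple assumption; the paper's voting scheme may weight the contributions differently.

```python
import numpy as np

def vote_document_attribute(posteriors: np.ndarray) -> int:
    """posteriors: (num_words_or_patches, num_classes) softmax outputs.
    Returns the document-level class by averaging posteriors (sum rule)."""
    return int(np.argmax(posteriors.mean(axis=0)))

# Usage: three patches voting over four classes.
p = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.4, 0.4, 0.1, 0.1],
              [0.2, 0.6, 0.1, 0.1]])
print(vote_document_attribute(p))  # -> 0 (mean posterior ~0.43 for class 0)
```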
Convolutional neural networks (CNNs) have been widely used for hyperspectral image classification. As a common process, small cubes are first cropped from the hyperspectral image and then fed into CNNs to extract spectral and spatial features. It is well known that different spectral bands and spatial positions in the cubes have different discriminative abilities; if fully explored, this prior information can help improve the learning capacity of CNNs. Along this direction, we propose an attention-aided CNN model for spectral-spatial classification of hyperspectral images. Specifically, a spectral attention sub-network and a spatial attention sub-network are proposed for spectral and spatial classification, respectively. Both are based on the traditional CNN model and incorporate attention modules to help the networks focus on more discriminative channels or positions. In the final classification phase, the spectral and spatial classification results are combined via an adaptively weighted summation method. To evaluate the effectiveness of the proposed model, we conduct experiments on three standard hyperspectral datasets. The experimental results show that the proposed model achieves superior performance compared to several state-of-the-art CNN-related models.
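One simple way to realize the adaptively weighted summation of the two classification results is a learnable fusion weight, as in the sketch below; a single learnable scalar is our assumed instantiation for illustration, not necessarily the paper's exact scheme.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Sketch: adaptively weighted summation of spectral and spatial logits
    with a single learnable weight (assumed design)."""
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(0.0))

    def forward(self, spectral_logits: torch.Tensor, spatial_logits: torch.Tensor) -> torch.Tensor:
        a = torch.sigmoid(self.w)  # keep the fusion weight in (0, 1)
        return a * spectral_logits + (1.0 - a) * spatial_logits
```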
Deep learning based methods hold state-of-the-art results in low-level image processing tasks but remain difficult to interpret due to their black-box construction. Unrolled optimization networks present an interpretable alternative, deriving their architecture from classical iterative optimization methods without tricks from the standard deep learning toolbox. So far, such methods have demonstrated performance close to that of state-of-the-art models while using their interpretable construction to achieve a comparably low learned parameter count. In this work, we propose an unrolled convolutional dictionary learning network (CDLNet) and demonstrate its competitive denoising and joint denoising and demosaicing (JDD) performance in both low and high parameter-count regimes. Specifically, we show that the proposed model outperforms state-of-the-art fully convolutional denoising and JDD models when scaled to a similar parameter count. In addition, we leverage the model's interpretable construction to propose a noise-adaptive parameterization of the thresholds in the network, which enables state-of-the-art blind denoising performance and near-perfect generalization to noise levels unseen during training. Furthermore, we show that this performance extends to the JDD task and to unsupervised learning.
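The noise-adaptive threshold idea can be made concrete with soft-thresholding, the proximal operator that appears when unrolling sparse-coding iterations: learned base thresholds are scaled by the (estimated) noise level. The linear scaling below is one plausible instantiation of the parameterization described above, assumed for illustration.

```python
import torch

def soft_threshold(x: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return torch.sign(x) * torch.relu(x.abs() - tau)

def noise_adaptive_threshold(tau0: torch.Tensor, sigma: float) -> torch.Tensor:
    """Sketch: scale learned base thresholds tau0 by the noise level sigma,
    so one trained network adapts across noise levels (assumed linear form)."""
    return tau0 * sigma

# Usage: per-channel thresholds adapted to an estimated noise level of 25/255.
tau0 = torch.full((16,), 0.1)
z = torch.randn(1, 16, 8, 8)
z_out = soft_threshold(z, noise_adaptive_threshold(tau0, 25 / 255).view(1, -1, 1, 1))
```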