Distinct scientific theories can make similar predictions. To adjudicate between theories, we must design experiments for which the theories make distinct predictions. Here we consider the problem of comparing deep neural networks as models of human visual recognition. To efficiently determine which models better explain human responses, we synthesize controversial stimuli: images for which different models produce distinct responses. We tested nine different models, which employed different architectures and recognition algorithms, including discriminative and generative models, all trained to recognize handwritten digits (from the MNIST set of digit images). We synthesized controversial stimuli to maximize the disagreement among the models. Human subjects viewed hundreds of these stimuli and judged the probability of presence of each digit in each image. We quantified how accurately each model predicted the human judgements. We found that the generative models (which learn the distribution of images for each class) better predicted the human judgments than the discriminative models (which learn to directly map from images to labels). The best performing model was the generative Analysis-by-Synthesis model (based on variational autoencoders). However, a simpler generative model (based on Gaussian-kernel-density estimation) also performed better than each of the discriminative models. None of the candidate models fully explained the human responses. We discuss the advantages and limitations of controversial stimuli as an experimental paradigm and how they generalize and improve on adversarial examples as probes of discrepancies between models and human perception.
Despite the remarkable success of generative adversarial networks, their performance seems less impressive for diverse training sets, requiring learning of discontinuous mapping functions. Though multi-mode prior or multi-generator models have been proposed to alleviate this problem, such approaches may fail depending on the empirically chosen initial mode components. In contrast to such bottom-up approaches, we present GAN-Tree, which follows a hierarchical divisive strategy to address such discontinuous multi-modal data. Devoid of any assumption on the number of modes, GAN-Tree utilizes a novel mode-splitting algorithm to effectively split the parent mode to semantically cohesive children modes, facilitating unsupervised clustering. Further, it also enables incremental addition of new data modes to an already trained GAN-Tree, by updating only a single branch of the tree structure. As compared to prior approaches, the proposed framework offers a higher degree of flexibility in choosing a large variety of mutually exclusive and exhaustive tree nodes called GAN-Set. Extensive experiments on synthetic and natural image datasets including ImageNet demonstrate the superiority of GAN-Tree against the prior state-of-the-arts.
In this paper, we examine the strength of deep learning technique for diagnosing lung cancer on medical image analysis problem. Convolutional neural networks (CNNs) models become popular among the pattern recognition and computer vision research area because of their promising outcome on generating high-level image representations. We propose a new deep learning architecture for learning high-level image representation to achieve high classification accuracy with low variance in medical image binary classification tasks. We aim to learn discriminant compact features at beginning of our deep convolutional neural network. We evaluate our model on Kaggle Data Science Bowl 2017 (KDSB17) data set, and compare it with some related works proposed in the Kaggle competition.
Deep neural networks frequently contain far more weights, represented at a higher precision, than are required for the specific task which they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both model size and inference time without appreciable loss in accuracy. Compressing models before they are deployed can therefore result in significantly more efficient systems. However, while the results are desirable, finding the best compression strategy for a given neural network, target platform, and optimization objective often requires extensive experimentation. Moreover, finding optimal hyperparameters for a given compression strategy typically results in even more expensive, frequently manual, trial-and-error exploration. In this paper, we introduce a programmable system for model compression called Condensa. Users programmatically compose simple operators, in Python, to build complex compression strategies. Given a strategy and a user-provided objective, such as minimization of running time, Condensa uses a novel sample-efficient constrained Bayesian optimization algorithm to automatically infer desirable sparsity ratios. Our experiments on three real-world image classification and language modeling tasks demonstrate memory footprint reductions of up to 65x and runtime throughput improvements of up to 2.22x using at most 10 samples per search. We have released a reference implementation of Condensa at https://github.com/NVlabs/condensa.
Real-time scene parsing is a fundamental feature for autonomous driving vehicles with multiple cameras. Comparing with traditional methods which individually process the frames from each camera, in this letter we demonstrate that sharing semantics between cameras with overlapped views can boost the parsing performance. Our framework is based on a deep neural network for semantic segmentation but with two kinds of additional modules for sharing and fusing semantics. On one hand, a semantics sharing module is designed to establish the pixel-wise mapping between the input image pair. Features as well as semantics are shared by the map to reduce duplicated workload which leads to more efficient computation. On the other hand, feature fusion modules are designed to combine different modal of semantic features, which learns to leverage the information from both inputs for better results. To evaluate the effectiveness of the proposed framework, we collect a new dataset with a dual-camera vision system for driving scene parsing. Experimental results show that our network outperforms the baseline method on the parsing accuracy with comparable computations.
Existing visual explanation generating agents learn to fluently justify a class prediction. However, they may mention visual attributes which reflect a strong class prior, although the evidence may not actually be in the image. This is particularly concerning as ultimately such agents fail in building trust with human users. To overcome this limitation, we propose a phrase-critic model to refine generated candidate explanations augmented with flipped phrases which we use as negative examples while training. At inference time, our phrase-critic model takes an image and a candidate explanation as input and outputs a score indicating how well the candidate explanation is grounded in the image. Our explainable AI agent is capable of providing counter arguments for an alternative prediction, i.e. counterfactuals, along with explanations that justify the correct classification decisions. Our model improves the textual explanation quality of fine-grained classification decisions on the CUB dataset by mentioning phrases that are grounded in the image. Moreover, on the FOIL tasks, our agent detects when there is a mistake in the sentence, grounds the incorrect phrase and corrects it significantly better than other models.
Restoring a sharp light field image from its blurry input has become essential due to the increasing popularity of parallax-based image processing. State-of-the-art blind light field deblurring methods suffer from several issues such as slow processing, reduced spatial size, and a limited motion blur model. In this work, we address these challenging problems by generating a complex blurry light field dataset and proposing a learning-based deblurring approach. In particular, we model the full 6-degree of freedom (6-DOF) light field camera motion, which is used to create the blurry dataset using a combination of real light fields captured with a Lytro Illum camera, and synthetic light field renderings of 3D scenes. Furthermore, we propose a light field deblurring network that is built with the capability of large receptive fields. We also introduce a simple strategy of angular sampling to train on the large-scale blurry light field effectively. We evaluate our method through both quantitative and qualitative measurements and demonstrate superior performance compared to the state-of-the-art method with a massive speedup in execution time. Our method is about 16K times faster than Srinivasan et. al. [22] and can deblur a full-resolution light field in less than 2 seconds.
Digital medical images are always displayed scaled to fit particular view. Interpolation is responsible for this scaling, and if not done properly, can significantly degrade diagnostic image quality. However, theoretically-optimal interpolation algorithms may also be the most time-consuming and impractical. We propose a new approach, adapted to the needs of digital medical imaging, to combine high interpolation speed and superior L2-optimal image quality.
In this paper, we address the problem of weakly supervised object localization (WSL), which trains a detection network on the dataset with only image-level annotations. The proposed approach is built on the observation that the proposal set from the training dataset is a collection of background, object parts, and objects. Several strategies are taken to adaptively eliminate the noisy proposals and generate pseudo object-level annotations for the weakly labeled dataset. A multiple instance learning (MIL) algorithm enhanced by mask-out strategy is adopted to collect the class-specific object proposals, which are then utilized to adapt a pre-trained classification network to a detection network. In addition, the detection results from the detection network are re-weighted by jointly considering the detection scores and the overlap ratio of proposals in a proposal subset optimization framework. The optimal proposals work as object-level labels that enable a pseudo-strongly supervised dataset for training the detection network. Consequently, we establish a fully adaptive detection network. Extensive evaluations on the PASCAL VOC 2007 and 2012 datasets demonstrate a significant improvement compared with the state-of-the-art methods.
Given image labels as the only supervisory signal, we focus on harvesting, or mining, thoracic disease localizations from chest X-ray images. Harvesting such localizations from existing datasets allows for the creation of improved data sources for computer-aided diagnosis and retrospective analyses. We train a convolutional neural network (CNN) for image classification and propose an attention mining (AM) strategy to improve the model's sensitivity or saliency to disease patterns. The intuition of AM is that once the most salient disease area is blocked or hidden from the CNN model, it will pay attention to alternative image regions, while still attempting to make correct predictions. However, the model requires to be properly constrained during AM, otherwise, it may overfit to uncorrelated image parts and forget the valuable knowledge that it has learned from the original image classification task. To alleviate such side effects, we then design a knowledge preservation (KP) loss, which minimizes the discrepancy between responses for X-ray images from the original and the updated networks. Furthermore, we modify the CNN model to include multi-scale aggregation (MSA), improving its localization ability on small-scale disease findings, e.g., lung nodules. We experimentally validate our method on the publicly-available ChestX-ray14 dataset, outperforming a class activation map (CAM)-based approach, and demonstrating the value of our novel framework for mining disease locations.