Textual distractors in current multi-choice VQA datasets are not challenging enough for state-of-the-art neural models. To better assess whether well-trained VQA models are vulnerable to potential attacks such as more challenging distractors, we introduce a novel task called \textit{textual Distractors Generation for VQA} (DG-VQA). The goal of DG-VQA is to generate the most confusing distractors for multi-choice VQA tasks, where each instance is represented as a tuple of image, question, and correct answer. Such distractors expose the vulnerability of neural models. We show that distractor generation can be formulated as a Markov Decision Process, and present a reinforcement learning solution that produces distractors in an unsupervised manner. Our solution addresses the lack of large annotated corpora that limits classical distractor generation methods. Our proposed model receives reward signals from well-trained multi-choice VQA models and updates its parameters via policy gradient. The empirical results show that the generated textual distractors can successfully confuse several cutting-edge models, with an average 20% accuracy drop from around 64%. Furthermore, we conduct extra adversarial training to improve the robustness of VQA models by incorporating the generated distractors. The experiment validates the effectiveness of adversarial training by showing a performance improvement of 27% for the multi-choice VQA task.
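The reward-driven update described above can be illustrated with a minimal sketch. It assumes a drastically simplified setting that is not the paper's model: the policy is a softmax over a fixed list of candidate distractors (a one-step decision instead of token-by-token generation), and `reward_fn` is a hypothetical stand-in for the signal from a well-trained VQA model, returning 1.0 when the model is fooled. The function name, learning rate, and moving-average baseline are all illustrative choices.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def train_distractor_policy(candidates, reward_fn, steps=500, lr=0.5, seed=0):
    """REINFORCE over a softmax policy that picks one candidate distractor.

    reward_fn is a hypothetical stand-in for the VQA model's feedback.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    baseline = 0.0  # moving average of rewards, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(candidates)), weights=probs)[0]
        reward = reward_fn(candidates[action])
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # Gradient of log pi(action) w.r.t. logit i is 1[i == action] - probs[i].
        for i in range(len(logits)):
            indicator = 1.0 if i == action else 0.0
            logits[i] += lr * advantage * (indicator - probs[i])
    return softmax(logits)
```

In this toy setting the policy concentrates its probability mass on whichever candidate the stand-in reward favors; a sequence-generating policy would apply the same update to the log-probability of each generated token.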
Most current research in visual recognition using Convolutional Neural Networks (CNNs) follows the "deeper model with deeper confidence" belief to gain higher recognition accuracy. At the same time, a deeper model brings heavier computation. On the other hand, for a large share of recognition challenges, a system can classify images correctly using simple models, or so-called shallow networks. Moreover, the implementation of CNNs faces size, weight, and energy constraints on embedded devices. In this paper, we implement adaptive switching between shallow and deep networks to reach the highest throughput on a resource-constrained MPSoC with CPU and FPGA. To this end, we develop and present a novel architecture for CNNs in which a gate decides whether using the deeper model is beneficial. Due to resource limitations on the FPGA, partial reconfiguration is used to accommodate deep CNNs on the FPGA resources. We report experimental results on the CIFAR-10, CIFAR-100, and SVHN datasets to validate our approach. Using a confidence metric as the decision-making factor, only 69.8%, 71.8%, and 43.8% of the computation in the deepest network is done for CIFAR-10, CIFAR-100, and SVHN, respectively, while maintaining the desired accuracy with a throughput of around 400 images per second for the SVHN dataset.
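The gating idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's architecture: the confidence metric is taken to be the maximum softmax probability of the shallow network, the threshold value is arbitrary, and `shallow` and `deep` are hypothetical callables that return class logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_predict(x, shallow, deep, threshold=0.9):
    """Run the shallow model first; fall back to the deep model only
    when the shallow model's confidence is below the threshold."""
    probs = softmax(shallow(x))
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), "shallow"
    deep_probs = softmax(deep(x))
    return deep_probs.index(max(deep_probs)), "deep"
```

Raising the threshold sends more inputs to the deep network, trading throughput for accuracy; lowering it does the opposite.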
Deep learning based data-driven approaches have been successfully applied in various image understanding applications, ranging from object recognition and semantic segmentation to visual question answering. However, the lack of knowledge integration and higher-level reasoning capabilities in these methods still poses a hindrance. In this work, we present a brief survey of a few representative reasoning mechanisms, knowledge integration methods, and their corresponding image understanding applications, developed by various groups of researchers approaching the problem from a variety of angles. Furthermore, we discuss key efforts on integrating external knowledge with neural networks. Taking cues from these efforts, we conclude by discussing potential pathways to improve reasoning capabilities.
The ability to identify changes or transformations in a scene, and to reason about their causes and effects, is a key aspect of intelligence. In this work we go beyond recent advances in computational perception and introduce a more challenging task, Image-based Event-Sequencing (IES). In IES, the task is to predict a sequence of actions required to rearrange objects from the configuration in an input source image to the one in the target image. IES also requires systems to possess inductive generalizability. Motivated by evidence from cognitive development, we compile the first IES dataset, the Blocksworld Image Reasoning Dataset (BIRD), which contains images of wooden blocks in different configurations and the sequence of moves to rearrange one configuration into the other. We first explore the use of existing deep learning architectures and show that these end-to-end methods under-perform in inferring temporal event-sequences and fail at inductive generalization. We then propose a modular two-step approach, Visual Perception followed by Event-Sequencing, and demonstrate improved performance by combining learning and reasoning. Finally, by showing an extension of our approach to natural images, we seek to pave the way for future research on event sequencing for real-world scenes.
Confocal laser endomicroscopy (CLE) allows on-the-fly in vivo intraoperative imaging in a discrete field of view, especially for brain tumors, rather than extracting tissue for examination ex vivo with conventional light microscopy. Fluorescein sodium-driven CLE imaging is more interactive, rapid, and portable than conventional hematoxylin and eosin (H&E) staining. However, it has several limitations: CLE images may be contaminated with artifacts (motion, red blood cells, noise), and neuropathologists are mainly trained on colorful stained histology slides such as H&E, whereas CLE images are grayscale. To improve the diagnostic quality of CLE, we used a micrograph of an H&E slide from a glioma tumor biopsy and image style transfer, a neural network method for integrating the content and style of two images. This was done by minimizing the deviation of the target image from both the content (CLE) and style (H&E) images. The style-transferred images were assessed and compared to conventional H&E histology by neurosurgeons and a neuropathologist, who then validated the quality enhancement in 100 pairs of original and transformed images. The reviewers' average scores on test images showed that 84 out of 100 transformed images had fewer artifacts and more noticeable critical structures compared to their original CLE form. By providing images that are more interpretable than the original CLE images and more rapidly acquired than H&E slides, the style transfer method allows a real-time, cellular-level tissue examination using CLE technology that closely resembles the conventional appearance of H&E staining and may yield better diagnostic recognition than original CLE grayscale images.
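The objective described above, minimizing the deviation of the target image from both the content and style images, can be sketched as a weighted sum of two losses. The sketch below assumes the common formulation in which style is compared through Gram matrices of feature channels; the abstract does not spell out the exact loss, so the Gram-matrix choice, the weights `alpha` and `beta`, and all function names are assumptions. Feature maps are passed in directly as lists of channels, whereas a real pipeline would extract them from layers of a convolutional network.

```python
def gram(features):
    """Gram matrix of channel activations: G[i][j] = dot(F_i, F_j)."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]


def flatten(rows):
    """Flatten a list of lists into a single list."""
    return [v for row in rows for v in row]


def mse(xs, ys):
    """Mean squared difference between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def style_transfer_loss(target, content, style, alpha=1.0, beta=1.0):
    """Weighted sum of content deviation and style (Gram) deviation."""
    content_loss = mse(flatten(target), flatten(content))
    style_loss = mse(flatten(gram(target)), flatten(gram(style)))
    return alpha * content_loss + beta * style_loss
```

In an optimization loop, the target features would be updated by gradient descent on this loss, with the ratio of `alpha` to `beta` controlling how much CLE structure is preserved relative to H&E appearance.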
Given a mapped environment, we formulate the problem of visually tracking and following an evader using a probabilistic framework. In this work, we consider a non-holonomic robot with a limited-visibility depth sensor in an indoor environment with obstacles. The mobile robot that follows the target is considered a pursuer and the agent being followed is considered an evader. We propose a probabilistic framework for both the pursuer and evader to achieve their conflicting goals. We introduce a smart evader that has information about the location of the pursuer. The goal of this variant of the evader is to avoid being tracked by the pursuer by using the visibility region information obtained from the pursuer, to further challenge the proposed smart pursuer. To validate the efficiency of the framework, we conduct several experiments in simulation using Gazebo and evaluate the success rate of tracking an evader in various environments with different pursuer-to-evader speed ratios. Through our experiments we validate our hypothesis that a smart pursuer tracks an evader more effectively than a pursuer that just navigates the environment randomly. We also validate that an evader that is aware of the actions of the pursuer is more successful at avoiding being tracked by a smart pursuer than a random evader. Finally, we empirically show that while a smart pursuer does increase its average success rate of tracking compared to a random pursuer, there is an increased variance in its success rate distribution when the evader becomes aware of its actions.
Computer Vision applications often require a textual grounding module with precision, interpretability, and resilience to counterfactual inputs/queries. To achieve high grounding precision, current textual grounding methods rely heavily on large-scale training data with manual annotations at the pixel level. Such annotations are expensive to obtain and thus severely narrow the model's scope of real-world applications. Moreover, most of these methods sacrifice interpretability and generalizability, and neglect the importance of being resilient to counterfactual inputs. To address these issues, we propose a visual grounding system which is 1) end-to-end trainable in a weakly supervised fashion with only image-level annotations, and 2) counterfactually resilient owing to its modular design. Specifically, we decompose textual descriptions into three levels (entity, semantic attribute, and color information) and perform compositional grounding progressively. We validate our model through a series of experiments and demonstrate its improvement over state-of-the-art methods. In particular, our model not only surpasses other weakly supervised and unsupervised methods and approaches the strongly supervised ones, but is also interpretable for decision making and performs much better in the face of counterfactual classes than all the others.
We study the problem of learning a generalizable action policy for an intelligent agent to actively approach an object of interest in an indoor environment solely from its visual inputs. While scene-driven or recognition-driven visual navigation has been widely studied, prior efforts suffer severely from limited generalization capability. In this paper, we first argue that the object-searching task is environment-dependent while the approaching ability is general. To learn a generalizable approaching policy, we present a novel solution dubbed GAPLE, which adopts two channels of visual features, depth and semantic segmentation, as the inputs to the policy learning module. Empirical studies conducted on the House3D dataset as well as on a physical platform in a real-world scenario validate our hypothesis, and we further provide in-depth qualitative analysis.
Methods based on deep neural networks have been shown to achieve outstanding performance on object detection and classification tasks. Despite significant performance improvements, their deep structures still require prohibitive runtime to process images while maintaining the highest possible performance for real-time applications. Observing that the human vision system (HVS) relies heavily on the temporal dependencies among frames of the visual input to conduct recognition efficiently, we propose a novel framework dubbed TKD: temporal knowledge distillation. This framework distills the temporal knowledge from a heavy neural-network-based model over selected video frames (the perception of the moments) to a light-weight model. To enable the distillation, we put forward two novel procedures: 1) a Long Short-Term Memory (LSTM) based key frame selection method; and 2) a novel teacher-bounded loss design. To validate the framework, we conduct comprehensive empirical evaluations using different object detection methods over multiple datasets, including Youtube-Objects and the Hollywood scene dataset. Our results show consistent improvement in accuracy-speed trade-offs for object detection over the frames of the dynamic scene, compared to other modern object recognition methods.
We demonstrate in this paper that a generative model can be designed to perform classification tasks under challenging settings, including adversarial attacks and input distribution shifts. Specifically, we propose a conditional variational autoencoder that learns both the decomposition of inputs and the distributions of the resulting components. At test time, we jointly optimize the latent variables of the generator and the relaxed component labels to find the best match between the given input and the output of the generator. The model demonstrates promising performance at recognizing overlapping components from the multiMNIST dataset, and novel component combinations from a traffic sign dataset. Experiments also show that the proposed model achieves high robustness on the MNIST and NORB datasets, in particular against high-strength gradient attacks and non-gradient attacks.