The vulnerability of neural networks to adversarial attacks has raised serious concerns and motivated extensive research. Recent studies suggested that model robustness relies on the use of robust features, i.e., features that correlate strongly with labels, and that data dimensionality and distribution affect the learning of robust features. On the other hand, experiments showed that human vision, which is robust against adversarial attacks, is invariant to natural input transformations. Drawing on these findings, this paper investigates whether constraints on transformation invariance, including invariance to image cropping, rotation, and zooming, will force image classifiers to learn and use robust features and in turn acquire better robustness. Experiments on MNIST and CIFAR10 show that transformation invariance alone has limited effect. Nonetheless, models adversarially trained on cropping-invariant attacks, in particular, can (1) extract more robust features, (2) achieve significantly better robustness than state-of-the-art adversarially trained models, and (3) require less training data.
While intelligence of autonomous vehicles (AVs) has significantly advanced in recent years, accidents involving AVs suggest that these autonomous systems lack gracefulness in driving when interacting with human drivers. In the setting of a two-player game, we propose model predictive control based on social gracefulness, which is measured by the discrepancy between the actions taken by the AV and those that could have been taken in favor of the human driver. We define social awareness as the ability of an agent to infer such favorable actions based on knowledge about the other agent's intent, and further show that empathy, i.e., the ability to understand others' intent by simultaneously inferring others' understanding of the agent's self intent, is critical to successful intent inference. Lastly, through an intersection case, we show that the proposed gracefulness objective allows an AV to learn more sophisticated behavior, such as passive-aggressive motions that gently force the other agent to yield.
For tasks involving language and vision, the current state-of-the-art methods tend not to leverage any additional information that might be present to gather relevant (commonsense) knowledge. A representative task is Visual Question Answering, where large diagnostic datasets have been proposed to test a system's capability of answering questions about images. The training data is often accompanied by annotations of individual object properties and spatial locations. In this work, we take a step towards integrating this additional privileged information in the form of spatial knowledge to aid in visual reasoning. We propose a framework that combines recent advances in knowledge distillation (teacher-student framework), relational reasoning and probabilistic logical languages to incorporate such knowledge in existing neural networks for the task of Visual Question Answering. Specifically, for a question posed against an image, we use a probabilistic logical language to encode the spatial knowledge and the spatial understanding about the question in the form of a mask that is directly provided to the teacher network. The student network learns from the ground-truth information as well as the teacher's prediction via distillation. We also demonstrate the impact of predicting such a mask inside the teacher's network using attention. Empirically, we show that both methods improve the test accuracy over a state-of-the-art approach on a publicly available dataset.
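To make the teacher-student distillation step in the abstract above more concrete, the following is a minimal sketch of a generic distillation objective, in which the student is trained on both the ground-truth label and the teacher's softened prediction. The function names, the weighting `alpha`, and the `temperature` are illustrative assumptions and are not taken from the paper; the paper's actual loss may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      alpha=0.5, temperature=2.0):
    """Weighted sum of a hard-label term and a soft-target term.

    alpha and temperature are hypothetical hyperparameters chosen
    for illustration only.
    """
    # Hard term: cross-entropy of the student against the ground-truth label.
    student_probs = softmax(student_logits)
    hard = -math.log(student_probs[true_label])

    # Soft term: cross-entropy between the teacher's and the student's
    # temperature-softened distributions.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft = -sum(t * math.log(s) for t, s in zip(teacher_soft, student_soft))

    return alpha * hard + (1.0 - alpha) * soft


# Example call with made-up logits for a three-way answer vocabulary.
loss = distillation_loss([1.0, 0.2, -0.5], [2.0, 0.1, -1.0], true_label=0)
```

Setting `alpha=1.0` reduces the objective to ordinary supervised training, while smaller values give the teacher's prediction more influence.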
The seminal work of Gatys et al. demonstrated the power of Convolutional Neural Networks (CNNs) in creating artistic imagery by separating and recombining image content and style. This process of using CNNs to render a content image in different styles is referred to as Neural Style Transfer (NST). Since then, NST has become a trending topic both in academic literature and industrial applications. It is receiving increasing attention, and a variety of approaches have been proposed to either improve or extend the original NST algorithm. In this paper, we aim to provide a comprehensive overview of the current progress towards NST. We first propose a taxonomy of current algorithms in the field of NST. Then, we present several evaluation methods and compare different NST algorithms both qualitatively and quantitatively. The review concludes with a discussion of various applications of NST and open problems for future research. A list of papers discussed in this review, corresponding code, pre-trained models and more comparison results are publicly available at https://github.com/ycjing/Neural-Style-Transfer-Papers.
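As background for the separation of content and style mentioned in the abstract above: in the original algorithm of Gatys et al., style is summarized by Gram matrices of CNN feature maps, and the style loss compares the Gram matrices of the generated and style images. The sketch below illustrates that idea on plain Python lists; the normalization constants and the function names are assumptions made for illustration, and a real implementation would operate on feature tensors from a pre-trained network.

```python
def gram_matrix(features):
    """Gram matrix of a feature map.

    features: list of C channels, each a flat list of N activations.
    Returns a C x C matrix of channel inner products divided by N
    (the normalization is an illustrative choice).
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
        for fi in features
    ]


def style_loss(generated_features, style_features):
    """Mean squared difference between the two Gram matrices."""
    g = gram_matrix(generated_features)
    s = gram_matrix(style_features)
    c = len(g)
    return sum(
        (g[i][j] - s[i][j]) ** 2 for i in range(c) for j in range(c)
    ) / (c * c)


# Toy feature maps with two channels and two spatial positions each.
style = [[1.0, 2.0], [3.0, 4.0]]
generated = [[1.0, 0.0], [0.0, 1.0]]
loss = style_loss(generated, style)
```

Because the Gram matrix discards spatial arrangement and keeps only channel co-activation statistics, it captures texture-like style information rather than image layout.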
The Fast Style Transfer methods have been recently proposed to transfer a photograph to an artistic style in real-time. This task involves controlling the stroke size in the stylized results, which remains an open challenge. In this paper, we present a stroke controllable style transfer network that can achieve continuous and spatial stroke size control. By analyzing the factors that influence the stroke size, we propose to explicitly account for the receptive field and the style image scales. We propose a StrokePyramid module to endow the network with adaptive receptive fields, and two training strategies to achieve faster convergence and augment new stroke sizes upon a trained model respectively. By combining the proposed runtime control strategies, our network can achieve continuous changes in stroke sizes and produce distinct stroke sizes in different spatial regions within the same output image.
We study the problem of learning a generalizable action policy for an intelligent agent to actively approach an object of interest in an indoor environment solely from its visual inputs. While scene-driven or recognition-driven visual navigation has been widely studied, prior efforts suffer severely from limited generalization capability. In this paper, we first argue that the object searching task is environment dependent while the approaching ability is general. To learn a generalizable approaching policy, we present a novel solution dubbed GAPLE, which adopts two channels of visual features, depth and semantic segmentation, as the inputs to the policy learning module. The empirical studies conducted on the House3D dataset as well as on a physical platform in a real-world scenario validate our hypothesis, and we further provide in-depth qualitative analysis.
Confocal laser endomicroscopy (CLE) is an advanced optical fluorescence imaging technology that has the potential to increase intraoperative precision, extend resection, and tailor surgery for malignant invasive brain tumors because of its subcellular dimension resolution. Despite its promising diagnostic potential, interpreting the gray tone fluorescence images can be difficult for untrained users. In this review, we provide a detailed description of bioinformatical analysis methodology of CLE images that begins to assist the neurosurgeon and pathologist to rapidly connect on-the-fly intraoperative imaging, pathology, and surgical observation into a conclusionary system within the concept of theranostics. We present an overview and discuss deep learning models for automatic detection of the diagnostic CLE images and discuss various training regimes and ensemble modeling effect on the power of deep learning predictive models. Two major approaches reviewed in this paper include the models that can automatically classify CLE images into diagnostic/nondiagnostic, glioma/nonglioma, tumor/injury/normal categories and models that can localize histological features on the CLE images using weakly supervised methods. We also briefly review advances in the deep learning approaches used for CLE image analysis in other organs. Significant advances in speed and precision of automated diagnostic frame selection would augment the diagnostic potential of CLE, improve operative workflow and integration into brain tumor surgery. Such technology and bioinformatics analytics lend themselves to improved precision, personalization, and theranostics in brain tumor treatment.
The Confocal Laser Endomicroscope (CLE) is a novel handheld fluorescence imaging device that has shown promise for rapid intraoperative diagnosis of brain tumor tissue. Currently, CLE is capable of image display only and lacks an automatic system to aid the surgeon in analyzing the images. The goal of this project was to develop a computer-aided diagnostic approach for CLE imaging of human glioma with a feature localization function. Despite the tremendous progress in object detection and image segmentation methods in recent years, most such methods require large annotated datasets for training. However, manual annotation of thousands of histopathological images by physicians is costly and time-consuming. To overcome this problem, we propose a Weakly-Supervised Learning (WSL)-based model for feature localization that trains on image-level annotations, and then localizes incidences of a class-of-interest in the test image. We developed a novel convolutional neural network for diagnostic feature localization from CLE images by employing a novel multiscale activation map that is laterally inhibited and collaterally integrated. To validate our method, we compared the proposed model's output with the manual annotations performed by four neurosurgeons on test images. The proposed model achieved 88% mean accuracy and 86% mean intersection over union on intermediate features and 87% mean accuracy and 88% mean intersection over union on restrictive fine features, while outperforming the other state-of-the-art methods tested. This system can improve accuracy and efficiency in characterization of CLE images of glioma tissue during surgery, augment the intraoperative decision-making process regarding tumor margin, and affect resection rates.
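The abstract above reports accuracy and intersection over union (IoU) against manual annotations. As a reminder of what these metrics measure, here is a minimal sketch for a single pair of binary masks. The mask representation (lists of rows of 0/1), the function names, and the handling of two empty masks are assumptions for illustration; how the paper averages over images and annotators is not stated in the abstract.

```python
def intersection_over_union(pred_mask, true_mask):
    """IoU of two binary masks given as lists of rows of 0/1 values."""
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


def pixel_accuracy(pred_mask, true_mask):
    """Fraction of pixels on which the two masks agree."""
    agree = 0
    total = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            agree += 1 if bool(p) == bool(t) else 0
            total += 1
    return agree / total


# Toy 2x2 example: the prediction covers one extra pixel.
predicted = [[1, 1], [0, 0]]
annotated = [[1, 0], [0, 0]]
iou = intersection_over_union(predicted, annotated)  # 1 shared / 2 covered = 0.5
acc = pixel_accuracy(predicted, annotated)           # 3 of 4 pixels agree = 0.75
```

IoU penalizes the extra predicted pixel more heavily than pixel accuracy does, which is why localization results are usually reported with both.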
We study the problem of learning a navigation policy for a robot to actively search for an object of interest in an indoor environment solely from its visual inputs. While scene-driven visual navigation has been widely studied, prior efforts on learning navigation policies for robots to find objects are limited. The problem is often more challenging than target scene finding as the target objects can be very small in the view and can be in an arbitrary pose. We approach the problem from an active perceiver perspective, and propose a novel framework that integrates a deep neural network based object recognition module and a deep reinforcement learning based action prediction mechanism. To validate our method, we conduct experiments on both a simulation dataset (AI2-THOR) and a real-world environment with a physical robot. We further propose a new decaying reward function to learn the control policy specific to the object searching task. Experimental results validate the efficacy of our method, which outperforms competing methods in both average trajectory length and success rate.
Intelligent fashion outfit composition has become increasingly popular in recent years. Some deep learning based approaches have recently shown competitive composition results. However, because such approaches are not explainable, they cannot meet the desire of designers, businesses, and consumers to comprehend the importance of different attributes in an outfit composition. To realize interpretable and customized fashion outfit compositions, we propose a partitioned embedding network to learn interpretable representations from clothing items. The overall network architecture consists of three components: an auto-encoder module, a supervised attributes module and a multi-independent module. The auto-encoder module serves to encode all useful information into the embedding. In the supervised attributes module, multiple attribute labels are adopted to ensure that different parts of the overall embedding correspond to different attributes. In the multi-independent module, adversarial operations are adopted to fulfill the mutually independent constraint. With the interpretable and partitioned embedding, we then construct an outfit composition graph and an attribute matching map. Given a specified attribute description, our model can recommend a ranked list of outfit compositions with interpretable matching scores. Extensive experiments demonstrate that 1) the partitioned embedding has unmingled parts that correspond to different attributes and 2) outfits recommended by our model are more desirable than those recommended by existing methods.