Textual distractors in current multi-choice VQA datasets are not challenging enough for state-of-the-art neural models. To better assess whether well-trained VQA models are vulnerable to potential attacks such as more challenging distractors, we introduce a novel task called \textit{textual Distractors Generation for VQA} (DG-VQA). Given an image, a question, and the correct answer, the goal of DG-VQA is to generate the most confusing distractors for the multi-choice VQA task; such distractors expose the vulnerability of neural models. We show that distractor generation can be formulated as a Markov Decision Process, and present a reinforcement learning solution that produces distractors in an unsupervised manner, addressing the lack of large annotated corpora that hampers classical distractor generation methods. The proposed model receives reward signals from well-trained multi-choice VQA models and updates its parameters via policy gradient. Empirical results show that the generated textual distractors successfully confuse several cutting-edge models, causing an average accuracy drop of 20% from around 64%. Furthermore, we conduct adversarial training that incorporates the generated distractors to improve the robustness of VQA models; the experiment validates the effectiveness of adversarial training, showing a performance improvement of 27% on the multi-choice VQA task.
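A minimal sketch of the policy-gradient (REINFORCE-style) update described above, assuming a hypothetical generator `distractor_policy` that samples a distractor with its log-probability and a frozen, pretrained `vqa_model` that scores answer candidates; both interfaces are illustrative, not the paper's actual APIs.

    import torch

    def reinforce_step(distractor_policy, vqa_model, image, question, answer, optimizer):
        # Sample a candidate distractor and its log-probability from the generator.
        distractor, log_prob = distractor_policy.sample(image, question, answer)

        # Reward: how strongly the frozen VQA model prefers the distractor
        # over the correct answer (higher means more confusing).
        with torch.no_grad():
            scores = vqa_model(image, question, [answer, distractor])
            reward = scores[1] - scores[0]

        # REINFORCE: maximize expected reward, i.e. minimize -reward * log_prob.
        loss = -(reward * log_prob)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return reward.item()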
The main purpose of incremental learning is to learn new knowledge without forgetting the knowledge that has been learned before. At present, the main challenge in this area is catastrophic forgetting, namely that a network loses its performance on old tasks after being trained on new ones. In this paper, we introduce an ensemble method for incremental classifiers to alleviate this problem; it is based on the cosine distance between the output feature and a pre-defined center, and allows each task to be preserved in a different network. During training, we use PEDCC-Loss to train the CNN. At test time, the prediction is determined by the cosine distance between the network's latent features and the pre-defined centers. Experimental results on EMNIST and CIFAR100 show that our method outperforms the recent LwF method, which uses knowledge distillation, and the iCaRL method, which keeps some old samples while training on new tasks. Our method achieves the goal of not forgetting old knowledge while learning new classes, and thus better addresses the problem of catastrophic forgetting.
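A minimal sketch of the test-time rule described above, assuming latent features and the pre-defined class centers are stored as rows of two matrices; variable names are illustrative.

    import torch
    import torch.nn.functional as F

    def predict_by_cosine(features, centers):
        """features: (N, d) latent vectors; centers: (C, d) pre-defined class centers.
        Returns, for each sample, the index of the center with the largest cosine
        similarity (i.e. the smallest cosine distance)."""
        features = F.normalize(features, dim=1)
        centers = F.normalize(centers, dim=1)
        similarity = features @ centers.t()   # (N, C) cosine similarities
        return similarity.argmax(dim=1)       # predicted class per sample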
With the development of convolutional neural networks (CNNs) in recent years, network structures have become increasingly complex and varied, achieving very good results in pattern recognition, image classification, object detection, and tracking. For CNNs used for image classification, in addition to the network structure, more and more research now focuses on improving the loss function so as to enlarge inter-class feature differences and reduce intra-class feature variations as much as possible. Besides the traditional Softmax, typical loss functions include L-Softmax, AM-Softmax, ArcFace, and Center loss. Based on the concept of predefined evenly-distributed class centroids (PEDCC) in the CSAE network, this paper proposes a PEDCC-based loss function called PEDCC-Loss, which makes the inter-class distance maximal and the intra-class distance small enough in the hidden feature space. Multiple experiments on image classification and face recognition show that our method achieves the best recognition accuracy, and that network training is stable and easy to converge. Code is available at https://github.com/ZLeopard/PEDCC-Loss
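A rough sketch of a loss in the spirit described above, under the assumption that it can be read as a cross-entropy over scaled cosine similarities between embeddings and fixed, evenly-distributed centroids; the exact formulation of PEDCC-Loss in the paper and repository may differ.

    import torch
    import torch.nn.functional as F

    def pedcc_style_loss(features, labels, centroids, scale=30.0):
        """features: (N, d) network embeddings; labels: (N,) class indices;
        centroids: (C, d) fixed, evenly-distributed class centers (not trainable).
        Cross-entropy over scaled cosine similarities pulls each feature toward
        its own centroid and away from the others."""
        features = F.normalize(features, dim=1)
        centroids = F.normalize(centroids, dim=1)
        logits = scale * features @ centroids.t()   # (N, C)
        return F.cross_entropy(logits, labels)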
We study the problem of learning a generalizable action policy for an intelligent agent to actively approach an object of interest in an indoor environment solely from its visual inputs. While scene-driven or recognition-driven visual navigation has been widely studied, prior efforts suffer severely from limited generalization capability. In this paper, we first argue that the object searching task is environment dependent while the approaching ability is general. To learn a generalizable approaching policy, we present a novel solution dubbed GAPLE, which adopts two channels of visual features, depth and semantic segmentation, as the inputs to the policy learning module. Empirical studies conducted on the House3D dataset as well as on a physical platform in a real-world scenario validate our hypothesis, and we further provide an in-depth qualitative analysis.
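A toy sketch of a policy module that consumes the two visual channels mentioned above (a depth map and a one-hot semantic segmentation map); the layer sizes, number of segmentation classes, and action count are placeholders, not GAPLE's actual architecture.

    import torch
    import torch.nn as nn

    class TwoChannelPolicy(nn.Module):
        """Encodes depth and semantic segmentation separately, then fuses them
        into action logits."""
        def __init__(self, num_classes=40, num_actions=6):
            super().__init__()
            self.depth_enc = nn.Sequential(
                nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten())
            self.seg_enc = nn.Sequential(
                nn.Conv2d(num_classes, 16, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten())
            self.policy = nn.Sequential(
                nn.Linear(2 * 16 * 4 * 4, 128), nn.ReLU(),
                nn.Linear(128, num_actions))

        def forward(self, depth, seg_onehot):
            # depth: (N, 1, H, W); seg_onehot: (N, num_classes, H, W)
            feat = torch.cat([self.depth_enc(depth), self.seg_enc(seg_onehot)], dim=1)
            return self.policy(feat)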
We study the problem of learning a navigation policy for a robot to actively search for an object of interest in an indoor environment solely from its visual inputs. While scene-driven visual navigation has been widely studied, prior efforts on learning navigation policies for robots to find objects are limited. The problem is often more challenging than target scene finding, as the target object can be very small in the view and can be in an arbitrary pose. We approach the problem from an active perceiver perspective and propose a novel framework that integrates a deep neural network based object recognition module and a deep reinforcement learning based action prediction mechanism. To validate our method, we conduct experiments both on a simulation dataset (AI2-THOR) and in a real-world environment with a physical robot. We further propose a new decaying reward function to learn the control policy specific to the object searching task. Experimental results validate the efficacy of our method, which outperforms competing methods in both average trajectory length and success rate.
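One plausible shape for a decaying reward of the kind mentioned above: a terminal bonus that shrinks geometrically with the number of steps taken, plus a small per-step penalty, so that shorter successful searches earn more. This is an illustrative assumption; the exact reward formulation in the paper may differ.

    def decaying_reward(success, step, step_penalty=0.01, decay=0.99, bonus=10.0):
        """Illustrative decaying reward for object search: the terminal bonus
        decays with the number of steps taken; every intermediate step incurs
        a small cost."""
        if success:
            return bonus * (decay ** step)
        return -step_penalty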
The growth in the number of people in a monitored scene may increase the probability of a security threat, which makes crowd counting increasingly important. Most existing approaches estimate the number of pedestrians within a single frame, which leads to temporally inconsistent predictions. This paper, for the first time, introduces a quadratic programming model with network flow constraints to improve the accuracy of crowd counting. First, the foreground of each frame is segmented into groups, each of which contains several pedestrians, and a regression-based map is developed from the relationship between the low-level features of each group and the number of people in it. Second, a directed graph is constructed to model constraints on people's flow, whose vertices represent the groups in each frame and whose arcs represent people moving from one group to another; the people flow can then be viewed as an integer flow in the constructed digraph. Finally, by solving a quadratic programming problem with network flow constraints on the directed graph, we obtain temporally consistent people counts. Experimental results show that the proposed method reduces crowd counting errors and improves accuracy. Moreover, the method can be applied to any state-of-the-art group-based regression counting approach to obtain further improvements.
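A simplified, two-frame sketch of the kind of flow-constrained quadratic program described above, written with cvxpy. The integer-flow requirement is relaxed to non-negative continuous flows, only two consecutive frames are modeled, and all names are illustrative rather than the paper's formulation.

    import cvxpy as cp
    import numpy as np

    def refine_two_frame_counts(est_t, est_t1, arcs):
        """est_t, est_t1: numpy arrays of regression count estimates for the
        groups in frame t and frame t+1; arcs: list of (i, j) pairs saying
        group i in frame t may flow into group j in frame t+1. Picks
        non-negative arc flows so that the implied group counts stay close
        to the regression estimates (relaxed QP, no integrality)."""
        f = cp.Variable(len(arcs), nonneg=True)
        out_mat = np.zeros((len(est_t), len(arcs)))    # flow leaving each frame-t group
        in_mat = np.zeros((len(est_t1), len(arcs)))    # flow entering each frame-t+1 group
        for k, (i, j) in enumerate(arcs):
            out_mat[i, k] = 1.0
            in_mat[j, k] = 1.0
        counts_t = out_mat @ f
        counts_t1 = in_mat @ f
        objective = cp.Minimize(cp.sum_squares(counts_t - est_t)
                                + cp.sum_squares(counts_t1 - est_t1))
        cp.Problem(objective).solve()
        return counts_t.value, counts_t1.value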