Bailin Li

Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding

Dec 14, 2020
Qingxing Cao, Bailin Li, Xiaodan Liang, Keze Wang, Liang Lin


Though beneficial for encouraging Visual Question Answering (VQA) models to discover the underlying knowledge by exploiting input-output correlations beyond the image and text contexts, existing knowledge VQA datasets are mostly annotated by crowdsourcing, e.g., by collecting questions and external reasons from different users via the internet. Beyond the challenge of knowledge reasoning, how to deal with annotator bias also remains unsolved; such bias often leads to superficial, over-fitted correlations between questions and answers. To address this issue, we propose a novel dataset, Knowledge-Routed Visual Question Reasoning, for VQA model evaluation. Considering that a desirable VQA model should correctly perceive the image context, understand the question, and incorporate its learned knowledge, our proposed dataset aims to cut off the shortcut learning exploited by current deep embedding models and push the research boundary of knowledge-based visual question reasoning. Specifically, we generate question-answer pairs based on both the Visual Genome scene graph and an external knowledge base, using controlled programs to disentangle knowledge from other biases. The programs select one or two triplets from the scene graph or knowledge base to require multi-step reasoning, avoid answer ambiguity, and balance the answer distribution. In contrast to existing VQA datasets, we further impose two major constraints on the programs to incorporate knowledge reasoning: i) multiple knowledge triplets can be related to the question, but only one of them relates to the image object, which forces the VQA model to correctly perceive the image instead of guessing the knowledge from the question alone; ii) all questions are based on different knowledge, but the candidate answers are the same for both the training and test sets.

* To appear in TNNLS 2021. 
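To make the program-controlled generation concrete, below is a minimal, self-contained sketch of the kind of pipeline described above, assuming plain (head, relation, tail) triplet lists and a crude question template; the function names and templates are illustrative assumptions, not the released generation code.

```python
# Hedged sketch of program-controlled QA generation: pick a grounded scene-graph
# triplet, optionally route through one knowledge triplet, reject ambiguous
# questions, and balance the answer distribution. Illustrative only.
import random
from collections import Counter

def generate_qa(scene_triplets, kb_triplets, answer_counts, max_per_answer=50):
    """scene_triplets / kb_triplets: lists of (head, relation, tail) tuples."""
    # 1) ground the question in the image via a scene-graph triplet
    subj, rel, obj = random.choice(scene_triplets)

    # 2) route through at most one knowledge triplet for a second reasoning hop
    hops = [t for t in kb_triplets if t[0] == obj]
    fact = random.choice(hops) if hops else None
    answer = fact[2] if fact else obj

    # 3) reject ambiguous questions: the image must support exactly one answer
    if sum(1 for _, r, o in scene_triplets if r == rel and o == obj) != 1:
        return None

    # 4) balance the answer distribution so frequency priors cannot be exploited
    if answer_counts[answer] >= max_per_answer:
        return None
    answer_counts[answer] += 1

    # the surface form would come from question templates; simplified stand-in here
    if fact:
        question = f"What is the thing that the {subj} is {rel} related to via '{fact[1]}'?"
    else:
        question = f"What is the {subj} {rel}?"
    return question, answer

# toy usage
scene = [("girl", "holding", "apple"), ("dog", "near", "girl")]
kb = [("apple", "is a kind of", "fruit")]
print(generate_qa(scene, kb, Counter()))
```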

EagleEye: Fast Sub-net Evaluation for Efficient Neural Network Pruning

Jul 06, 2020
Bailin Li, Bowen Wu, Jiang Su, Guangrun Wang, Liang Lin


Finding the computationally redundant parts of a trained Deep Neural Network (DNN) is the key question that pruning algorithms target. Many algorithms try to predict the performance of pruned sub-nets by introducing various evaluation methods, but these are either inaccurate or too complicated for general application. In this work, we present a pruning method called EagleEye, in which a simple yet efficient evaluation component based on adaptive batch normalization is applied to unveil a strong correlation between different pruned DNN structures and their final converged accuracy. This strong correlation allows us to quickly spot the pruned candidates with the highest potential accuracy without actually fine-tuning them. The module is also general enough to be plugged into existing pruning algorithms and improve them. EagleEye achieves better pruning performance than all of the studied pruning algorithms in our experiments. Concretely, when pruning MobileNet V1 and ResNet-50, EagleEye outperforms all compared methods by up to 3.8%. Even in the more challenging setting of pruning the compact MobileNet V1, EagleEye achieves the highest accuracy of 70.9% with 50% of operations (FLOPs) pruned. All accuracy results are Top-1 ImageNet classification accuracy. Source code and models are available to the open-source community at https://github.com/anonymous47823493/EagleEye .

* Accepted at ECCV 2020 (Oral). Code is available at https://github.com/anonymous47823493/EagleEye 
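A hedged PyTorch sketch of the adaptive-batch-normalization evaluation idea: only the BatchNorm running statistics are recalibrated on a handful of training batches, and accuracy on a held-out subset then ranks the pruned candidates without any fine-tuning. Function and argument names are assumptions for illustration, not the released EagleEye code.

```python
# Minimal adaptive-BN sub-net scoring: refresh BN running statistics with a few
# forward passes, then use held-out accuracy as a proxy for post-fine-tune accuracy.
import torch

@torch.no_grad()
def adaptive_bn_score(pruned_net, train_loader, val_loader, bn_batches=50, device="cpu"):
    pruned_net.to(device).train()          # train mode so BN layers update running stats
    for i, (images, _) in enumerate(train_loader):
        if i >= bn_batches:
            break
        pruned_net(images.to(device))      # forward passes only; no weight updates

    pruned_net.eval()
    correct = total = 0
    for images, labels in val_loader:
        preds = pruned_net(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.size(0)
    return correct / total                 # proxy score used to rank pruned candidates

# candidate selection: keep the sub-net with the best adaptive-BN score, then fine-tune only it
# best = max(candidate_subnets, key=lambda net: adaptive_bn_score(net, train_loader, val_loader))
```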

Explainable High-order Visual Question Reasoning: A New Benchmark and Knowledge-routed Network

Sep 23, 2019
Qingxing Cao, Bailin Li, Xiaodan Liang, Liang Lin


Explanation and high-order reasoning capabilities are crucial for real-world visual question answering with diverse levels of inference complexity (e.g., what is the dog that is near the girl playing with?) and are important for users to understand and diagnose the trustworthiness of the system. Current VQA benchmarks on natural images, with only an accuracy metric, end up pushing models to exploit dataset biases and cannot provide any interpretable justification, which severely hinders advances in high-level question answering. In this work, we propose a new HVQR benchmark for evaluating explainable and high-order visual question reasoning ability, with three distinguishing merits: 1) the questions often contain one or two relationship triplets, which requires the model to perform multistep reasoning to predict plausible answers; 2) we provide an explicit evaluation of the multistep reasoning process, which is constructed from image scene graphs and commonsense knowledge bases; and 3) each relationship triplet in the large-scale knowledge base appears only once among all questions, which poses challenges for existing networks that often overfit the knowledge facts seen in the training set and enforces the models to handle unseen questions and knowledge facts. We also propose a new knowledge-routed modular network (KM-net) that incorporates multistep reasoning over a large knowledge base into visual question reasoning. Extensive dataset analysis and comparisons with existing models on the HVQR benchmark show that our benchmark provides explainable evaluations, comprehensive reasoning requirements, and realistic challenges for VQA systems, as well as our KM-net's superiority in terms of accuracy and explanation ability.
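As a rough illustration of the reasoning the benchmark evaluates (not the KM-net implementation), the sketch below chains one scene-graph triplet with one knowledge-base triplet and returns both the answer and an explainable two-step trace; the (head, relation, tail) layout and query format are assumptions for illustration.

```python
# Two-hop, knowledge-routed lookup with an explicit reasoning trace (illustrative only).
def answer_with_trace(query, scene_triplets, kb_triplets):
    """query: (subject, relation_1, relation_2) asking for the entity two hops away."""
    subject, rel1, rel2 = query
    trace = []

    # hop 1: ground the first relation in the image scene graph
    hop1 = [(h, r, t) for h, r, t in scene_triplets if h == subject and r == rel1]
    if not hop1:
        return None, trace
    trace.append(hop1[0])

    # hop 2: route the intermediate entity through the external knowledge base
    mid = hop1[0][2]
    hop2 = [(h, r, t) for h, r, t in kb_triplets if h == mid and r == rel2]
    if not hop2:
        return None, trace
    trace.append(hop2[0])

    return hop2[0][2], trace   # final answer plus the explainable two-step trace

# toy usage
scene = [("girl", "playing with", "dog")]
kb = [("dog", "is a", "pet")]
print(answer_with_trace(("girl", "playing with", "is a"), scene, kb))
```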


Interpretable Visual Question Answering by Reasoning on Dependency Trees

Sep 06, 2018
Qingxing Cao, Xiaodan Liang, Bailin Li, Liang Lin


Collaborative reasoning for understanding each image-question pair is critical but underexplored for interpretable visual question answering systems. Although recent works have attempted to use explicit compositional processes to assemble the multiple subtasks embedded in a question, their models heavily rely on annotations or handcrafted rules to obtain valid reasoning processes, leading to either heavy workloads or poor performance on compositional reasoning. In this paper, to better align the image and language domains in diverse and unrestricted cases, we propose a novel neural network model that performs global reasoning on a dependency tree parsed from the question; we thus name our model the parse-tree-guided reasoning network (PTGRN). This network consists of three collaborative modules: i) an attention module to exploit local visual evidence for each word parsed from the question, ii) a gated residual composition module to compose the previously mined evidence, and iii) a parse-tree-guided propagation module to pass the mined evidence along the parse tree. PTGRN is thus capable of building an interpretable VQA system that gradually derives image cues following a question-driven, parse-tree reasoning route. Experiments on relational datasets demonstrate the superiority of PTGRN over current state-of-the-art VQA methods, and visualization results highlight the explainable capability of our reasoning system.

* 14 pages, 10 figures. arXiv admin note: text overlap with arXiv:1804.00105 
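As a rough illustration of how the three modules could fit together, the PyTorch sketch below attends visual features per word, recursively gathers evidence from child nodes along the dependency tree, and merges it with a gated residual composition. The tensor shapes, Node structure, and layer choices are assumptions, not the paper's exact architecture.

```python
# Illustrative parse-tree-guided propagation with word-conditioned attention and
# gated residual composition; a sketch, not the PTGRN implementation.
from dataclasses import dataclass, field
import torch
import torch.nn as nn

@dataclass
class Node:
    word_emb: torch.Tensor                 # (dim,) embedding of this word
    children: list = field(default_factory=list)

class TreePropagation(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attend = nn.Linear(2 * dim, 1)      # word-conditioned visual attention
        self.compose = nn.Linear(2 * dim, dim)   # composition of local and child evidence
        self.gate = nn.Linear(2 * dim, dim)      # gate deciding how much to update

    def forward(self, node, image_feats):
        """node: Node; image_feats: (num_regions, dim) visual features."""
        # i) attention module: gather local visual evidence for this word
        scores = self.attend(torch.cat(
            [image_feats, node.word_emb.expand_as(image_feats)], dim=-1)).softmax(dim=0)
        evidence = (scores * image_feats).sum(dim=0)

        # iii) propagation module: recursively collect evidence mined by the children
        child_msgs = [self.forward(c, image_feats) for c in node.children]
        child_sum = torch.stack(child_msgs).sum(dim=0) if child_msgs else torch.zeros_like(evidence)

        # ii) gated residual composition of local and child evidence
        joint = torch.cat([evidence, child_sum], dim=-1)
        gate = torch.sigmoid(self.gate(joint))
        return evidence + gate * torch.tanh(self.compose(joint))
```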