Xiaodan Liang

Dynamic Knowledge Routing Network For Target-Guided Open-Domain Conversation

Mar 06, 2020

Jinghui Qin, Zheng Ye, Jianheng Tang, Xiaodan Liang

Abstract: Target-guided open-domain conversation aims to proactively and naturally guide a dialogue agent or human to achieve specific goals, topics or keywords during open-ended conversations. Existing methods mainly rely on single-turn data-driven learning and a simple target-guided strategy, without considering semantic or factual knowledge relations among candidate topics/keywords. This results in poor transition smoothness and a low success rate. In this work, we adopt a structured approach that controls the intended content of system responses by introducing coarse-grained keywords, attains smooth conversation transition through turn-level supervised learning and knowledge relations between candidate keywords, and drives a conversation towards a specified target with a discourse-level guiding strategy. Specifically, we propose a novel dynamic knowledge routing network (DKRN) which considers semantic knowledge relations among candidate keywords for accurate next-topic prediction for the next discourse. With the help of more accurate keyword prediction, our keyword-augmented response retrieval module can achieve better retrieval performance and more meaningful conversations. Besides, we also propose a novel dual discourse-level target-guided strategy to guide conversations to reach their goals smoothly with a higher success rate. Furthermore, to push the research boundary of target-guided open-domain conversation to better match real-world scenarios, we introduce a new large-scale Chinese target-guided open-domain conversation dataset (more than 900K conversations) crawled from Sina Weibo. Quantitative and human evaluations show our method can produce meaningful and effective target-guided conversations, significantly improving over other state-of-the-art methods by more than 20% in success rate and more than 0.6 in average smoothness score.

* 8 pages, 2 figures, 6 tables, AAAI 2020, fix our model's abbreviation
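
The keyword-routing idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' DKRN implementation: it assumes a hypothetical relation table `RELATED` and a hypothetical score dictionary produced by some keyword predictor, and it only shows the general pattern of masking candidate keywords that have no known relation to the current turn's keywords before normalizing the scores.

```python
import math

# Hypothetical knowledge relations between keywords (illustrative only).
RELATED = {
    "music": {"concert", "guitar"},
    "concert": {"ticket", "music"},
    "guitar": {"music", "band"},
}


def route_next_keyword(current_keywords, scores):
    """Pick the best-scoring keyword among those related to the current ones.

    `scores` maps candidate keyword -> unnormalized score from an assumed
    predictor. Candidates without a known relation are masked out.
    """
    allowed = set()
    for kw in current_keywords:
        allowed |= RELATED.get(kw, set())

    masked = {k: s for k, s in scores.items() if k in allowed}
    if not masked:
        # No relation known: fall back to the unmasked candidates.
        masked = dict(scores)

    # Numerically stable softmax over the surviving candidates.
    top = max(masked.values())
    exps = {k: math.exp(s - top) for k, s in masked.items()}
    total = sum(exps.values())
    probs = {k: v / total for k, v in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs
```

With scores `{"ticket": 1.0, "band": 3.0, "concert": 2.0}` and current keyword `"music"`, only `"concert"` survives the mask, even though `"band"` has the highest raw score.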

ElixirNet: Relation-aware Network Architecture Adaptation for Medical Lesion Detection

Mar 03, 2020

Chenhan Jiang, Shaoju Wang, Hang Xu, Xiaodan Liang, Nong Xiao

Abstract: Most advances in medical lesion detection networks are limited to subtle modifications of the conventional detection networks designed for natural images. However, there exists a vast domain gap between medical images and natural images, where medical image detection often suffers from several domain-specific challenges, such as high lesion/background similarity, dominant tiny lesions, and severe class imbalance. Is a hand-crafted detection network tailored for natural images undoubtedly good enough for a discrepant medical lesion domain? Are there more powerful operations, filters, and sub-networks that better fit the medical lesion detection problem to be discovered? In this paper, we introduce a novel ElixirNet that includes three components: 1) TruncatedRPN balances positive and negative data for false positive reduction; 2) Auto-lesion Block is automatically customized for medical images to incorporate relation-aware operations among region proposals, and leads to more suitable and efficient classification and localization; 3) Relation transfer module incorporates the semantic relationship and transfers the relevant contextual information with an interpretable graph, thus alleviating the problem of lack of annotations for all types of lesions. Experiments on DeepLesion and Kits19 prove the effectiveness of ElixirNet, achieving improvement of both sensitivity and precision over FPN with fewer parameters.

* 7 pages, 5 figures, AAAI 2020

Universal-RCNN: Universal Object Detector via Transferable Graph R-CNN

Feb 18, 2020

Hang Xu, Linpu Fang, Xiaodan Liang, Wenxiong Kang, Zhenguo Li

Abstract: The dominant object detection approaches treat each dataset separately and fit towards a specific domain, which cannot adapt to other domains without extensive retraining. In this paper, we address the problem of designing a universal object detection model that exploits diverse category granularity from multiple domains and predicts all kinds of categories in one system. Existing works treat this problem by integrating multiple detection branches upon one shared backbone network. However, this paradigm overlooks the crucial semantic correlations between multiple domains, such as category hierarchy, visual similarity, and linguistic relationship. To address these drawbacks, we present a novel universal object detector called Universal-RCNN that incorporates graph transfer learning for propagating relevant semantic information across multiple datasets to reach semantic coherency. Specifically, we first generate a global semantic pool by integrating all high-level semantic representations of all the categories. Then an Intra-Domain Reasoning Module learns and propagates the sparse graph representation within one dataset guided by a spatial-aware GCN. Finally, an Inter-Domain Transfer Module is proposed to exploit diverse transfer dependencies across all domains and enhance the regional feature representation by attending and transferring semantic contexts globally. Extensive experiments demonstrate that the proposed method significantly outperforms multiple-branch models and achieves state-of-the-art results on multiple object detection benchmarks (mAP: 49.1% on COCO).

* Accepted by AAAI20

SM-NAS: Structural-to-Modular Neural Architecture Search for Object Detection

Nov 30, 2019

Lewei Yao, Hang Xu, Wei Zhang, Xiaodan Liang, Zhenguo Li

Abstract: The state-of-the-art object detection method is complicated, with various modules such as backbone, feature fusion neck, RPN and RCNN head, where each module may have different designs and structures. How can the trade-off between computational cost and accuracy be leveraged for the structural combination as well as the modular selection of multiple modules? Neural architecture search (NAS) has shown great potential in finding an optimal solution. Existing NAS works for object detection only focus on searching for a better design of a single module such as backbone or feature fusion neck, while neglecting the balance of the whole system. In this paper, we present a two-stage coarse-to-fine searching strategy named Structural-to-Modular NAS (SM-NAS) for searching a GPU-friendly design of both an efficient combination of modules and better modular-level architecture for object detection. Specifically, the Structural-level searching stage first aims to find an efficient combination of different modules; the Modular-level searching stage then evolves each specific module and pushes the Pareto front forward to a faster task-specific network. We consider a multi-objective search where the search space covers many popular designs of detection methods. We directly search a detection backbone without pre-trained models or any proxy task by exploring a fast train-from-scratch strategy. The resulting architectures dominate state-of-the-art object detection systems in both inference time and accuracy and demonstrate their effectiveness on multiple detection datasets, e.g. halving the inference time with an additional 1% mAP improvement compared to FPN and reaching 46% mAP with inference time similar to that of Mask R-CNN.

* Accepted by AAAI 2020
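
The abstract above frames the search as multi-objective, trading inference time against accuracy along a Pareto front. As a rough illustration of that concept only (not the SM-NAS search procedure itself), the sketch below filters a list of candidate architectures down to the non-dominated ones. The tuple layout `(name, latency_ms, mAP)` and all names are assumptions made for this example.

```python
def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    Each candidate is a tuple (name, latency_ms, accuracy). Candidate B
    dominates A when B is no slower and no less accurate than A, and is
    strictly better in at least one of the two objectives.
    """
    front = []
    for name, latency, acc in candidates:
        dominated = any(
            (l2 <= latency and a2 >= acc) and (l2 < latency or a2 > acc)
            for _, l2, a2 in candidates
        )
        if not dominated:
            front.append((name, latency, acc))
    # Sort by latency so the front reads from fastest to most accurate.
    return sorted(front, key=lambda c: c[1])
```

A search that "pushes the Pareto front forward" would, in these terms, keep proposing candidates that dominate members of the current front.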

Blockwisely Supervised Neural Architecture Search with Knowledge Distillation

Nov 29, 2019

Changlin Li, Jiefeng Peng, Liuchun Yuan, Guangrun Wang, Xiaodan Liang, Liang Lin, Xiaojun Chang

Abstract: Neural Architecture Search (NAS), aiming at automatically designing network architectures by machines, is hoped and expected to bring about a new revolution in machine learning. Despite these high expectations, the effectiveness and efficiency of existing NAS solutions are unclear, with some recent works going so far as to suggest that many existing NAS solutions are no better than random architecture selection. The inefficiency of NAS solutions may be attributed to inaccurate architecture evaluation. Specifically, to speed up NAS, recent works have proposed under-training different candidate architectures in a large search space concurrently by using shared network parameters; however, this has resulted in incorrect architecture ratings and furthered the ineffectiveness of NAS. In this work, we propose to modularize the large search space of NAS into blocks to ensure that the potential candidate architectures are fully trained; this reduces the representation shift caused by the shared parameters and leads to the correct rating of the candidates. Thanks to the block-wise search, we can also evaluate all of the candidate architectures within a block. Moreover, we find that the knowledge of a network model lies not only in the network parameters but also in the network architecture. Therefore, we propose to distill the neural architecture (DNA) knowledge from a teacher model as the supervision to guide our block-wise architecture search, which significantly improves the effectiveness of NAS. Remarkably, the capacity of our searched architecture has exceeded the teacher model, demonstrating the practicability and scalability of our method. Finally, our method achieves a state-of-the-art 78.4% top-1 accuracy on ImageNet in a mobile setting, which is about a 2.1% gain over EfficientNet-B0. All of our searched models along with the evaluation code are available online.

* We achieve a state-of-the-art 78.4% top-1 accuracy on ImageNet in a mobile setting, which is about a 2.1% gain over EfficientNet-B0
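
To make the block-wise supervision idea more concrete, here is a toy sketch under stated assumptions. It assumes each candidate block is rated by the mean-squared error between its output and the teacher's output for the same block, given the teacher's input to that block; the abstract does not specify the loss, so this choice is an assumption. The candidates are plain Python functions standing in for trained sub-networks, and all names are hypothetical.

```python
def mse(a, b):
    """Mean-squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rank_block_candidates(teacher_in, teacher_out, candidates):
    """Rank candidate blocks by how closely they mimic the teacher block.

    `candidates` maps a name to a callable taking the teacher's block input
    and returning a feature vector. Lower loss ranks first.
    """
    losses = {
        name: mse(block(teacher_in), teacher_out)
        for name, block in candidates.items()
    }
    return sorted(losses.items(), key=lambda kv: kv[1])
```

Because each block is rated independently against the teacher, every candidate within a block can be evaluated without training a full network end to end, which is the property the abstract emphasizes.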

Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks

Nov 28, 2019

Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang

Abstract: Vision-Language Navigation (VLN) is a task where agents learn to navigate following natural language instructions. The key to this task is to perceive both the visual scene and natural language sequentially. Conventional approaches exploit the vision and language features in cross-modal grounding. However, the VLN task remains challenging, since previous works have neglected the rich semantic information contained in the environment (such as implicit navigation graphs or sub-trajectory semantics). In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks to take advantage of the additional training signals derived from the semantic information. The auxiliary tasks have four reasoning objectives: explaining the previous actions, estimating the navigation progress, predicting the next orientation, and evaluating the trajectory consistency. As a result, these additional training signals help the agent to acquire knowledge of semantic representations in order to reason about its activity and build a thorough perception of the environment. Our experiments indicate that auxiliary reasoning tasks improve both the performance of the main task and the model generalizability by a large margin. Empirically, we demonstrate that an agent trained with self-supervised auxiliary reasoning tasks substantially outperforms the previous state-of-the-art method, being the best existing approach on the standard benchmark.

Heterogeneous Graph Learning for Visual Commonsense Reasoning

Oct 25, 2019

Weijiang Yu, Jingwen Zhou, Weihao Yu, Xiaodan Liang, Nong Xiao

Abstract: The visual commonsense reasoning task aims at leading the research field into solving cognition-level reasoning with the ability of predicting correct answers while providing convincing reasoning paths, resulting in three sub-tasks, i.e., Q->A, QA->R and Q->AR. It poses great challenges for the proper semantic alignment between vision and linguistic domains and for knowledge reasoning to generate persuasive reasoning paths. Existing works either resort to a powerful end-to-end network that cannot produce interpretable reasoning paths or solely explore the intra-relationship of visual objects (homogeneous graph) while ignoring the cross-domain semantic alignment among visual concepts and linguistic words. In this paper, we propose a new Heterogeneous Graph Learning (HGL) framework for seamlessly integrating the intra-graph and inter-graph reasoning in order to bridge the vision and language domains. Our HGL consists of a primal vision-to-answer heterogeneous graph (VAHG) module and a dual question-to-answer heterogeneous graph (QAHG) module to interactively refine reasoning paths for semantic agreement. Moreover, our HGL integrates a contextual voting module to exploit a long-range visual context for better global reasoning. Experiments on the large-scale Visual Commonsense Reasoning benchmark demonstrate the superior performance of our proposed modules on three tasks (improving 5% accuracy on Q->A, 3.5% on QA->R, 5.8% on Q->AR).

* 11 pages, 5 figures

Layout-Graph Reasoning for Fashion Landmark Detection

Oct 04, 2019

Weijiang Yu, Xiaodan Liang, Ke Gong, Chenhan Jiang, Nong Xiao, Liang Lin

Abstract: Detecting dense landmarks for diverse clothes, as a fundamental technique for clothes analysis, has attracted increasing research attention due to its huge application potential. However, due to the lack of modeling of the underlying semantic layout constraints among landmarks, prior works often detect ambiguous and structure-inconsistent landmarks of multiple overlapped clothes on one person. In this paper, we propose to seamlessly enforce structural layout relationships among landmarks on the intermediate representations via multiple stacked layout-graph reasoning layers. We define the layout-graph as a hierarchical structure including a root node, body-part nodes (e.g. upper body, lower body), coarse clothes-part nodes (e.g. collar, sleeve) and leaf landmark nodes (e.g. left-collar, right-collar). Each Layout-Graph Reasoning (LGR) layer aims to map feature representations into structural graph nodes via a Map-to-Node module, performs reasoning over structural graph nodes to achieve global layout coherency via a layout-graph reasoning module, and then maps graph nodes back to enhance feature representations via a Node-to-Map module. The layout-graph reasoning module integrates a graph clustering operation to generate representations of intermediate nodes (bottom-up inference) and then a graph deconvolution operation (top-down inference) over the whole graph. Extensive experiments on two public fashion landmark datasets demonstrate the superiority of our model. Furthermore, to advance the fine-grained fashion landmark research for supporting more comprehensive clothes generation and attribute recognition, we contribute the first Fine-grained Fashion Landmark Dataset (FFLD) containing 200k images annotated with at most 32 key-points for 13 clothes types.

* 9 pages, 5 figures, CVPR2019

Meta R-CNN : Towards General Solver for Instance-level Low-shot Learning

Sep 28, 2019

Xiaopeng Yan, Ziliang Chen, Anni Xu, Xiaoxi Wang, Xiaodan Liang, Liang Lin

Abstract: Resembling the rapid learning capability of humans, low-shot learning empowers vision systems to understand new concepts by training with few samples. Leading approaches are derived from meta-learning on images with a single visual object. Obfuscated by a complex background and multiple objects in one image, they can hardly promote research on low-shot object detection/segmentation. In this work, we present a flexible and general methodology to achieve these tasks. Our work extends Faster/Mask R-CNN by proposing meta-learning over RoI (Region-of-Interest) features instead of a full image feature. This simple spirit disentangles multi-object information merged with the background, without bells and whistles, enabling Faster/Mask R-CNN to turn into a meta-learner to achieve the tasks. Specifically, we introduce a Predictor-head Remodeling Network (PRN) that shares its main backbone with Faster/Mask R-CNN. PRN receives images containing low-shot objects with their bounding boxes or masks to infer their class attentive vectors. The vectors take channel-wise soft-attention on RoI features, remodeling those R-CNN predictor heads to detect or segment the objects that are consistent with the classes these vectors represent. In our experiments, Meta R-CNN yields the state of the art in low-shot object detection and improves low-shot object segmentation by Mask R-CNN.

* Published in ICCV-2019. Project: https://yanxp.github.io/metarcnn.html
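
The channel-wise soft-attention step described in the abstract above can be sketched in a few lines. This is an illustrative simplification rather than the Meta R-CNN code: it treats an RoI feature as a flat list of per-channel values, and it assumes the class vector is squashed into (0, 1) with a sigmoid before the channel-wise product, a detail the abstract does not state.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(roi_feature, class_vector):
    """Reweight each channel of an RoI feature by a class-specific gate.

    Both arguments are lists with one value per channel. The gate for a
    channel is sigmoid(class_vector[c]); this gating form is an assumption.
    """
    if len(roi_feature) != len(class_vector):
        raise ValueError("channel counts must match")
    return [f * sigmoid(v) for f, v in zip(roi_feature, class_vector)]
```

Applying the same RoI feature to the attentive vectors of several classes would yield one reweighted feature per class, each of which a predictor head could then score.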

Explainable High-order Visual Question Reasoning: A New Benchmark and Knowledge-routed Network

Sep 23, 2019

Qingxing Cao, Bailin Li, Xiaodan Liang, Liang Lin

Abstract: Explanation and high-order reasoning capabilities are crucial for real-world visual question answering with diverse levels of inference complexity (e.g., what is the dog that is near the girl playing with?) and important for users to understand and diagnose the trustworthiness of the system. Current VQA benchmarks on natural images with only an accuracy metric end up pushing the models to exploit the dataset biases and cannot provide any interpretable justification, which severely hinders advances in high-level question answering. In this work, we propose a new HVQR benchmark for evaluating explainable and high-order visual question reasoning ability with three distinguishable merits: 1) the questions often contain one or two relationship triplets, which requires the model to have the ability of multistep reasoning to predict plausible answers; 2) we provide an explicit evaluation on a multistep reasoning process that is constructed with image scene graphs and commonsense knowledge bases; and 3) each relationship triplet in a large-scale knowledge base only appears once among all questions, which poses challenges for existing networks that often attempt to overfit the knowledge base that already appears in the training set and forces the models to handle unseen questions and knowledge fact usage. We also propose a new knowledge-routed modular network (KM-net) that incorporates the multistep reasoning process over a large knowledge base into visual question reasoning. An extensive dataset analysis and comparisons with existing models on the HVQR benchmark show that our benchmark provides explainable evaluations, comprehensive reasoning requirements and realistic challenges of VQA systems, as well as our KM-net's superiority in terms of accuracy and explanation ability.
