Tianshui Chen

Active Object Search

Aug 03, 2020
Jie Wu, Tianshui Chen, Lishan Huang, Hefeng Wu, Guanbin Li, Ling Tian, Liang Lin

In this work, we investigate an Active Object Search (AOS) task that has not been explicitly addressed in the literature; it aims to actively perform as few action steps as possible to search for and locate the target object in a 3D indoor scene. Unlike classic object detection, which passively receives visual information, this task encourages an intelligent agent to perform an active search via reasonable action planning; thus it can better recall target objects, especially in challenging situations where the target is far from the agent, blocked by an obstacle, or out of view. To handle this cross-modal task, we formulate a reinforcement learning framework in which a 3D object detector, a state controller, and a cross-modal action planner work cooperatively to find the target object with minimal action steps. During training, we design a novel cost-sensitive active search reward that penalizes inaccurate object search and redundant action steps. To evaluate this novel task, we construct an Active Object Search (AOS) benchmark that contains 5,845 samples from 30 diverse indoor scenes. We conduct extensive qualitative and quantitative evaluations on this benchmark to demonstrate the effectiveness of the proposed approach and to analyze the key factors that contribute most to addressing this task.
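
As a rough illustration of the cost-sensitive active search reward described above, the sketch below penalizes every extra action step as well as an inaccurate final localization; the IoU-based success test, threshold, and penalty values are illustrative assumptions rather than the paper's exact formulation.

```python
# Hypothetical sketch of a cost-sensitive active search reward: each extra
# action step incurs a small cost, and the episode's final localization is
# rewarded or penalized depending on its accuracy. All constants and the
# IoU-based success test are assumptions for illustration.

def active_search_reward(done: bool, iou: float,
                         iou_threshold: float = 0.5,
                         step_penalty: float = 0.01,
                         success_reward: float = 1.0,
                         failure_penalty: float = 1.0) -> float:
    """Return the reward for one action step of a search episode."""
    if not done:
        # Every intermediate step costs a little, discouraging redundant steps.
        return -step_penalty
    # At termination, reward an accurate localization and penalize a miss.
    return success_reward if iou >= iou_threshold else -failure_penalty
```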

* Accepted at ACM MM 2020 

Adversarial Graph Representation Adaptation for Cross-Domain Facial Expression Recognition

Aug 03, 2020
Yuan Xie, Tianshui Chen, Tao Pu, Hefeng Wu, Liang Lin

Data inconsistency and bias are inevitable among different facial expression recognition (FER) datasets due to subjective annotation processes and different collection conditions. Recent works resort to adversarial mechanisms that learn domain-invariant features to mitigate domain shift. However, most of these works focus on holistic feature adaptation and ignore local features that are more transferable across different datasets. Moreover, local features carry more detailed and discriminative content for expression recognition, so integrating local features may enable fine-grained adaptation. In this work, we propose a novel Adversarial Graph Representation Adaptation (AGRA) framework that unifies graph representation propagation with adversarial learning for cross-domain holistic-local feature co-adaptation. To achieve this, we first build a graph to correlate holistic and local regions within each domain and another graph to correlate these regions across different domains. Then, we learn the per-class statistical distribution of each domain and extract holistic-local features from the input image to initialize the corresponding graph nodes. Finally, we introduce two stacked graph convolution networks to propagate holistic-local features within each domain to explore their interaction and across different domains for holistic-local feature co-adaptation. In this way, the AGRA framework can adaptively learn fine-grained domain-invariant features and thus facilitate cross-domain expression recognition. We conduct extensive and fair experiments on several popular benchmarks and show that the proposed AGRA framework achieves superior performance over previous state-of-the-art methods.
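
The core computational step is message propagation over graphs whose nodes hold holistic and local features. A minimal PyTorch sketch of such a two-stage (intra-domain, then cross-domain) propagation is given below; the node layout, adjacency matrices, and dimensions are placeholder assumptions, not the released AGRA implementation.

```python
# Minimal sketch of stacked graph propagation over holistic-local nodes.
import torch
import torch.nn as nn

class GraphPropagation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.fc = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (num_nodes, dim); adj: (num_nodes, num_nodes), row-normalized.
        return torch.relu(self.fc(adj @ nodes))

dim, n_nodes = 64, 12                     # e.g., 1 holistic + 5 local regions per domain
intra = GraphPropagation(dim)             # propagates within each domain
inter = GraphPropagation(dim)             # propagates across the two domains
nodes = torch.randn(n_nodes, dim)         # holistic-local features of source + target domains
adj_intra = torch.eye(n_nodes)            # placeholder intra-domain adjacency
adj_inter = torch.full((n_nodes, n_nodes), 1.0 / n_nodes)  # placeholder cross-domain adjacency
out = inter(intra(nodes, adj_intra), adj_inter)            # co-adapted node features
```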

* Accepted at ACM MM 2020 

Fine-Grained Image Captioning with Global-Local Discriminative Objective

Jul 21, 2020
Jie Wu, Tianshui Chen, Hefeng Wu, Zhi Yang, Guangchun Luo, Liang Lin

Significant progress has been made in recent years in image captioning, an active topic in the fields of vision and language. However, existing methods tend to yield overly general captions composed of some of the most frequent words/phrases, resulting in inaccurate and indistinguishable descriptions (see Figure 1). This is primarily due to (i) the conservative characteristic of traditional training objectives, which drives the model to generate correct but hardly discriminative captions for similar images, and (ii) the uneven word distribution of the ground-truth captions, which encourages generating highly frequent words/phrases while suppressing the less frequent but more concrete ones. In this work, we propose a novel global-local discriminative objective that is formulated on top of a reference model to facilitate generating fine-grained descriptive captions. Specifically, from a global perspective, we design a novel global discriminative constraint that pulls the generated sentence to better discern the corresponding image from all others in the entire dataset. From the local perspective, a local discriminative constraint is proposed to increase attention on the less frequent but more concrete words/phrases, thus facilitating the generation of captions that better describe the visual details of the given images. We evaluate the proposed method on the widely used MS-COCO dataset, where it outperforms the baseline methods by a sizable margin and achieves competitive performance over existing leading approaches. We also conduct self-retrieval experiments to demonstrate the discriminability of the proposed method.
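
To make the global constraint concrete, the sketch below implements a batch-level stand-in for it: the similarity between a caption embedding and its own image is pushed above its similarities to all other images (a batch approximates "all others in the entire dataset"). The embedding spaces, similarity function, and temperature are assumptions, not the paper's exact objective.

```python
# Contrastive stand-in for the global discriminative constraint.
import torch
import torch.nn.functional as F

def global_discriminative_loss(cap_emb: torch.Tensor,
                               img_emb: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """cap_emb, img_emb: (batch, dim) L2-normalized caption/image embeddings."""
    logits = cap_emb @ img_emb.t() / temperature   # caption-to-image similarities
    targets = torch.arange(cap_emb.size(0))        # each caption should match its own image
    return F.cross_entropy(logits, targets)

cap = F.normalize(torch.randn(8, 256), dim=-1)
img = F.normalize(torch.randn(8, 256), dim=-1)
loss = global_discriminative_loss(cap, img)
```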

* Accepted by TMM 

Efficient Crowd Counting via Structured Knowledge Transfer

Apr 26, 2020
Lingbo Liu, Jiaqi Chen, Hefeng Wu, Tianshui Chen, Guanbin Li, Liang Lin

Crowd counting is an application-oriented task, and its inference efficiency is crucial for real-world applications. However, most previous works rely on heavy backbone networks and require prohibitive run-time consumption, which seriously restricts their deployment scope and causes poor scalability. To liberate these crowd counting models, we propose a novel Structured Knowledge Transfer (SKT) framework, which fully exploits the structured knowledge of a well-trained teacher network to generate a lightweight yet still highly effective student network. Specifically, it integrates two complementary transfer modules: an Intra-Layer Pattern Transfer, which sequentially distills the knowledge embedded in the layer-wise features of the teacher network to guide feature learning of the student network, and an Inter-Layer Relation Transfer, which densely distills the cross-layer correlation knowledge of the teacher to regularize the student's feature evolution. In this way, our student network can derive the layer-wise and cross-layer knowledge from the teacher network to learn compact yet effective features. Extensive evaluations on three benchmarks demonstrate the effectiveness of our SKT for a wide range of crowd counting models. In particular, using only around $6\%$ of the parameters and computation cost of the original models, our distilled VGG-based models obtain at least a 6.5$\times$ speed-up on an Nvidia 1080 GPU while even achieving state-of-the-art performance.
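
A bare-bones sketch of the two transfer terms is shown below, assuming the teacher and student expose lists of intermediate feature maps whose shapes have already been aligned (e.g., via 1x1 projections); the exact distance functions and relation definitions used in SKT may differ.

```python
# Sketch of intra-layer pattern transfer and inter-layer relation transfer.
import torch
import torch.nn.functional as F

def intra_layer_loss(t_feats, s_feats):
    # Align each student feature map with the corresponding teacher map.
    return sum(F.mse_loss(s, t.detach()) for s, t in zip(s_feats, t_feats))

def flow_matrix(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    # Cross-layer channel correlation between two feature maps sharing the same
    # spatial resolution: (N, C1, H, W) x (N, C2, H, W) -> (N, C1, C2).
    h, w = f1.shape[2:]
    return torch.einsum('nchw,ndhw->ncd', f1, f2) / (h * w)

def inter_layer_loss(t_feats, s_feats):
    # Encourage the student's cross-layer correlations to follow the teacher's.
    loss = 0.0
    for i in range(len(t_feats) - 1):
        loss = loss + F.mse_loss(flow_matrix(s_feats[i], s_feats[i + 1]),
                                 flow_matrix(t_feats[i], t_feats[i + 1]).detach())
    return loss
```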

Knowledge Graph Transfer Network for Few-Shot Recognition

Nov 21, 2019
Riquan Chen, Tianshui Chen, Xiaolu Hui, Hefeng Wu, Guanbin Li, Liang Lin

Few-shot learning aims to learn novel categories from very few samples, given some base categories with sufficient training samples. The main challenge of this task is that the novel categories are prone to being dominated by the color, texture, and shape of the object or by the background context (namely, specificity), which are distinctive for the given few training samples but not common to the corresponding categories (see Figure 1). Fortunately, we find that transferring information from the correlated base categories can help learn the novel concepts and thus prevent them from being dominated by the specificity. Besides, incorporating semantic correlations among different categories can effectively regularize this information transfer. In this work, we represent the semantic correlations in the form of a structured knowledge graph and integrate this graph into deep neural networks to promote few-shot learning with a novel Knowledge Graph Transfer Network (KGTN). Specifically, by initializing each node with the classifier weight of the corresponding category, a propagation mechanism is learned to adaptively propagate node messages through the graph to explore node interactions and transfer the classifier information of the base categories to that of the novel ones. Extensive experiments on the ImageNet dataset show significant performance improvement compared with current leading competitors. Furthermore, we construct an ImageNet-6K dataset that covers a larger scale of categories, i.e., 6,000 categories, and experiments on this dataset further demonstrate the effectiveness of our proposed model.
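
A compact sketch of the propagation idea follows: classifier weights of base and novel categories sit on the graph nodes and are refined by passing messages along a semantic-correlation adjacency matrix. The simple linear-plus-residual update rule here is a stand-in for KGTN's learned mechanism, and the adjacency matrix is assumed to be given and row-normalized.

```python
# Sketch of classifier-weight propagation over a knowledge graph.
import torch
import torch.nn as nn

class ClassifierPropagation(nn.Module):
    def __init__(self, dim: int, steps: int = 2):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)
        self.steps = steps

    def forward(self, weights: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # weights: (num_categories, dim) classifier weights (novel ones barely trained);
        # adj: (num_categories, num_categories) row-normalized semantic correlations.
        h = weights
        for _ in range(self.steps):
            msg = adj @ h                                       # aggregate correlated categories
            h = h + torch.tanh(self.update(torch.cat([h, msg], dim=1)))
        return h                                                # refined classifier weights
```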

* Accepted by AAAI 2020 as an oral paper 

Learning Semantic-Specific Graph Representation for Multi-Label Image Recognition

Aug 20, 2019
Tianshui Chen, Muxin Xu, Xiaolu Hui, Hefeng Wu, Liang Lin

Recognizing multiple labels of an image is a practical and challenging task, and significant progress has been made by searching for semantic-aware regions and modeling label dependency. However, current methods cannot locate the semantic regions accurately due to the lack of part-level supervision or semantic guidance. Moreover, they cannot fully explore the mutual interactions among the semantic regions and do not explicitly model the label co-occurrence. To address these issues, we propose a Semantic-Specific Graph Representation Learning (SSGRL) framework that consists of two crucial modules: 1) a semantic decoupling module that incorporates category semantics to guide learning semantic-specific representations and 2) a semantic interaction module that correlates these representations with a graph built on the statistical label co-occurrence and explores their interactions via a graph propagation mechanism. Extensive experiments on public benchmarks show that our SSGRL framework outperforms current state-of-the-art methods by a sizable margin, e.g., with an mAP improvement of 2.5%, 2.6%, 6.7%, and 3.1% on the PASCAL VOC 2007 & 2012, Microsoft COCO, and Visual Genome benchmarks, respectively. Our code and models are available at https://github.com/HCPLab-SYSU/SSGRL.
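
As a rough illustration of the semantic decoupling module, the sketch below lets category embeddings attend over spatial image features to pool one semantic-specific vector per label; the dimensions and the additive-attention form are illustrative assumptions rather than the released SSGRL code.

```python
# Sketch of semantic decoupling via category-guided attention.
import torch
import torch.nn as nn

class SemanticDecoupling(nn.Module):
    def __init__(self, feat_dim: int, emb_dim: int, hidden: int = 256):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, hidden)
        self.proj_emb = nn.Linear(emb_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, feats: torch.Tensor, embeds: torch.Tensor) -> torch.Tensor:
        # feats: (HW, feat_dim) spatial image features; embeds: (C, emb_dim) category embeddings.
        joint = torch.tanh(self.proj_feat(feats).unsqueeze(0) +
                           self.proj_emb(embeds).unsqueeze(1))      # (C, HW, hidden)
        attn = torch.softmax(self.score(joint).squeeze(-1), dim=1)  # (C, HW) attention per category
        return attn @ feats                                         # (C, feat_dim) semantic-specific features
```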

* Accepted by ICCV 2019 

Semi-Supervised Video Salient Object Detection Using Pseudo-Labels

Aug 12, 2019
Pengxiang Yan, Guanbin Li, Yuan Xie, Zhen Li, Chuan Wang, Tianshui Chen, Liang Lin

Deep learning-based video salient object detection has recently achieved great success, significantly outperforming unsupervised methods. However, existing data-driven approaches heavily rely on a large quantity of pixel-wise annotated video frames to deliver such promising results. In this paper, we address the semi-supervised video salient object detection task using pseudo-labels. Specifically, we present an effective video saliency detector that consists of a spatial refinement network and a spatiotemporal module. Based on the same refinement network and motion information in terms of optical flow, we further propose a novel method for generating pixel-level pseudo-labels from sparsely annotated frames. By utilizing the generated pseudo-labels together with a part of the manual annotations, our video saliency detector learns spatial and temporal cues for both contrast inference and coherence enhancement, thus producing accurate saliency maps. Experimental results demonstrate that our proposed semi-supervised method even greatly outperforms all the state-of-the-art fully supervised methods across three public benchmarks: VOS, DAVIS, and FBMS.
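
An illustrative sketch of the flow-based part of pseudo-label generation is given below: a labeled frame's saliency mask is warped to an unlabeled neighboring frame via backward warping with optical flow (the refinement-network step described above is omitted). The tensor layouts and bilinear warping are assumptions for illustration.

```python
# Sketch of warping a sparse annotation to a neighboring frame with optical flow.
import torch
import torch.nn.functional as F

def warp_mask(mask: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """mask: (N, 1, H, W) saliency mask of the labeled frame;
    flow: (N, 2, H, W) optical flow from the unlabeled frame to the labeled frame."""
    n, _, h, w = mask.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys), dim=-1).float().unsqueeze(0).expand(n, -1, -1, -1)
    grid = base + flow.permute(0, 2, 3, 1)            # follow the flow vectors
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(mask, grid, mode='bilinear', align_corners=True)
```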

* Accepted by ICCV 2019 

Knowledge-Embedded Routing Network for Scene Graph Generation

Mar 08, 2019
Tianshui Chen, Weihao Yu, Riquan Chen, Liang Lin

Understanding a scene in depth involves not only locating/recognizing individual objects but also inferring the relationships and interactions among them. However, since the distribution of real-world relationships is seriously unbalanced, existing methods perform quite poorly on the less frequent relationships. In this work, we find that the statistical correlations between object pairs and their relationships can effectively regularize the semantic space and make predictions less ambiguous, thus well addressing the unbalanced distribution issue. To achieve this, we incorporate these statistical correlations into deep neural networks to facilitate scene graph generation by developing a Knowledge-Embedded Routing Network. More specifically, we show that the statistical correlations between objects appearing in images and their relationships can be explicitly represented by a structured knowledge graph, and a routing mechanism is learned to propagate messages through the graph to explore their interactions. Extensive experiments on the large-scale Visual Genome dataset demonstrate the superiority of the proposed method over current state-of-the-art competitors.
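
To make the role of the statistical correlations concrete, the sketch below combines an empirical prior over relationships for a given (subject, object) category pair with the network's visual scores; KERN embeds this knowledge in a learned routing graph, so the fixed multiplicative prior here is only a simplified stand-in.

```python
# Simplified use of relationship co-occurrence statistics as a prediction prior.
import torch

def apply_statistical_prior(rel_logits: torch.Tensor,
                            prior: torch.Tensor,
                            subj: torch.Tensor,
                            obj: torch.Tensor,
                            eps: float = 1e-6) -> torch.Tensor:
    """rel_logits: (N, R) visual relationship scores for N object pairs;
    prior: (C, C, R) empirical p(relationship | subject class, object class);
    subj, obj: (N,) category indices of each pair."""
    return torch.log_softmax(rel_logits, dim=1) + torch.log(prior[subj, obj] + eps)
```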

* Accepted by CVPR 2019 

Neural Task Planning with And-Or Graph Representations

Aug 25, 2018
Tianshui Chen, Riquan Chen, Lin Nie, Xiaonan Luo, Xiaobai Liu, Liang Lin

This paper focuses on semantic task planning, i.e., predicting a sequence of actions toward accomplishing a specific task under a certain scene, which is a new problem in computer vision research. The primary challenges are how to model task-specific knowledge and how to integrate this knowledge into the learning procedure. In this work, we propose training a recurrent long short-term memory (LSTM) network to address this problem, i.e., taking a scene image (including pre-located objects) and the specified task as input and recurrently predicting action sequences. However, training such a network generally requires a large number of annotated samples to cover the semantic space (e.g., diverse action decomposition and ordering). To overcome this issue, we introduce a knowledge and-or graph (AOG) for task description, which hierarchically represents a task as atomic actions. With this AOG representation, we can produce many valid samples (i.e., action sequences that accord with common sense) by training another auxiliary LSTM network with a small set of annotated samples. Furthermore, these generated samples (i.e., task-oriented action sequences) effectively facilitate training of the model for semantic task planning. In our experiments, we create a new dataset that contains diverse daily tasks and extensively evaluate the effectiveness of our approach.
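
A toy sketch of the recurrent action predictor is shown below: a joint scene/task embedding initializes an LSTM that greedily emits one atomic action per step until a stop token. The vocabulary, dimensions, and greedy decoding are assumptions for illustration, not the paper's exact architecture.

```python
# Toy LSTM planner that decodes an action sequence from a scene/task embedding.
import torch
import torch.nn as nn

class ActionPlanner(nn.Module):
    def __init__(self, ctx_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.init_h = nn.Linear(ctx_dim, hidden)      # map the context to the initial hidden state
        self.embed = nn.Embedding(num_actions, hidden)
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.out = nn.Linear(hidden, num_actions)

    @torch.no_grad()
    def plan(self, context: torch.Tensor, start: int, stop: int, max_len: int = 20):
        # context: (1, ctx_dim) joint embedding of the scene image and the specified task.
        h = torch.tanh(self.init_h(context))
        c = torch.zeros_like(h)
        action, sequence = start, []
        for _ in range(max_len):
            h, c = self.lstm(self.embed(torch.tensor([action])), (h, c))
            action = int(self.out(h).argmax(dim=1))
            if action == stop:
                break
            sequence.append(action)
        return sequence
```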

* Submitted to TMM, under minor revision. arXiv admin note: text overlap with arXiv:1707.04677 