Tao Mei

WIDER Face and Pedestrian Challenge 2018: Methods and Results

Feb 19, 2019

Chen Change Loy, Dahua Lin, Wanli Ouyang, Yuanjun Xiong, Shuo Yang, Qingqiu Huang, Dongzhan Zhou, Wei Xia, Quanquan Li, Ping Luo(+42 more)

Abstract: This paper presents a review of the 2018 WIDER Challenge on Face and Pedestrian. The challenge focuses on the problem of precise localization of human faces and bodies, and accurate association of identities. It comprises three tracks: (i) WIDER Face, which aims at soliciting new approaches to advance the state of the art in face detection, (ii) WIDER Pedestrian, which aims to find effective and efficient approaches to address the problem of pedestrian detection in unconstrained environments, and (iii) WIDER Person Search, which presents an exciting challenge of searching persons across 192 movies. In total, 73 teams made valid submissions to the challenge tracks. We summarize the winning solutions for all three tracks and present discussions on open problems and potential research directions in these topics.

* Report of ECCV 2018 workshop: WIDER Face and Pedestrian Challenge

Rethinking Visual Relationships for High-level Image Understanding

Feb 01, 2019

Yuanzhi Liang, Yalong Bai, Wei Zhang, Xueming Qian, Li Zhu, Tao Mei

Abstract: Relationships, as the bond between isolated entities in images, reflect the interaction between objects and lead to a semantic understanding of scenes. Because current scene graph datasets suffer from visually-irrelevant relationships, utilizing relationships for semantic tasks is difficult. The datasets widely used in scene graph generation tasks are split from Visual Genome by label frequency, and can even be solved well by statistical counting. To encourage further development in relationships, we propose a novel method to mine more valuable relationships by automatically filtering out visually-irrelevant relationships. Then, we construct a new scene graph dataset named Visually-Relevant Relationships Dataset (VrR-VG) from Visual Genome. We evaluate several existing scene graph generation methods on our dataset. The results show that performance degrades significantly compared to the previous dataset and that frequency analysis no longer works on our dataset. Moreover, we propose a method to learn feature representations of instances, attributes, and visual relationships jointly from images, and we then apply the learned features to image captioning and visual question answering respectively. The improvements on both tasks demonstrate the efficiency of the features with relation information and the richer semantic information provided in our dataset.

Improved Selective Refinement Network for Face Detection

Jan 23, 2019

Shifeng Zhang, Rui Zhu, Xiaobo Wang, Hailin Shi, Tianyu Fu, Shuo Wang, Tao Mei, Stan Z. Li

Abstract: As a long-standing problem in computer vision, face detection has attracted much attention in recent decades for its practical applications. With the availability of the WIDER FACE face detection benchmark dataset, much progress has been made by various algorithms in recent years. Among them, the Selective Refinement Network (SRN) face detector selectively introduces two-step classification and regression operations into an anchor-based face detector to reduce false positives and improve location accuracy simultaneously. Moreover, it designs a receptive field enhancement block to provide more diverse receptive fields. In this report, to further improve the performance of SRN, we exploit some existing techniques via extensive experiments, including a new data augmentation strategy, an improved backbone network, MS COCO pretraining, a decoupled classification module, a segmentation branch and a Squeeze-and-Excitation block. Some of these techniques bring performance improvements, while a few of them do not adapt well to our baseline. As a consequence, we present an improved SRN face detector by combining these useful techniques and obtain the best performance on the widely used WIDER FACE face detection benchmark dataset.

* Technical report, 8 pages, 6 figures

Multi-Granularity Reasoning for Social Relation Recognition from Images

Jan 10, 2019

Meng Zhang, Xinchen Liu, Wu Liu, Anfu Zhou, Huadong Ma, Tao Mei

Abstract: Discovering social relations in images can help machines better interpret the behavior of human beings. However, automatically recognizing social relations in images is a challenging task due to the significant gap between the domains of visual content and social relation. Existing studies separately process various features such as facial expressions, body appearance, and contextual objects, and thus cannot comprehensively capture the multi-granularity semantics, such as scenes, regional cues of persons, and interactions among persons and objects. To bridge the domain gap, we propose a Multi-Granularity Reasoning framework for social relation recognition from images. The global knowledge and mid-level details are learned from the whole scene and the regions of persons and objects, respectively. Most importantly, we explore the fine-granularity pose keypoints of persons to discover the interactions among persons and objects. Specifically, the pose-guided Person-Object Graph and Person-Pose Graph are proposed to model the actions from persons to objects and the interactions between paired persons, respectively. Based on the graphs, social relation reasoning is performed by graph convolutional networks. Finally, the global features and reasoned knowledge are integrated as a comprehensive representation for social relation recognition. Extensive experiments on two public datasets show the effectiveness of the proposed framework.

Support Vector Guided Softmax Loss for Face Recognition

Dec 29, 2018

Xiaobo Wang, Shuo Wang, Shifeng Zhang, Tianyu Fu, Hailin Shi, Tao Mei

Abstract: Face recognition has witnessed significant progress due to the advances of deep convolutional neural networks (CNNs), and its central challenge is feature discrimination. To address it, one group tries to exploit mining-based strategies (e.g., hard example mining and focal loss) to focus on the informative examples. The other group is devoted to designing margin-based loss functions (e.g., angular, additive and additive angular margins) to increase the feature margin from the perspective of the ground truth class. Both have been well verified to learn discriminative features. However, they suffer from either the ambiguity of hard examples or the lack of discriminative power of other classes. In this paper, we design a novel loss function, namely support vector guided softmax loss (SV-Softmax), which adaptively emphasizes the mis-classified points (support vectors) to guide discriminative feature learning. The developed SV-Softmax loss is thus able to eliminate the ambiguity of hard examples as well as absorb the discriminative power of other classes, and therefore results in more discriminative features. To the best of our knowledge, this is the first attempt to inherit the advantages of mining-based and margin-based losses into one framework. Experimental results on several benchmarks have demonstrated the effectiveness of our approach over state-of-the-art methods.
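
The idea of emphasizing mis-classified classes inside a softmax loss can be sketched roughly as follows. This is an illustrative sketch only: the abstract does not give the formula, so the weighting function, the scale `s`, and the emphasis parameter `t` below are assumptions and should be checked against the paper. Inputs are cosine similarities between a feature and each class weight.

```python
import math

def sv_softmax_loss(cosines, target, s=30.0, t=1.2):
    """Softmax loss that up-weights classes scoring above the target class.

    cosines: cosine similarity of one sample to every class (assumed input form)
    target:  index of the ground-truth class
    s, t:    scale and emphasis hyperparameters (hypothetical defaults)
    """
    cos_y = cosines[target]
    denom = math.exp(s * cos_y)
    for k, cos_k in enumerate(cosines):
        if k == target:
            continue
        # A non-target class that beats the target is treated as a "support vector".
        is_support_vector = cos_k > cos_y
        # Assumed weighting: equals 1 when t == 1, grows with t for support vectors.
        weight = math.exp(s * (t - 1.0) * (cos_k + 1.0)) if is_support_vector else 1.0
        denom += weight * math.exp(s * cos_k)
    return -math.log(math.exp(s * cos_y) / denom)
```

With `t = 1.0` the function reduces to an ordinary scaled softmax cross-entropy, and correctly classified samples are unaffected by `t`, which is the behavior the sketch is meant to show.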

To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression

Nov 03, 2018

Yitian Yuan, Tao Mei, Wenwu Zhu

Abstract: Given an untrimmed video and a sentence description, temporal sentence localization aims to automatically determine the start and end points of the described sentence within the video. The problem is challenging as it requires understanding both the video and the sentence. Existing research predominantly employs a costly "scan and localize" framework, neglecting the global video context and the specific details within sentences, both of which are critical to this problem. In this paper, we propose a novel Attention Based Location Regression (ABLR) approach to solve temporal sentence localization from a global perspective. Specifically, to preserve the context information, ABLR first encodes both video and sentence via Bidirectional LSTM networks. Then, a multi-modal co-attention mechanism is introduced to generate not only video attention, which reflects the global video structure, but also sentence attention, which highlights the crucial details for temporal localization. Finally, a novel attention based location regression network is designed to predict the temporal coordinates of the sentence query from the previous attention. ABLR is jointly trained in an end-to-end manner. Comprehensive experiments on the ActivityNet Captions and TACoS datasets demonstrate both the effectiveness and the efficiency of the proposed ABLR approach.
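
A predicted (start, end) pair is commonly scored against the annotated segment with temporal intersection over union. The abstract does not name the metric used, so the sketch below is an assumption about a typical evaluation, and `temporal_iou` is a hypothetical helper name.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments, e.g. in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0
```

For instance, a prediction of (10, 20) against a ground truth of (15, 25) overlaps for 5 seconds out of a 15-second union.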

ScratchDet: Exploring to Train Single-Shot Object Detectors from Scratch

Oct 19, 2018

Rui Zhu, Shifeng Zhang, Xiaobo Wang, Longyin Wen, Hailin Shi, Liefeng Bo, Tao Mei

Abstract: Current state-of-the-art object detectors are fine-tuned from off-the-shelf networks pretrained on large-scale classification datasets like ImageNet, which incurs some additional problems: 1) the domain gap between source and target datasets; 2) the learning objective bias between classification and detection; 3) the architecture limitations of the classification network for detection. In this paper, we design a new single-shot train-from-scratch object detector, called ScratchDet, based on the architectures of the ResNet and VGGNet based SSD models, to alleviate the aforementioned problems. Specifically, we study the impact of BatchNorm on training detectors from scratch, and find that using BatchNorm on the backbone and detection head subnetworks makes the detector converge well from scratch. After that, we explore the network architecture by analyzing the detection performance of ResNet and VGGNet, and introduce a new Root-ResNet backbone network to further improve the accuracy. Extensive experiments on the PASCAL VOC 2007, 2012 and MS COCO datasets demonstrate that ScratchDet achieves state-of-the-art performance among all train-from-scratch detectors and even outperforms existing one-stage pretrained methods without bells and whistles. Code will be made publicly available at https://github.com/KimSoybean/ScratchDet.

* 14 pages, 9 figures, submitted to AAAI2019

KTAN: Knowledge Transfer Adversarial Network

Oct 18, 2018

Peiye Liu, Wu Liu, Huadong Ma, Tao Mei, Mingoo Seok

Abstract: To reduce the large computation and storage cost of a deep convolutional neural network, knowledge distillation based methods pioneered transferring the generalization ability of a large (teacher) deep network to a light-weight (student) network. However, these methods mostly focus on transferring the probability distribution of the softmax layer in a teacher network and thus neglect the intermediate representations. In this paper, we propose a knowledge transfer adversarial network to better train a student network. Our technique holistically considers both intermediate representations and probability distributions of a teacher network. To transfer the knowledge of intermediate representations, we set high-level teacher feature maps as a target, toward which the student feature maps are trained. Specifically, we introduce a Teacher-to-Student layer to make our framework suitable for various student structures. The intermediate representation helps the student network better understand the transferred generalization compared to the probability distribution alone. Furthermore, we add an adversarial learning process by employing a discriminator network, which can fully exploit the spatial correlation of feature maps in training a student network. The experimental results demonstrate that the proposed method can significantly improve the performance of a student network on both image classification and object detection tasks.
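
For context, the softmax-level transfer that the abstract contrasts itself with is usually implemented as a temperature-softened KL divergence between teacher and student outputs. The sketch below shows that standard baseline only, not KTAN's feature-map or adversarial components; the function names and the default temperature are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Scaling by T^2 is the usual convention to keep gradient magnitudes comparable.
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's logits and positive otherwise, so it only constrains the output distribution, which is the limitation the abstract points to.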

* 8 pages, 2 figures

Exploring Visual Relationship for Image Captioning

Sep 19, 2018

Ting Yao, Yingwei Pan, Yehao Li, Tao Mei

Abstract: It is widely believed that modeling relationships between objects would be helpful for representing and eventually describing an image. Nevertheless, there has been no evidence supporting this idea for image description generation. In this paper, we introduce a new design to explore the connections between objects for image captioning under the umbrella of the attention-based encoder-decoder framework. Specifically, we present a Graph Convolutional Networks plus Long Short-Term Memory (dubbed GCN-LSTM) architecture that integrates both semantic and spatial object relationships into the image encoder in a novel way. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representations of each region proposed on objects are then refined by leveraging the graph structure through GCN. With the learnt region-level features, our GCN-LSTM capitalizes on an LSTM-based captioning framework with an attention mechanism for sentence generation. Extensive experiments are conducted on the COCO image captioning dataset, and superior results are reported compared to state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on the COCO testing set.
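
Refining region features over an object graph can be illustrated with a single generic graph convolution step. This is a minimal sketch assuming an undirected, unlabeled adjacency matrix with self-loops and row normalization; the paper's own GCN variant (for example, how edge direction or relation labels are handled) is not described in the abstract and is not reproduced here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(adjacency, features, weights):
    """One graph convolution: ReLU(normalize(A + I) @ H @ W)."""
    n = len(adjacency)
    # Add self-loops so each region keeps its own feature.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Row-normalize so each region averages over itself and its neighbours.
    a_norm = [[v / sum(row) for v in row] for row in a_hat]
    aggregated = matmul(a_norm, features)
    projected = matmul(aggregated, weights)
    return [[max(0.0, v) for v in row] for row in projected]
```

Each output row mixes a region's feature with those of the regions it is connected to, so isolated regions pass through unchanged apart from the projection.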

* ECCV 2018

Subspace Clustering by Block Diagonal Representation

May 23, 2018

Canyi Lu, Jiashi Feng, Zhouchen Lin, Tao Mei, Shuicheng Yan

Abstract: This paper studies the subspace clustering problem. Given some data points approximately drawn from a union of subspaces, the goal is to group these data points into their underlying subspaces. Many subspace clustering methods have been proposed, among which sparse subspace clustering and low-rank representation are two representative ones. Despite the different motivations, we observe that many existing methods share a common block diagonal property, which possibly leads to correct clustering, yet with their proofs given case by case. In this work, we first consider a general formulation and provide a unified theoretical guarantee of the block diagonal property. The block diagonal property of many existing methods falls into our special case. Second, we observe that many existing methods approximate the block diagonal representation matrix by using different structure priors, e.g., sparsity and low-rankness, which are indirect. We propose the first block diagonal matrix induced regularizer for directly pursuing the block diagonal matrix. With this regularizer, we solve the subspace clustering problem by Block Diagonal Representation (BDR), which uses the block diagonal structure prior. The BDR model is nonconvex, and we propose an alternating minimization solver and prove its convergence. Experiments on real datasets demonstrate the effectiveness of BDR.
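
A representation matrix is block diagonal (up to a permutation of the data points) exactly when the graph it defines splits into disconnected components, one per block. The sketch below counts those components as a simple diagnostic; it is not the BDR regularizer or solver, and `count_blocks` and its tolerance are hypothetical.

```python
def count_blocks(affinity, tol=1e-8):
    """Count connected components of the graph defined by a square matrix."""
    n = len(affinity)
    seen = [False] * n
    blocks = 0
    for start in range(n):
        if seen[start]:
            continue
        # Every unvisited point starts a new block; flood-fill its neighbours.
        blocks += 1
        seen[start] = True
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                # Treat the matrix as symmetric: an entry in either direction links i and j.
                linked = abs(affinity[i][j]) > tol or abs(affinity[j][i]) > tol
                if linked and not seen[j]:
                    seen[j] = True
                    stack.append(j)
    return blocks
```

If the count matches the number of subspaces, spectral clustering on the affinity can in principle separate them; a single spurious cross-subspace entry is enough to merge two blocks.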

* IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2018
