Haibin Ling

M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network

Nov 13, 2018
Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, Haibin Ling

Feature pyramids are widely exploited by both state-of-the-art one-stage object detectors (e.g., DSSD, RetinaNet, RefineDet) and two-stage object detectors (e.g., Mask R-CNN, DetNet) to alleviate the problem arising from scale variation across object instances. Although these object detectors with feature pyramids achieve encouraging results, they have some limitations because they simply construct the feature pyramid from the inherent multi-scale, pyramidal architecture of backbones that are actually designed for the object classification task. In this work, we present a method called Multi-Level Feature Pyramid Network (MLFPN) to construct more effective feature pyramids for detecting objects of different scales. First, we fuse multi-level features (i.e., multiple layers) extracted by the backbone as the base feature. Second, we feed the base feature into a block of alternating joint Thinned U-shape Modules and Feature Fusion Modules and exploit the decoder layers of each U-shape module as the features for detecting objects. Finally, we gather the decoder layers with equivalent scales (sizes) to develop a feature pyramid for object detection, in which every feature map consists of layers (features) from multiple levels. To evaluate the effectiveness of the proposed MLFPN, we design and train a powerful end-to-end one-stage object detector, called M2Det, by integrating MLFPN into the architecture of SSD; it achieves better detection performance than state-of-the-art one-stage detectors. Specifically, on the MS-COCO benchmark, M2Det achieves an AP of 41.0 at 11.8 FPS with a single-scale inference strategy and an AP of 44.2 with a multi-scale inference strategy, the new state-of-the-art results among one-stage detectors. The code will be made available at \url{https://github.com/qijiezhao/M2Det}.
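To make the multi-level pyramid idea concrete, here is a minimal PyTorch sketch of the pipeline described above: decoder outputs of stacked U-shape modules, each fed by a fusion of the base feature with the previous module's finest output, are gathered scale by scale into one pyramid level each. All class names (TinyTUM, TinyMLFPN), channel counts, and layer choices are illustrative assumptions, not the authors' implementation.

    # Minimal sketch (PyTorch) of the multi-level feature pyramid idea; all
    # names and sizes are hypothetical, not the authors' implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyTUM(nn.Module):
        # A "thinned U-shape module": encode with strided convs, decode by
        # upsampling with skip connections; the decoder maps are the outputs.
        def __init__(self, ch=128, scales=3):
            super().__init__()
            self.down = nn.ModuleList(
                nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(scales - 1))
            self.up = nn.ModuleList(
                nn.Conv2d(ch, ch, 3, padding=1) for _ in range(scales - 1))

        def forward(self, x):
            enc = [x]
            for d in self.down:
                enc.append(F.relu(d(enc[-1])))
            dec = [enc[-1]]
            for u, skip in zip(self.up, reversed(enc[:-1])):
                y = F.interpolate(dec[-1], size=skip.shape[-2:], mode='nearest')
                dec.append(F.relu(u(y)) + skip)
            return dec  # decoder features, coarsest -> finest

    class TinyMLFPN(nn.Module):
        # Stack TUMs; each later TUM consumes the base feature fused (via a
        # 1x1-conv "FFM") with the previous TUM's finest decoder output.
        def __init__(self, ch=128, levels=2, scales=3):
            super().__init__()
            self.tums = nn.ModuleList(TinyTUM(ch, scales) for _ in range(levels))
            self.ffms = nn.ModuleList(
                nn.Conv2d(2 * ch, ch, 1) for _ in range(levels - 1))

        def forward(self, base):
            outs = [self.tums[0](base)]
            for tum, ffm in zip(self.tums[1:], self.ffms):
                fused = F.relu(ffm(torch.cat([base, outs[-1][-1]], dim=1)))
                outs.append(tum(fused))
            # Gather decoder maps of the same scale from every level into one
            # pyramid level each, so each level mixes multi-level features.
            return [torch.cat([o[s] for o in outs], dim=1)
                    for s in range(len(outs[0]))]

    pyramid = TinyMLFPN()(torch.randn(1, 128, 40, 40))
    print([p.shape for p in pyramid])  # 10x10, 20x20, 40x40 maps, 256 channels each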

* AAAI19 

Scene Parsing via Dense Recurrent Neural Networks with Attentional Selection

Nov 09, 2018
Heng Fan, Peng Chu, Longin Jan Latecki, Haibin Ling

Recurrent neural networks (RNNs) have shown the ability to improve scene parsing by capturing long-range dependencies among image units. In this paper, we propose dense RNNs for scene labeling that explore various long-range semantic dependencies among image units. Different from existing RNN-based approaches, our dense RNNs capture richer contextual dependencies for each image unit by enabling immediate connections between each pair of image units, which significantly enhances their discriminative power. In addition, to select relevant dependencies and restrain irrelevant ones for each unit among the dense connections, we introduce an attention model into the dense RNNs. The attention model automatically assigns more importance to helpful dependencies and less weight to irrelevant ones. By integrating the dense RNNs with convolutional neural networks (CNNs), we develop an end-to-end scene labeling system. Extensive experiments on three large-scale benchmarks demonstrate that the proposed approach improves the baselines by large margins and outperforms other state-of-the-art algorithms.
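The attentional selection step can be illustrated with a small sketch: every image unit attends to every other unit, and the attention weights decide which long-range dependencies to keep and which to suppress. The sketch uses generic dot-product attention with placeholder shapes, not the paper's exact dense-RNN formulation.

    # Sketch of attentional selection over dense pairwise dependencies between
    # image units; generic dot-product attention with placeholder shapes.
    import torch
    import torch.nn.functional as F

    def dense_context(units, Wq, Wk):
        # units: (N, C) features of N image units (e.g., flattened feature-map cells).
        q = units @ Wq                                          # (N, d) queries
        k = units @ Wk                                          # (N, d) keys
        att = F.softmax(q @ k.t() / q.shape[1] ** 0.5, dim=-1)  # (N, N) weights
        # Each row weights the contributions of all other units, so helpful
        # dependencies get large weights and irrelevant ones are suppressed.
        return att @ units                                      # (N, C) attended context

    N, C, d = 64, 32, 16
    ctx = dense_context(torch.randn(N, C), torch.randn(C, d), torch.randn(C, d))
    print(ctx.shape)  # torch.Size([64, 32])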

* 10 pages. arXiv admin note: substantial text overlap with arXiv:1801.06831 

A Single-shot-per-pose Camera-Projector Calibration System For Imperfect Planar Targets

Oct 17, 2018
Bingyao Huang, Samed Ozdemir, Ying Tang, Chunyuan Liao, Haibin Ling

Existing camera-projector calibration methods typically warp feature points from a camera image to a projector image using estimated homographies, and often suffer from errors in camera parameters and noise due to imperfect planarity of the calibration target. In this paper we propose a simple yet robust solution that explicitly deals with these challenges. Following the structured light (SL) camera-projector calibration framework, a carefully designed correspondence algorithm is built on top of De Bruijn patterns. This correspondence is then used for initial camera-projector calibration. Then, to gain more robustness against noise, especially that from an imperfect planar calibration board, a bundle adjustment algorithm is developed to jointly optimize the estimated camera and projector models. Aside from its robustness, our solution requires only one shot of the SL pattern for each calibration board pose, which is much more convenient than multi-shot solutions in practice. Validations are conducted on both synthetic and real datasets, and our method shows clear advantages over existing methods in all experiments.
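The joint-refinement idea can be sketched as a small nonlinear least-squares problem: after an initial calibration, camera and projector parameters are refined together so that the total reprojection error in both image planes is minimized. The toy model below (SciPy, pinhole projection, focal lengths and principal points only, fixed 3D points, no extrinsics or distortion) is purely illustrative; the paper's bundle adjustment optimizes fuller camera and projector models.

    # Toy joint refinement (SciPy): refine camera and projector intrinsics so the
    # summed reprojection error in both views is minimized. Pinhole model without
    # distortion, fixed 3D points, no extrinsics -- illustrative only.
    import numpy as np
    from scipy.optimize import least_squares

    def project(points_3d, f, c):
        # Simple pinhole projection: divide by depth, scale by focal length f,
        # shift by principal point c.
        return points_3d[:, :2] / points_3d[:, 2:3] * f + c

    def residuals(params, pts3d, cam_obs, prj_obs):
        f_cam, f_prj = params[0], params[1]
        c_cam, c_prj = params[2:4], params[4:6]
        r_cam = project(pts3d, f_cam, c_cam) - cam_obs   # camera reprojection error
        r_prj = project(pts3d, f_prj, c_prj) - prj_obs   # projector reprojection error
        return np.concatenate([r_cam.ravel(), r_prj.ravel()])

    pts3d = np.random.rand(50, 3) + [0.0, 0.0, 2.0]      # synthetic board points
    cam_obs = project(pts3d, 800.0, np.array([320.0, 240.0])) + np.random.randn(50, 2) * 0.5
    prj_obs = project(pts3d, 900.0, np.array([400.0, 300.0])) + np.random.randn(50, 2) * 0.5
    x0 = np.array([850.0, 850.0, 320.0, 240.0, 400.0, 300.0])   # rough initial guess
    sol = least_squares(residuals, x0, args=(pts3d, cam_obs, prj_obs))
    print(sol.x[:2])   # jointly refined focal lengths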

* Adjunct Proceedings of the IEEE International Symposium on Mixed and Augmented Reality 2018. Source code: https://github.com/BingyaoHuang/single-shot-pro-cam-calib/ 

LaSOT: A High-quality Benchmark for Large-scale Single Object Tracking

Sep 20, 2018
Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, Haibin Ling

In this paper, we present LaSOT, a high-quality benchmark for Large-scale Single Object Tracking. LaSOT consists of 1,400 sequences with more than 3.5M frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box, making LaSOT, to the best of our knowledge, the largest densely annotated tracking benchmark. The average sequence length of LaSOT is more than 2,500 frames, and each sequence contains various challenges arising from the wild, where target objects may disappear and then re-appear in the view. By releasing LaSOT, we expect to provide the community with a large-scale, high-quality dedicated benchmark for both training deep trackers and faithfully evaluating tracking algorithms. Moreover, considering the close connection between visual appearance and natural language, we enrich LaSOT with additional language specifications, aiming to encourage the exploration of natural language features for tracking. A thorough experimental evaluation of 35 tracking algorithms on LaSOT is presented with detailed analysis, and the results demonstrate that there is still significant room for improvement. The benchmark and evaluation results are made publicly available at https://cis.temple.edu/lasot/.

* 17 pages, including supplementary material 

Privacy-Protective-GAN for Face De-identification

Jun 23, 2018
Yifan Wu, Fan Yang, Haibin Ling

Face de-identification has become increasingly important as image sources grow explosively and become easily accessible. The advance of new face recognition techniques also raises concerns about privacy leakage. The mainstream pipelines of face de-identification are mostly based on the k-same framework, which has been criticized for low effectiveness and poor visual quality. In this paper, we propose a new framework called Privacy-Protective-GAN (PP-GAN) that adapts a GAN with novel verificator and regulator modules, specially designed for the face de-identification problem, to generate de-identified output that retains structural similarity to a single input. We evaluate the proposed approach in terms of privacy protection, utility preservation, and structural similarity. Our approach not only outperforms existing face de-identification techniques but also provides a practical framework for adapting GANs with priors of domain knowledge.
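One way to picture how such modules could be combined is a single generator objective with an adversarial term for realism, a verificator-style term pushing the output identity away from the input, and a regulator-style term keeping structure similar. The sketch below is a hypothetical composition with placeholder weights and inputs, not the paper's actual losses.

    # Hypothetical composition of the three objectives into one generator loss
    # (PyTorch); weights, inputs and terms are placeholders, not the paper's.
    import torch
    import torch.nn.functional as F

    def generator_loss(d_fake_logits, id_emb_in, id_emb_out, ssim_in_out,
                       w_adv=1.0, w_id=1.0, w_struct=1.0):
        # Adversarial term: the generator tries to make fakes look real.
        adv = F.binary_cross_entropy_with_logits(
            d_fake_logits, torch.ones_like(d_fake_logits))
        # Verificator-style term: penalize identity similarity between input and output.
        ident = F.cosine_similarity(id_emb_in, id_emb_out).clamp(min=0).mean()
        # Regulator-style term: keep a structural-similarity score (e.g., SSIM) high.
        struct = (1.0 - ssim_in_out).mean()
        return w_adv * adv + w_id * ident + w_struct * struct

    loss = generator_loss(torch.randn(4, 1), torch.randn(4, 128),
                          torch.randn(4, 128), torch.rand(4))
    print(loss.item())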

Planar Object Tracking in the Wild: A Benchmark

May 22, 2018
Pengpeng Liang, Yifan Wu, Hu Lu, Liming Wang, Chunyuan Liao, Haibin Ling

Planar object tracking is an actively studied problem in vision-based robotic applications. While several benchmarks have been constructed for evaluating state-of-the-art algorithms, there is a lack of video sequences captured in the wild rather than in constrained laboratory environments. In this paper, we present a carefully designed planar object tracking benchmark containing 210 videos of 30 planar objects sampled in natural environments. In particular, for each object, we capture seven videos involving various challenging factors, namely scale change, rotation, perspective distortion, motion blur, occlusion, out-of-view, and unconstrained. The ground truth is carefully annotated semi-manually to ensure quality. Moreover, eleven state-of-the-art algorithms are evaluated on the benchmark using two evaluation metrics, with detailed analysis provided for the evaluation results. We expect the proposed benchmark to benefit future studies on planar object tracking.

* Accepted by ICRA 2018 

Robust and Efficient Graph Correspondence Transfer for Person Re-identification

May 15, 2018
Qin Zhou, Heng Fan, Hua Yang, Hang Su, Shibao Zheng, Shuang Wu, Haibin Ling

Spatial misalignment caused by variations in poses and viewpoints is one of the most critical issues hindering performance improvement in existing person re-identification (Re-ID) algorithms. To address this problem, we present a robust and efficient graph correspondence transfer (REGCT) approach for explicit spatial alignment in Re-ID. Specifically, we propose to establish the patch-wise correspondences of positive training pairs via graph matching. By exploiting both spatial and visual contexts of human appearance in graph matching, meaningful semantic correspondences can be obtained. To circumvent the cumbersome \emph{on-line} graph matching in the testing phase, we propose to transfer the \emph{off-line} learned patch-wise correspondences from the positive training pairs to test pairs. In detail, for each test pair, the training pairs with similar pose-pair configurations are selected as references. The matching patterns (i.e., the correspondences) of the selected references are then utilized to calculate the patch-wise feature distances of this test pair. To enhance the robustness of correspondence transfer, we design a novel pose context descriptor to accurately model human body configurations, and present an approach to measure the similarity between a pair of pose context descriptors. Meanwhile, to improve testing efficiency, we propose a correspondence template ensemble method using a voting mechanism, which significantly reduces the number of patch-wise matchings involved in distance calculation. With the aforementioned strategies, the REGCT model can effectively and efficiently handle the spatial misalignment problem in Re-ID. Extensive experiments on five challenging benchmarks, including VIPeR, Road, PRID450S, 3DPES and CUHK01, demonstrate the superior performance of REGCT over other state-of-the-art approaches.
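The transfer step can be sketched as follows: given a test pair, select the training pairs whose pose-pair descriptors are most similar, then reuse their patch correspondences to compute an aligned patch-wise distance. Descriptors and features below are random placeholders, the similarity measure is plain cosine, and a simple average replaces the paper's voting-based template ensemble.

    # Sketch of the transfer step (NumPy): pick reference training pairs with
    # similar pose-pair descriptors and reuse their patch correspondences to
    # compute an aligned distance. All inputs are random placeholders.
    import numpy as np

    def transfer_distance(test_pose, feat_a, feat_b, train_poses, train_corrs, k=3):
        # train_poses: (M, D) pose-pair descriptors of positive training pairs.
        # train_corrs: list of (P, 2) patch-index correspondences per training pair.
        sims = train_poses @ test_pose / (
            np.linalg.norm(train_poses, axis=1) * np.linalg.norm(test_pose) + 1e-8)
        refs = np.argsort(-sims)[:k]          # k most similar pose configurations
        dists = []
        for r in refs:
            ia, ib = train_corrs[r][:, 0], train_corrs[r][:, 1]
            # Compare only the patches that the reference pair put in correspondence.
            dists.append(np.linalg.norm(feat_a[ia] - feat_b[ib], axis=1).mean())
        return float(np.mean(dists))          # simple average over references

    M, D, P, C = 20, 8, 6, 16
    corrs = [np.stack([np.random.permutation(P), np.arange(P)], axis=1) for _ in range(M)]
    d = transfer_distance(np.random.rand(D), np.random.rand(P, C), np.random.rand(P, C),
                          np.random.rand(M, D), corrs)
    print(d)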

* Tech. Report. The source code is available at http://www.dabi.temple.edu/~hbling/code/gct.htm. arXiv admin note: text overlap with arXiv:1804.00242 

MTFH: A Matrix Tri-Factorization Hashing Framework for Efficient Cross-Modal Retrieval

May 04, 2018
Xin Liu, Zhikai Hu, Haibin Ling, Yiu-ming Cheung

Hashing has recently sparked a great revolution in cross-modal retrieval due to its low storage cost and high query speed. Most existing cross-modal hashing methods learn unified hash codes in a common Hamming space to represent all multi-modal data and make them intuitively comparable. However, such unified hash codes could inherently sacrifice representation scalability, because the data from different modalities may not have one-to-one correspondence and could be stored more efficiently with hash codes of unequal lengths. To mitigate this problem, this paper proposes a generalized and flexible cross-modal hashing framework, termed Matrix Tri-Factorization Hashing (MTFH), which not only preserves the semantic similarity between multi-modal data points, but also works seamlessly in various settings, including paired or unpaired multi-modal data and equal or varying hash length encoding scenarios. Specifically, MTFH exploits an efficient objective function to jointly learn flexible modality-specific hash codes with different length settings, while simultaneously excavating two semantic correlation matrices to make heterogeneous data comparable. As a result, the derived hash codes are more semantically meaningful for various challenging cross-modal retrieval tasks. Extensive experiments on public benchmark datasets highlight the superiority of MTFH under various retrieval scenarios and show its highly competitive performance against the state-of-the-art.
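The tri-factorization idea can be illustrated by approximating a cross-modal affinity matrix S with two code matrices of unequal lengths linked by a correlation matrix, S ≈ U R V^T. The sketch below uses plain alternating least squares plus a sign step in NumPy; the paper's actual objective, constraints, and optimization differ.

    # Toy tri-factorization (NumPy): approximate a cross-modal affinity matrix S
    # (n_x x n_y) by codes U (n_x x k1) and V (n_y x k2) of unequal lengths linked
    # by a correlation matrix R (k1 x k2). Alternating least squares plus a sign
    # step; illustrative only.
    import numpy as np

    def tri_factor_hash(S, k1, k2, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        U = rng.standard_normal((S.shape[0], k1))
        V = rng.standard_normal((S.shape[1], k2))
        R = rng.standard_normal((k1, k2))
        for _ in range(iters):
            U = S @ np.linalg.pinv(R @ V.T)                   # fix R, V; least-squares U
            V = S.T @ np.linalg.pinv(U @ R).T                 # fix U, R; least-squares V
            R = np.linalg.pinv(U) @ S @ np.linalg.pinv(V).T   # fix U, V; solve R
        return np.sign(U), R, np.sign(V)                      # binary codes of different lengths

    S = (np.random.rand(40, 60) > 0.5).astype(float) * 2 - 1  # toy +/-1 affinity
    Bu, R, Bv = tri_factor_hash(S, k1=16, k2=24)
    print(Bu.shape, Bv.shape)   # (40, 16) (60, 24)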

* 14 pages; submitted to an IEEE journal 

Weighted Bilinear Coding over Salient Body Parts for Person Re-identification

Apr 30, 2018
Qin Zhou, Heng Fan, Hang Su, Hua Yang, Shibao Zheng, Haibin Ling

Deep convolutional neural networks (CNNs) have demonstrated dominant performance in person re-identification (Re-ID). Existing CNN-based methods utilize global average pooling (GAP) to aggregate intermediate convolutional features for Re-ID. However, this strategy only considers the first-order statistics of local features and treats local features at different locations as equally important, leading to sub-optimal feature representations. To deal with these issues, we propose a novel \emph{weighted bilinear coding} (WBC) model for local feature aggregation in CNNs to pursue more representative and discriminative feature representations. Specifically, bilinear coding is used to encode channel-wise feature correlations to capture richer feature interactions. Meanwhile, a weighting scheme is applied to the bilinear coding to adaptively adjust the weights of local features at different locations based on their importance for recognition, further improving the discriminability of the aggregated features. To handle the spatial misalignment issue, we use a salient part net to derive salient body parts and apply the WBC model to each part. The final representation, formed by concatenating the WBC-encoded features of each part, is both discriminative and resistant to spatial misalignment. Experiments on three benchmarks, including Market-1501, DukeMTMC-reID and CUHK03, demonstrate the favorable performance of our method against other state-of-the-art methods.
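As a rough illustration of weighted bilinear coding, each spatial location of a part's feature map contributes the outer product of its feature vector with itself, scaled by an importance weight, and the aggregate is signed-square-rooted and l2-normalized. Shapes and the source of the weights are placeholder assumptions; this is a sketch, not the paper's implementation.

    # Rough sketch of weighted bilinear coding (PyTorch) on one part's feature
    # map; shapes and the source of the weights are placeholders.
    import torch
    import torch.nn.functional as F

    def weighted_bilinear_coding(feat, weights):
        # feat:    (C, H, W) convolutional features of one body part.
        # weights: (H, W) non-negative importance weights (e.g., from a small subnet).
        C, H, W = feat.shape
        x = feat.reshape(C, H * W)                 # local features as columns
        w = weights.reshape(H * W)
        coded = (x * w) @ x.t()                    # sum_i w_i * x_i x_i^T, shape (C, C)
        coded = torch.sign(coded) * torch.sqrt(coded.abs() + 1e-12)   # signed sqrt
        return F.normalize(coded.flatten(), dim=0)  # l2-normalized part descriptor

    desc = weighted_bilinear_coding(torch.randn(64, 8, 4), torch.rand(8, 4))
    print(desc.shape)  # torch.Size([4096]); part descriptors are then concatenated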

* This manuscript is under consideration at Pattern Recognition Letters 

Vision Meets Drones: A Challenge

Apr 23, 2018
Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, Qinghua Hu

In this paper we present a large-scale visual object detection and tracking benchmark, named VisDrone2018, aimed at advancing visual understanding tasks on the drone platform. The images and video sequences in the benchmark were captured over various urban and suburban areas of 14 different cities across China, from north to south. Specifically, VisDrone2018 consists of 263 video clips and 10,209 images (with no overlap with the video clips) with rich annotations, including object bounding boxes, object categories, occlusion, truncation ratios, etc. With an intensive amount of annotation effort, our benchmark has more than 2.5 million annotated instances in 179,264 images/video frames. Being the largest such dataset ever published, the benchmark enables extensive evaluation and investigation of visual analysis algorithms on the drone platform. In particular, we design four popular tasks on the benchmark: object detection in images, object detection in videos, single object tracking, and multi-object tracking. All these tasks are extremely challenging in the proposed dataset due to factors such as occlusion, large scale and pose variation, and fast motion. We hope the benchmark will largely boost research and development in visual analysis on drone platforms.

* 11 pages, 11 figures 