Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Object Detection": models, code, and papers

Exploiting Web Images for Weakly Supervised Object Detection

Jul 28, 2017
Qingyi Tao, Hao Yang, Jianfei Cai

In recent years, the performance of object detection has advanced significantly with the evolving deep convolutional neural networks. However, the state-of-the-art object detection methods still rely on accurate bounding box annotations that require extensive human labelling. Object detection without bounding box annotations, i.e, weakly supervised detection methods, are still lagging far behind. As weakly supervised detection only uses image level labels and does not require the ground truth of bounding box location and label of each object in an image, it is generally very difficult to distill knowledge of the actual appearances of objects. Inspired by curriculum learning, this paper proposes an easy-to-hard knowledge transfer scheme that incorporates easy web images to provide prior knowledge of object appearance as a good starting point. While exploiting large-scale free web imagery, we introduce a sophisticated labour free method to construct a web dataset with good diversity in object appearance. After that, semantic relevance and distribution relevance are introduced and utilized in the proposed curriculum training scheme. Our end-to-end learning with the constructed web data achieves remarkable improvement across most object classes especially for the classes that are often considered hard in other works.


Labels Are Not Perfect: Inferring Spatial Uncertainty in Object Detection

Dec 18, 2020
Di Feng, Zining Wang, Yiyang Zhou, Lars Rosenbaum, Fabian Timm, Klaus Dietmayer, Masayoshi Tomizuka, Wei Zhan

The availability of many real-world driving datasets is a key reason behind the recent progress of object detection algorithms in autonomous driving. However, there exist ambiguity or even failures in object labels due to error-prone annotation process or sensor observation noise. Current public object detection datasets only provide deterministic object labels without considering their inherent uncertainty, as does the common training process or evaluation metrics for object detectors. As a result, an in-depth evaluation among different object detection methods remains challenging, and the training process of object detectors is sub-optimal, especially in probabilistic object detection. In this work, we infer the uncertainty in bounding box labels from LiDAR point clouds based on a generative model, and define a new representation of the probabilistic bounding box through a spatial uncertainty distribution. Comprehensive experiments show that the proposed model reflects complex environmental noises in LiDAR perception and the label quality. Furthermore, we propose Jaccard IoU (JIoU) as a new evaluation metric that extends IoU by incorporating label uncertainty. We conduct an in-depth comparison among several LiDAR-based object detectors using the JIoU metric. Finally, we incorporate the proposed label uncertainty in a loss function to train a probabilistic object detector and to improve its detection accuracy. We verify our proposed methods on two public datasets (KITTI, Waymo), as well as on simulation data. Code is released at


Detecting and Tracking Small Moving Objects in Wide Area Motion Imagery (WAMI) Using Convolutional Neural Networks (CNNs)

Nov 08, 2019
Yifan Zhou, Simon Maskell

This paper proposes an approach to detect moving objects in Wide Area Motion Imagery (WAMI), in which the objects are both small and well separated. Identifying the objects only using foreground appearance is difficult since a $100-$pixel vehicle is hard to distinguish from objects comprising the background. Our approach is based on background subtraction as an efficient and unsupervised method that is able to output the shape of objects. In order to reliably detect low contrast and small objects, we configure the background subtraction to extract foreground regions that might be objects of interest. While this dramatically increases the number of false alarms, a Convolutional Neural Network (CNN) considering both spatial and temporal information is then trained to reject the false alarms. In areas with heavy traffic, the background subtraction yields merged detections. To reduce the complexity of multi-target tracker needed, we train another CNN to predict the positions of multiple moving objects in an area. Our approach shows competitive detection performance on smaller objects relative to the state-of-the-art. We adopt a GM-PHD filter to associate detections over time and analyse the resulting performance.

* Accepted for publication in 22nd International Conference on Information Fusion (FUSION 2019) 

Object Detection under Rainy Conditions for Autonomous Vehicles

Jun 30, 2020
Mazin Hnewa, Hayder Radha

Advanced automotive active-safety systems, in general, and autonomous vehicles, in particular, rely heavily on visual data to classify and localize objects such as pedestrians, traffic signs and lights, and other nearby cars, to assist the corresponding vehicles maneuver safely in their environments. However, the performance of object detection methods could degrade rather significantly under challenging weather scenarios including rainy conditions. Despite major advancements in the development of deraining approaches, the impact of rain on object detection has largely been understudied, especially in the context of autonomous driving. The main objective of this paper is to present a tutorial on state-of-the-art and emerging techniques that represent leading candidates for mitigating the influence of rainy conditions on an autonomous vehicle\textsc{'}s ability to detect objects. Our goal includes surveying and analyzing the performance of object detection methods trained and tested using visual data captured under clear and rainy conditions. Moreover, we survey and evaluate the efficacy and limitations of leading deraining approaches, deep-learning based domain adaptation, and image translation frameworks that are being considered for addressing the problem of object detection under rainy conditions. Experimental results of a variety of the surveyed techniques are presented as part of this tutorial.


Center-based 3D Object Detection and Tracking

Jun 19, 2020
Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl

Three-dimensional objects are commonly represented as 3D boxes in a point-cloud. This representation mimics the well-studied image-based 2D bounding-box detection but comes with additional challenges. Objects in a 3D world do not follow any particular orientation, and box-based detectors have difficulties enumerating all orientations or fitting an axis-aligned bounding box to rotated objects. In this paper, we instead propose to represent, detect, and track 3D objects as points. We use a keypoint detector to find centers of objects and simply regress to other attributes, including 3D size, 3D orientation, and velocity. In our center-based framework, 3D object tracking simplifies to greedy closest-point matching. The resulting detection and tracking algorithm is simple, efficient, and effective. On the nuScenes dataset, our point-based representations perform $3$-$4$ mAP higher than the box-based counterparts for 3D detection, and 6 AMOTA higher for 3D tracking. Our real-time model runs end-to-end 3D detection and tracking at $30$ FPS with $54.2$ AMOTA and $48.3$ mAP while the best single model achieves $60.3$ mAP for 3D detection and $63.8$ AMOTA for 3D tracking. The code and pretrained models are available at


Learning Object Scale With Click Supervision for Object Detection

Feb 20, 2020
Liao Zhang, Yan Yan, Lin Cheng, Hanzi Wang

Weakly-supervised object detection has recently attracted increasing attention since it only requires image-levelannotations. However, the performance obtained by existingmethods is still far from being satisfactory compared with fully-supervised object detection methods. To achieve a good trade-off between annotation cost and object detection performance,we propose a simple yet effective method which incorporatesCNN visualization with click supervision to generate the pseudoground-truths (i.e., bounding boxes). These pseudo ground-truthscan be used to train a fully-supervised detector. To estimatethe object scale, we firstly adopt a proposal selection algorithmto preserve high-quality proposals, and then generate ClassActivation Maps (CAMs) for these preserved proposals by theproposed CNN visualization algorithm called Spatial AttentionCAM. Finally, we fuse these CAMs together to generate pseudoground-truths and train a fully-supervised object detector withthese ground-truths. Experimental results on the PASCAL VOC2007 and VOC 2012 datasets show that the proposed methodcan obtain much higher accuracy for estimating the object scale,compared with the state-of-the-art image-level based methodsand the center-click based method


Weakly Supervised 3D Object Detection from Point Clouds

Jul 28, 2020
Zengyi Qin, Jinglu Wang, Yan Lu

A crucial task in scene understanding is 3D object detection, which aims to detect and localize the 3D bounding boxes of objects belonging to specific classes. Existing 3D object detectors heavily rely on annotated 3D bounding boxes during training, while these annotations could be expensive to obtain and only accessible in limited scenarios. Weakly supervised learning is a promising approach to reducing the annotation requirement, but existing weakly supervised object detectors are mostly for 2D detection rather than 3D. In this work, we propose VS3D, a framework for weakly supervised 3D object detection from point clouds without using any ground truth 3D bounding box for training. First, we introduce an unsupervised 3D proposal module that generates object proposals by leveraging normalized point cloud densities. Second, we present a cross-modal knowledge distillation strategy, where a convolutional neural network learns to predict the final results from the 3D object proposals by querying a teacher network pretrained on image datasets. Comprehensive experiments on the challenging KITTI dataset demonstrate the superior performance of our VS3D in diverse evaluation settings. The source code and pretrained models are publicly available at

* Accepted by ACM MM 2020 

Evaluation of a Dual Convolutional Neural Network Architecture for Object-wise Anomaly Detection in Cluttered X-ray Security Imagery

Apr 10, 2019
Yona Falinie A. Gaus, Neelanjan Bhowmik, Samet Akçay, Paolo M. Guillen-Garcia, Jack W. Barker, Toby P. Breckon

X-ray baggage security screening is widely used to maintain aviation and transport security. Of particular interest is the focus on automated security X-ray analysis for particular classes of object such as electronics, electrical items, and liquids. However, manual inspection of such items is challenging when dealing with potentially anomalous items. Here we present a dual convolutional neural network (CNN) architecture for automatic anomaly detection within complex security X-ray imagery. We leverage recent advances in region-based (R-CNN), mask-based CNN (Mask R-CNN) and detection architectures such as RetinaNet to provide object localisation variants for specific object classes of interest. Subsequently, leveraging a range of established CNN object and fine-grained category classification approaches we formulate within object anomaly detection as a two-class problem (anomalous or benign). While the best performing object localisation method is able to perform with 97.9% mean average precision (mAP) over a six-class X-ray object detection problem, subsequent two-class anomaly/benign classification is able to achieve 66% performance for within object anomaly detection. Overall, this performance illustrates both the challenge and promise of object-wise anomaly detection within the context of cluttered X-ray security imagery.

* IJCNN 2019 

Window-Object Relationship Guided Representation Learning for Generic Object Detections

Dec 09, 2015
Xingyu Zeng, Wanli Ouyang, Xiaogang Wang

In existing works that learn representation for object detection, the relationship between a candidate window and the ground truth bounding box of an object is simplified by thresholding their overlap. This paper shows information loss in this simplification and picks up the relative location/size information discarded by thresholding. We propose a representation learning pipeline to use the relationship as supervision for improving the learned representation in object detection. Such relationship is not limited to object of the target category, but also includes surrounding objects of other categories. We show that image regions with multiple contexts and multiple rotations are effective in capturing such relationship during the representation learning process and in handling the semantic and visual variation caused by different window-object configurations. Experimental results show that the representation learned by our approach can improve the object detection accuracy by 6.4% in mean average precision (mAP) on ILSVRC2014. On the challenging ILSVRC2014 test dataset, 48.6% mAP is achieved by our single model and it is the best among published results. On PASCAL VOC, it outperforms the state-of-the-art result of Fast RCNN by 3.3% in absolute mAP.

* 9 pages, including 1 reference page, 6 figures