
"Object Detection": models, code, and papers

SyNet: An Ensemble Network for Object Detection in UAV Images

Dec 23, 2020
Berat Mert Albaba, Sedat Ozer

Recent advances in camera-equipped drone applications and their widespread use have increased the demand for vision-based object detection algorithms for aerial images. Object detection is inherently challenging as a generic computer vision problem; moreover, since deploying object detection algorithms on UAVs (drones) is a relatively new area, detecting objects in aerial images remains even more challenging. There are several reasons for this, including: (i) the lack of large drone datasets with high object variance, (ii) the larger orientation and scale variance in drone images compared to ground images, and (iii) the difference in texture and shape features between ground and aerial images. Deep learning based object detection algorithms fall into two main categories: (a) single-stage detectors and (b) multi-stage detectors. Each has advantages and disadvantages over the other; however, a technique that combines the strengths of both could yield an even stronger solution than either individually. In this paper, we propose an ensemble network, SyNet, that combines a multi-stage method with a single-stage one, with the motivation of decreasing the high false negative rate of multi-stage detectors and increasing the quality of single-stage detector proposals. As building blocks, CenterNet and Cascade R-CNN with pretrained feature extractors are utilized along with an ensembling strategy. We report state-of-the-art results on two different datasets: 52.1\% $mAP_{IoU = 0.75}$ on the MS-COCO $val2017$ set and 26.2\% $mAP_{IoU = 0.75}$ on the VisDrone $test$ set.

* accepted for publication at ICPR 2020 
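The core ensembling idea, combining one detector's proposals with another's to suppress duplicates while keeping complementary detections, can be illustrated with a greedy fusion over the union of both detectors' outputs. This is a minimal sketch under stated assumptions: `ensemble_nms`, the `(box, score)` tuple format, and the IoU threshold are illustrative and not the paper's actual ensembling strategy.

```python
def iou(a, b):
    """IoU between two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def ensemble_nms(dets_a, dets_b, iou_thr=0.5):
    """Merge two detectors' outputs by greedy NMS over their union.

    Each detection is a (box, score) pair; higher-scoring boxes
    suppress overlapping lower-scoring ones, so each detector can
    contribute objects the other missed.
    """
    merged = sorted(dets_a + dets_b, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in merged:
        if all(iou(box, k[0]) < iou_thr for k in kept):
            kept.append((box, score))
    return kept
```

A single-stage detector's extra proposals that do not overlap any multi-stage detection survive the merge, which is the intuition behind recovering false negatives.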

Anchor Retouching via Model Interaction for Robust Object Detection in Aerial Images

Dec 13, 2021
Dong Liang, Qixiang Geng, Zongqi Wei, Dmitry A. Vorontsov, Ekaterina L. Kim, Mingqiang Wei, Huiyu Zhou

Object detection has made tremendous strides in computer vision. Small object detection with appearance degradation remains a prominent challenge, especially for aerial observation. To collect sufficient positive/negative samples for heuristic training, most object detectors preset region anchors and compute their Intersection-over-Union (IoU) against the ground-truth data; in this scheme, small objects are frequently abandoned or mislabeled. In this paper, we present an effective Dynamic Enhancement Anchor (DEA) network to construct a novel training sample generator. Unlike other state-of-the-art techniques, the proposed network leverages a sample discriminator to realize interactive sample screening between an anchor-based unit and an anchor-free unit, generating eligible samples. In addition, multi-task joint training with a conservative anchor-based inference scheme enhances the performance of the proposed model while reducing computational complexity. The proposed scheme supports both oriented and horizontal object detection tasks. Extensive experiments on two challenging aerial benchmarks (DOTA and HRSC2016) indicate that our method achieves state-of-the-art accuracy with moderate inference speed and training overhead. On DOTA, our DEA-Net integrated with a RoI-Transformer baseline surpasses the previous advanced method by 0.40% mean Average Precision (mAP) for oriented object detection with a weaker backbone network (ResNet-101 vs. ResNet-152) and by 3.08% mAP for horizontal object detection with the same backbone. Integrated with a ReDet baseline, our DEA-Net achieves state-of-the-art performance at 80.37%. On HRSC2016, it surpasses the previous best model by 1.1% using only 3 horizontal anchors.
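The anchor-vs-ground-truth IoU computation that drives heuristic sample collection, and the reason small objects get dropped, can be sketched as follows. This is a minimal illustration with assumed names (`iou_matrix`) and an assumed 0.5 positive threshold, not the DEA network itself.

```python
import numpy as np

def iou_matrix(anchors, gts):
    """Pairwise IoU between anchor boxes and ground-truth boxes,
    both given as [x1, y1, x2, y2] arrays of shape (N, 4) and (M, 4)."""
    a = anchors[:, None, :]  # (N, 1, 4)
    g = gts[None, :, :]      # (1, M, 4)
    x1 = np.maximum(a[..., 0], g[..., 0])
    y1 = np.maximum(a[..., 1], g[..., 1])
    x2 = np.minimum(a[..., 2], g[..., 2])
    y2 = np.minimum(a[..., 3], g[..., 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_g = (g[..., 2] - g[..., 0]) * (g[..., 3] - g[..., 1])
    return inter / (area_a + area_g - inter)

# A 32x32 preset anchor fully covering a 6x6 small object still has
# IoU = 36/1024 ~ 0.035, far below a typical 0.5 positive threshold,
# so the small object yields no positive training sample.
anchors = np.array([[0, 0, 32, 32]], dtype=float)
gts = np.array([[0, 0, 6, 6]], dtype=float)
print(iou_matrix(anchors, gts))
```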


Detection Recovery in Online Multi-Object Tracking with Sparse Graph Tracker

May 02, 2022
Jeongseok Hyun, Myunggu Kang, Dongyoon Wee, Dit-Yan Yeung

Joint object detection and online multi-object tracking (JDT) methods have recently been proposed to achieve one-shot tracking. Yet existing works overlook the importance of detection itself and often miss detections under occlusion or motion blur. Missed detections hurt not only detection performance but also tracking performance, through inconsistent tracklets. Hence, we propose a new JDT model that recovers missed detections while associating the detection candidates of consecutive frames by learning object-level spatio-temporal consistency through edge features in a Graph Neural Network (GNN). Our proposed model, Sparse Graph Tracker (SGT), converts video data into a graph whose nodes are the top-$K$ scored detection candidates and whose edges are relations between nodes at different times, such as position difference and visual similarity. Two nodes are connected if they are close in either Euclidean or feature space, yielding a sparsely connected graph. Without motion prediction or Re-Identification (ReID), association is performed by predicting an edge score representing the probability that two connected nodes refer to the same object. Under the online setting, our SGT achieves state-of-the-art (SOTA) results on the MOT17/20 Detection and MOT16/20 benchmarks in terms of AP and MOTA, respectively. In particular, SGT surpasses the previous SOTA on the crowded MOT20 dataset, where partial occlusion is dominant, showing the effectiveness of detection recovery against partial occlusion. Code will be released at
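The sparse graph construction, connecting detection candidates of consecutive frames only when they are close, with the position difference as an edge feature, can be sketched as below. A minimal sketch under assumptions: `build_sparse_graph`, the center-point representation, and the distance radius are illustrative; SGT also uses visual similarity and a feature-space criterion.

```python
import numpy as np

def build_sparse_graph(centers_t, centers_t1, radius=50.0):
    """Connect detection centers of frame t and frame t+1 whose
    Euclidean distance is below `radius`.

    Returns (edges, feats): edges as (i, j) index pairs and the
    position-difference vectors as edge features, as a GNN would consume.
    """
    edges, feats = [], []
    for i, p in enumerate(centers_t):
        for j, q in enumerate(centers_t1):
            diff = np.asarray(q, float) - np.asarray(p, float)
            if np.linalg.norm(diff) < radius:
                edges.append((i, j))
                feats.append(diff)
    return edges, feats
```

Because distant pairs are never connected, the edge-score prediction only has to discriminate among plausible local matches, which keeps the graph sparse.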


Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Jan 06, 2016
Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features---using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

* Extended tech report 
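The RPN's dense anchor enumeration, one anchor per (scale, ratio) pair at every feature-map position, can be sketched as follows. This is a minimal sketch assuming the commonly cited 3-scale, 3-ratio configuration and a stride-16 feature map; `generate_anchors` and the exact width/height parameterization are illustrative.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate RPN anchors: at every feature-map cell, one anchor per
    (scale, ratio) pair, centered on the corresponding image location.
    Returns an (N, 4) array of [x1, y1, x2, y2] boxes."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

# A 2x2 feature map yields 2 * 2 * 3 * 3 = 36 anchors.
print(generate_anchors(2, 2).shape)  # (36, 4)
```

The RPN then predicts, for each anchor, four box offsets and an objectness score, so the number of proposals scales with feature-map size times the 9 anchors per position.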

A MultiPath Network for Object Detection

Aug 08, 2016
Sergey Zagoruyko, Adam Lerer, Tsung-Yi Lin, Pedro O. Pinheiro, Sam Gross, Soumith Chintala, Piotr Dollár

The recent COCO object detection dataset presents several new challenges for object detection. In particular, it contains objects at a broad range of scales, less prototypical images, and requires more precise localization. To address these challenges, we test three modifications to the standard Fast R-CNN object detector: (1) skip connections that give the detector access to features at multiple network layers, (2) a foveal structure to exploit object context at multiple object resolutions, and (3) an integral loss function and corresponding network adjustment that improve localization. The result of these modifications is that information can flow along multiple paths in our network, including through features from multiple network layers and from multiple object views. We refer to our modified classifier as a "MultiPath" network. We couple our MultiPath network with DeepMask object proposals, which are well suited for localization and small objects, and adapt our pipeline to predict segmentation masks in addition to bounding boxes. The combined system improves results over the baseline Fast R-CNN detector with Selective Search by 66% overall and by 4x on small objects. It placed second in both the COCO 2015 detection and segmentation challenges.
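The integral loss of modification (3), re-labeling proposals as positive at several IoU thresholds and averaging the resulting losses to reward tighter localization, can be illustrated with a toy sketch. Assumptions: `integral_loss`, the flat per-threshold summation, and the threshold grid are illustrative simplifications of the paper's approximation of an integral over overlap thresholds.

```python
import numpy as np

def integral_loss(proposal_ious, proposal_losses,
                  thresholds=(0.50, 0.55, 0.60, 0.65, 0.70, 0.75)):
    """Toy integral loss: at each IoU threshold, count only the losses of
    proposals labeled positive at that threshold, then average across
    thresholds. Well-localized proposals contribute at more thresholds,
    so the loss favors precise localization."""
    ious = np.asarray(proposal_ious, float)
    losses = np.asarray(proposal_losses, float)
    per_thr = []
    for t in thresholds:
        pos = ious >= t
        per_thr.append(losses[pos].sum() if pos.any() else 0.0)
    return float(np.mean(per_thr))
```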


Learning High-Precision Bounding Box for Rotated Object Detection via Kullback-Leibler Divergence

Jun 04, 2021
Xue Yang, Xiaojiang Yang, Jirui Yang, Qi Ming, Wentao Wang, Qi Tian, Junchi Yan

Existing rotated object detectors are mostly inherited from the horizontal detection paradigm, as the latter has evolved into a well-developed area. However, these detectors struggle to deliver high-precision detection due to the limitations of current regression loss designs, especially for objects with large aspect ratios. Taking the perspective that horizontal detection is a special case of rotated object detection, in this paper we are motivated to change the design of the rotation regression loss from an induction paradigm to a deduction methodology, in terms of the relation between rotated and horizontal detection. We show that one essential challenge is how to modulate the coupled parameters in the rotation regression loss so that the estimated parameters can influence each other during dynamic joint optimization in an adaptive and synergetic way. Specifically, we first convert the rotated bounding box into a 2-D Gaussian distribution, and then calculate the Kullback-Leibler Divergence (KLD) between the Gaussian distributions as the regression loss. By analyzing the gradient of each parameter, we show that KLD (and its derivatives) can dynamically adjust the parameter gradients according to the characteristics of the object; in particular, it adjusts the importance (gradient weight) of the angle parameter according to the aspect ratio. This mechanism is vital for high-precision detection, as a slight angle error causes a serious accuracy drop for objects with large aspect ratios. More importantly, we prove that KLD is scale-invariant. We further show that the KLD loss degenerates into the popular $l_{n}$-norm loss for horizontal detection. Experimental results on seven datasets using different detectors show its consistent superiority, and codes are available at

* 15 pages, 5 figures, 7 tables 
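The two core computations described above, modeling a rotated box as a 2-D Gaussian and taking the KL divergence between two such Gaussians, can be sketched directly. The Gaussian conversion (mean at the box center, covariance $R\,\mathrm{diag}(w^2/4, h^2/4)\,R^\top$) and the closed-form Gaussian KLD are standard; the function names and the exact loss normalization used in the paper are not reproduced here.

```python
import numpy as np

def box_to_gaussian(x, y, w, h, theta):
    """Model a rotated box (center x, y; size w, h; angle theta) as a
    2-D Gaussian: mean = center, covariance = R diag(w^2/4, h^2/4) R^T."""
    mu = np.array([x, y], dtype=float)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    sigma = R @ np.diag([w ** 2 / 4.0, h ** 2 / 4.0]) @ R.T
    return mu, sigma

def kld(mu1, s1, mu2, s2):
    """Closed-form KL divergence KL(N(mu1, s1) || N(mu2, s2)) in 2-D:
    0.5 * (tr(s2^-1 s1) + (mu2-mu1)^T s2^-1 (mu2-mu1) - 2
           + ln(det s2 / det s1))."""
    inv2 = np.linalg.inv(s2)
    d = mu2 - mu1
    return 0.5 * (np.trace(inv2 @ s1) + d @ inv2 @ d - 2
                  + np.log(np.linalg.det(s2) / np.linalg.det(s1)))
```

Note that for a given angle error, the KLD grows with the box's aspect ratio (the covariance becomes more anisotropic), which is the mechanism behind the adaptive angle-gradient weighting the abstract describes.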

MultiResolution Attention Extractor for Small Object Detection

Jun 10, 2020
Fan Zhang, Licheng Jiao, Lingling Li, Fang Liu, Xu Liu

Small objects are difficult to detect because of their low resolution and small size. Existing small object detection methods mainly focus on data preprocessing or on narrowing the differences between large and small objects. Inspired by the "attention" mechanism of human vision, we exploit two feature extraction methods to mine the most useful information from small objects; both are based on multiresolution feature extraction. We initially design and explore a soft attention method, but find that its convergence is slow. We then present a second, attention-based feature interaction method, called the MultiResolution Attention Extractor (MRAE), which shows significant improvement as a generic feature extractor for small object detection. After each building block in the vanilla feature extractor, we append a small network to generate attention weights, followed by a weighted-sum operation to obtain the final attention maps. Our attention-based feature extractor achieves 2.0 times the AP of the "hard" attention counterpart (plain architecture) on the COCO small object detection benchmark, proving that MRAE can capture useful location and contextual information through adaptive learning.

* 11 pages, 5 figures 
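The weighted-sum fusion step, softmax the per-resolution scores produced by the small appended network into attention weights, then blend the feature maps, can be sketched as below. Assumptions: `attention_fuse` is an illustrative name, the per-resolution scores stand in for the small network's output, and the feature maps are taken as already resized to a common shape.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fuse(features, scores):
    """Blend same-shape multiresolution feature maps by a softmax-weighted
    sum: one scalar score per resolution becomes its attention weight."""
    w = softmax(np.asarray(scores, dtype=float))
    return sum(wi * f for wi, f in zip(w, np.asarray(features, dtype=float)))
```

With equal scores the fusion reduces to a plain average; the learned scores let the extractor emphasize whichever resolution carries the small-object signal.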

Comparison of object detection methods for crop damage assessment using deep learning

Dec 31, 2019
Ali HamidiSepehr, Seyed Vahid Mirnezami, James Ward

Severe weather events can cause large financial losses to farmers. Detailed information on the location and severity of damage assists farmers, insurance companies, and disaster response agencies in making wise post-damage decisions. The goal of this study was a proof of concept for detecting damaged crop areas from aerial imagery using computer vision and deep learning techniques. A specific objective was to compare existing object detection algorithms to determine which is best suited for crop damage detection. Two modes of crop damage common in maize (corn) production were simulated: stalk lodging at the lowest ear and stalk lodging at ground level. The simulated damage was used to create training and analysis data sets. An unmanned aerial system (UAS) equipped with an RGB camera was used for image acquisition. Three popular object detectors (Faster R-CNN, YOLOv2, and RetinaNet) were assessed for their ability to detect damaged regions in a field, with average precision used to compare them. YOLOv2 and RetinaNet were able to detect crop damage across multiple late-season growth stages; Faster R-CNN was not as successful as the other two detectors. Detecting crop damage at later growth stages was more difficult for all tested object detectors, and weed pressure in simulated damage plots and increased target density added further complexity.
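The average precision metric used above to compare the detectors can be sketched in its basic ranked form: sort detections by confidence, record precision at each true positive, and average over the ground-truth count. This is a minimal sketch; `average_precision` and the `(score, is_tp)` input format are illustrative, and benchmark implementations typically interpolate the precision-recall curve.

```python
def average_precision(detections, n_gt):
    """Basic ranked AP: `detections` is a list of (score, is_tp) pairs,
    `n_gt` the number of ground-truth objects. Sort by confidence,
    take precision at each true-positive rank, average over n_gt."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp, precisions = 0, []
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / n_gt
```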


Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection

Jul 29, 2021
Yinmin Zhang, Xinzhu Ma, Shuai Yi, Jun Hou, Zhihui Wang, Wanli Ouyang, Dan Xu

As a crucial task in autonomous driving, 3D object detection has made great progress in recent years. However, monocular 3D object detection remains challenging due to unsatisfactory depth estimation. Most existing monocular methods directly regress the scene depth while ignoring important relationships between depth and various geometric elements (e.g., bounding box sizes, 3D object dimensions, and object poses). In this paper, we propose to learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection. Specifically, we devise a principled geometric formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network. We further implement and embed the proposed formula to enable geometry-aware deep representation learning, allowing effective 2D and 3D interactions that boost depth estimation. Moreover, we provide a strong baseline by addressing the substantial misalignment between 2D annotations and projected boxes, ensuring robust learning with the proposed geometric formula. Experiments on the KITTI dataset show that our method remarkably improves the detection performance of the state-of-the-art monocular method without extra data, by 2.80% on the moderate test setting. The model and code will be released at

* 16 pages, 11 figures 
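The pinhole-projection relation underlying geometry-guided depth, an object of physical height $H$ at depth $z$ projects to $h = fH/z$ pixels, so $z = fH/h$, can be sketched directly. This is the standard projective identity, not the paper's full formula; `depth_from_projection` and the numbers below are illustrative.

```python
def depth_from_projection(focal_px, obj_height_m, box_height_px):
    """Recover depth from the pinhole relation h = f * H / z, i.e.
    z = f * H / h, linking 2D box height, 3D object height, and depth.
    Errors in either height estimate propagate directly into depth."""
    return focal_px * obj_height_m / box_height_px

# A 1.5 m-tall object imaged at 60 px with a 720 px focal length:
# z = 720 * 1.5 / 60 = 18 m.
print(depth_from_projection(720.0, 1.5, 60.0))  # 18.0
```

This coupling between 2D box size, 3D dimensions, and depth is exactly the geometric relationship the abstract says direct depth regression ignores.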