"Object Detection": models, code, and papers

MultiResolution Attention Extractor for Small Object Detection

Jun 10, 2020
Fan Zhang, Licheng Jiao, Lingling Li, Fang Liu, Xu Liu

Small objects are difficult to detect because of their low resolution and small size. Existing small object detection methods mainly focus on data preprocessing or on narrowing the differences between large and small objects. Inspired by the "attention" mechanism of human vision, we exploit two feature extraction methods to mine the most useful information about small objects. Both methods are based on multiresolution feature extraction. We first design and explore a soft attention method, but find that it converges slowly. We then present the second method, an attention-based feature interaction method called the MultiResolution Attention Extractor (MRAE), which shows significant improvement as a generic feature extractor in small object detection. After each building block in the vanilla feature extractor, we append a small network to generate attention weights, followed by a weighted-sum operation to get the final attention maps. Our attention-based feature extractor achieves 2.0 times the AP of its "hard" attention counterpart (plain architecture) on the COCO small object detection benchmark, indicating that MRAE can capture useful location and contextual information through adaptive learning.

* 11 pages, 5 figures 
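
The weighted-sum step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the softmax normalization, the function names, and the assumption that every feature map has already been resized and flattened to the same length are all illustrative choices, and the raw scores stand in for the output of the small attention network.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(features, scores):
    """Weighted sum of same-sized, flattened feature maps.

    features: one flattened feature map per building block
              (hypothetical; assumed already at a common resolution).
    scores:   one raw attention score per map, standing in for the
              output of the small attention network (assumption).
    """
    if not features or len(features) != len(scores):
        raise ValueError("need one score per feature map")
    length = len(features[0])
    if any(len(fmap) != length for fmap in features):
        raise ValueError("feature maps must share one size")
    weights = softmax(scores)
    fused = [0.0] * length
    for weight, fmap in zip(weights, features):
        for i, value in enumerate(fmap):
            fused[i] += weight * value
    return fused, weights
```

With equal scores the maps are simply averaged; a much larger score for one map makes the fused result follow that map almost exactly, which is the "soft" selection behavior the abstract contrasts with a plain architecture.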

Comparison of object detection methods for crop damage assessment using deep learning

Dec 31, 2019
Ali HamidiSepehr, Seyed Vahid Mirnezami, James Ward

Severe weather events can cause large financial losses to farmers. Detailed information on the location and severity of damage will assist farmers, insurance companies, and disaster response agencies in making wise post-damage decisions. The goal of this study was a proof-of-concept to detect damaged crop areas from aerial imagery using computer vision and deep learning techniques. A specific objective was to compare existing object detection algorithms to determine which was best suited for crop damage detection. Two modes of crop damage common in maize (corn) production were simulated: stalk lodging at the lowest ear and stalk lodging at ground level. Simulated damage was used to create a training and analysis data set. An unmanned aerial system (UAS) equipped with an RGB camera was used for image acquisition. Three popular object detectors (Faster R-CNN, YOLOv2, and RetinaNet) were assessed for their ability to detect damaged regions in a field. Average precision was used to compare object detectors. YOLOv2 and RetinaNet were able to detect crop damage across multiple late-season growth stages. Faster R-CNN was not as successful as the other two detectors. Detecting crop damage at later growth stages was more difficult for all tested object detectors. Weed pressure in simulated damage plots and increased target density added further complexity.
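
Since the study above compares detectors by average precision, a small sketch of one common way to compute it may help. The abstract does not say which AP variant was used; the version below assumes the all-point interpolated form (area under the precision-recall curve after taking the precision envelope) and assumes each detection has already been matched against ground truth, for example with an IoU threshold. The function name and input layout are hypothetical.

```python
def average_precision(detections, num_ground_truth):
    """All-point interpolated average precision for one class.

    detections:       list of (confidence, is_true_positive) pairs;
                      matching to ground truth is assumed done already.
    num_ground_truth: number of annotated objects of this class.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)

    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas wherever recall increases.
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap
```

A detector that ranks every true positive ahead of every false positive, and finds all objects, scores 1.0; missed objects and early false positives both pull the value down.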


Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection

Jul 29, 2021
Yinmin Zhang, Xinzhu Ma, Shuai Yi, Jun Hou, Zhihui Wang, Wanli Ouyang, Dan Xu

As a crucial task of autonomous driving, 3D object detection has made great progress in recent years. However, monocular 3D object detection remains a challenging problem due to the unsatisfactory performance of depth estimation. Most existing monocular methods directly regress the scene depth while ignoring important relationships between the depth and various geometric elements (e.g. bounding box sizes, 3D object dimensions, and object poses). In this paper, we propose to learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection. Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised. We further implement and embed the proposed formula to enable geometry-aware deep representation learning, allowing effective 2D and 3D interactions for boosting the depth estimation. Moreover, we provide a strong baseline by addressing substantial misalignment between 2D annotations and projected boxes to ensure robust learning with the proposed geometric formula. Experiments on the KITTI dataset show that our method remarkably improves the detection performance of the state-of-the-art monocular-based method without extra data by 2.80% on the moderate test setting. The model and code will be released at

* 16 pages, 11 figures 
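
To illustrate the kind of relationship between depth, 2D box size, and 3D object dimensions that the abstract above refers to, here is the textbook pinhole-camera relation between an object's physical height, its projected height in pixels, and its depth. This is only the standard projective identity, not the paper's formula; the function name and the example numbers are hypothetical.

```python
def depth_from_height(focal_length_px, object_height_m, box_height_px):
    """Pinhole-camera depth estimate: z = f * H / h.

    focal_length_px: camera focal length in pixels.
    object_height_m: physical height H of the object in meters.
    box_height_px:   projected height h of the object in pixels.
    """
    if box_height_px <= 0:
        raise ValueError("box_height_px must be positive")
    return focal_length_px * object_height_m / box_height_px
```

For example, with an assumed focal length of 700 px, a 1.5 m tall object spanning 50 px comes out at 21.0 m, while a one-pixel error in the box height (49 px) shifts the estimate to roughly 21.4 m. That sensitivity hints at why misalignment between 2D annotations and projected boxes matters.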

ClusterNet: Detecting Small Objects in Large Scenes by Exploiting Spatio-Temporal Information

Dec 04, 2017
Rodney LaLonde, Dong Zhang, Mubarak Shah

Object detection in wide area motion imagery (WAMI) has drawn the attention of the computer vision research community for a number of years. WAMI poses a number of unique challenges, including extremely small object sizes, both sparse and densely-packed objects, and extremely large search spaces (large video frames). Nearly all state-of-the-art methods in WAMI object detection report that appearance-based classifiers fail on this challenging data and instead rely almost entirely on motion information in the form of background subtraction or frame-differencing. In this work, we experimentally verify the failure of appearance-based classifiers in WAMI, such as Faster R-CNN and a heatmap-based fully convolutional neural network (CNN), and propose a novel two-stage spatio-temporal CNN which effectively and efficiently combines both appearance and motion information to significantly surpass the state-of-the-art in WAMI object detection. To reduce the large search space, the first stage (ClusterNet) takes in a set of extremely large video frames, combines the motion and appearance information within the convolutional architecture, and proposes regions of objects of interest (ROOBI). These ROOBI can contain anywhere from a single object to clusters of several hundred objects due to the large video frame size and varying object density in WAMI. The second stage (FoveaNet) then estimates the centroid location of all objects in that given ROOBI simultaneously via heatmap estimation. The proposed method exceeds state-of-the-art results on the WPAFB 2009 dataset by 5-16% for moving objects and nearly 50% for stopped objects, and is the first proposed method in wide area motion imagery to detect completely stationary objects.

* Main paper is 8 pages. Supplemental section contains a walk-through of our method (using a qualitative example) and qualitative results for WPAFB 2009 dataset 
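
The frame-differencing baseline mentioned in the abstract above can be sketched in a few lines. This is a generic illustration, not code from the paper: frames are assumed to be equally sized grayscale images stored as nested lists, and the function name and threshold are hypothetical.

```python
def frame_difference(previous_frame, current_frame, threshold):
    """Binary motion mask: 1 where a pixel changed by more than threshold.

    Both frames are assumed to be grayscale images of identical size,
    given as lists of rows of numeric intensities.
    """
    if len(previous_frame) != len(current_frame):
        raise ValueError("frames must have the same number of rows")
    mask = []
    for prev_row, curr_row in zip(previous_frame, current_frame):
        if len(prev_row) != len(curr_row):
            raise ValueError("frames must have the same row length")
        mask.append([
            1 if abs(curr - prev) > threshold else 0
            for prev, curr in zip(prev_row, curr_row)
        ])
    return mask
```

An object that does not move between the two frames produces an all-zero mask, which is why purely motion-based pipelines miss stopped objects and why combining appearance with motion is needed to detect them.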

AdaZoom: Adaptive Zoom Network for Multi-Scale Object Detection in Large Scenes

Jun 19, 2021
Jingtao Xu, Yali Li, Shengjin Wang

Detection in large-scale scenes is a challenging problem due to small objects and extreme scale variation. It is essential to focus on the image regions of small objects. In this paper, we propose a novel Adaptive Zoom (AdaZoom) network as a selective magnifier with flexible shape and focal length to adaptively zoom the focus regions for object detection. Based on policy gradient, we construct a reinforcement learning framework for focus region generation, with the reward formulated by object distributions. The scales and aspect ratios of the generated regions are adaptive to the scales and distribution of objects inside. We apply variable magnification according to the scale of the region for adaptive multi-scale detection. We further propose collaborative training to complementarily promote the performance of AdaZoom and the detection network. To validate the effectiveness, we conduct extensive experiments on the VisDrone2019, UAVDT, and DOTA datasets. The experiments show AdaZoom brings a consistent and significant improvement over different detection networks, achieving state-of-the-art performance on these datasets and, in particular, outperforming existing methods by 4.64% AP on VisDrone2019.


RBGNet: Ray-based Grouping for 3D Object Detection

Apr 05, 2022
Haiyang Wang, Shaoshuai Shi, Ze Yang, Rongyao Fang, Qi Qian, Hongsheng Li, Bernt Schiele, Liwei Wang

As a fundamental problem in computer vision, 3D object detection is experiencing rapid growth. To extract point-wise features from irregularly and sparsely distributed points, previous methods usually use a feature grouping module to aggregate the point features into an object candidate. However, these methods have not yet leveraged the surface geometry of foreground objects to enhance grouping and 3D box generation. In this paper, we propose the RBGNet framework, a voting-based 3D detector for accurate 3D object detection from point clouds. In order to learn better representations of object shape to enhance cluster features for predicting 3D boxes, we propose a ray-based feature grouping module, which aggregates the point-wise features on object surfaces using a group of determined rays uniformly emitted from cluster centers. Considering the fact that foreground points are more meaningful for box estimation, we design a novel foreground biased sampling strategy in the downsampling process to sample more points on object surfaces and further boost the detection performance. Our model achieves state-of-the-art 3D detection performance on ScanNet V2 and SUN RGB-D with remarkable performance gains. Code will be available at


Neighbor-Vote: Improving Monocular 3D Object Detection through Neighbor Distance Voting

Jul 06, 2021
Xiaomeng Chu, Jiajun Deng, Yao Li, Zhenxun Yuan, Yanyong Zhang, Jianmin Ji, Yu Zhang

As cameras are increasingly deployed in new application domains such as autonomous driving, performing 3D object detection on monocular images becomes an important task for visual scene understanding. Recent advances in monocular 3D object detection mainly rely on "pseudo-LiDAR" generation, which performs monocular depth estimation and lifts the 2D pixels to pseudo 3D points. However, depth estimation from monocular images, due to its poor accuracy, leads to an inevitable position shift of pseudo-LiDAR points within the object. Therefore, the predicted bounding boxes may suffer from inaccurate location and deformed shape. In this paper, we present a novel neighbor-voting method that incorporates neighbor predictions to ameliorate object detection from severely deformed pseudo-LiDAR point clouds. Specifically, each feature point around the object forms its own prediction, and then the "consensus" is achieved through voting. In this way, we can effectively combine the neighbors' predictions with the local prediction and achieve more accurate 3D detection. To further enlarge the difference between the foreground region of interest (ROI) pseudo-LiDAR points and the background points, we also encode the ROI prediction scores of 2D foreground pixels into the corresponding pseudo-LiDAR points. We conduct extensive experiments on the KITTI benchmark to validate the merits of our proposed method. Our results on bird's eye view detection outperform the state-of-the-art performance by a large margin, especially for "hard" level detection.

* Accepted by ACM Multimedia 2021 

Development of An Android Application for Object Detection Based on Color, Shape, or Local Features

Mar 10, 2017
Lamiaa A. Elrefaei, Mona Omar Al-musawa, Norah Abdullah Al-gohany

Object detection and recognition is an important task in many computer vision applications. In this paper, an Android application was developed using the Eclipse IDE and the OpenCV3 library. The application is able to detect objects in an image loaded from the mobile gallery, based on their color, shape, or local features. The image is processed in the HSV color domain for better color detection. Circular shapes are detected using the Circular Hough Transform, and other shapes are detected using the Douglas-Peucker algorithm. BRISK (binary robust invariant scalable keypoints) local features were applied in the developed Android application for matching an object image in another scene image. The steps of the proposed detection algorithms are described, and the interfaces of the application are illustrated. The application was ported to and tested on Galaxy S3, S6, and Note1 smartphones. Based on the experimental results, the application is capable of detecting eleven different colors, detecting two-dimensional geometrical shapes including circles, rectangles, triangles, and squares, and correctly matching local features of object and scene images under different conditions. The application could be used as a standalone application, or as part of another application such as robot systems, traffic systems, e-learning applications, information retrieval, and many others.

* The International Journal of Multimedia & Its Applications (IJMA) Vol.9, No.1, February 2017 
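
The Douglas-Peucker algorithm named in the abstract above simplifies a polyline by keeping only the points that deviate from a straight segment by more than a tolerance. The sketch below is a plain-Python version of the classic recursive algorithm, not the application's code (the paper uses OpenCV); the function names and the tolerance value are illustrative.

```python
import math


def _point_line_distance(point, start, end):
    """Perpendicular distance from point to the line through start and end."""
    (px, py), (ax, ay), (bx, by) = point, start, end
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:  # degenerate segment: fall back to point distance
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def douglas_peucker(points, epsilon):
    """Simplify an open polyline, keeping points farther than epsilon."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    max_distance, index = 0.0, 0
    for i in range(1, len(points) - 1):
        distance = _point_line_distance(points[i], first, last)
        if distance > max_distance:
            max_distance, index = distance, i
    if max_distance <= epsilon:
        return [first, last]  # everything in between is close enough
    # Keep the farthest point and simplify both halves recursively.
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right
```

A plausible use, stated here as an assumption rather than a description of the application, is to simplify a detected contour and classify the shape by the number of remaining vertices (three for a triangle, four for a rectangle or square).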

Light-Weight RetinaNet for Object Detection

May 24, 2019
Yixing Li, Fengbo Ren

Object detection has made great progress driven by the development of deep learning. Compared with classification, a widely studied task, object detection generally needs one or two orders of magnitude more FLOPs (floating point operations) at inference. To enable practical applications, it is essential to explore effective trade-offs between runtime and accuracy. Recently, a growing number of studies target object detection on resource-constrained devices, such as YOLOv1, YOLOv2, SSD, and MobileNetv2-SSDLite, which yield an mAP of around 22-25% on COCO test-dev (mAP-20-tier). In contrast, very few studies discuss the computation and accuracy trade-off for mAP-30-tier detection networks. In this paper, we explain why RetinaNet gives an effective computation and accuracy trade-off for object detection and how to build a light-weight RetinaNet. We propose to reduce FLOPs only in computationally intensive layers and keep the other layers the same. Compared with the most common approach to the FLOPs-accuracy trade-off, input image scaling, the proposed solution shows a consistently better FLOPs-mAP trade-off line. Quantitatively, the proposed method results in a 0.1% mAP improvement at 1.15x FLOPs reduction and a 0.3% mAP improvement at 1.8x FLOPs reduction.
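
To make the FLOPs discussion above concrete, the following sketch counts multiply-accumulate operations (MACs) for a single convolutional layer and compares two ways of cutting cost: scaling the input image, which shrinks every layer, and narrowing one heavy layer. The layer sizes are made up for illustration and are not taken from RetinaNet or the paper; bias terms are ignored, and papers differ on whether one MAC counts as one or two FLOPs.

```python
def conv2d_macs(out_height, out_width, in_channels, out_channels, kernel_size):
    """Multiply-accumulates for one standard 2D convolution (no bias)."""
    return (out_height * out_width
            * in_channels * out_channels
            * kernel_size * kernel_size)


# Hypothetical 3x3 layer with 256 input and output channels on a 64x64 map.
baseline = conv2d_macs(64, 64, 256, 256, 3)

# Halving the input image side halves each output dimension: 4x fewer MACs.
scaled_input = conv2d_macs(32, 32, 256, 256, 3)

# Halving only this layer's output channels: 2x fewer MACs in this layer.
narrower_layer = conv2d_macs(64, 64, 256, 128, 3)

print(baseline // scaled_input, baseline // narrower_layer)
```

Input scaling reduces cost quadratically everywhere, including layers that were already cheap, whereas a per-layer change can be aimed at the most expensive layers only, which is the kind of selective reduction the abstract argues for.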


Relation Distillation Networks for Video Object Detection

Aug 26, 2019
Jiajun Deng, Yingwei Pan, Ting Yao, Wengang Zhou, Houqiang Li, Tao Mei

It has been well recognized that modeling object-to-object relations would be helpful for object detection. Nevertheless, the problem is not trivial, especially when exploring the interactions between objects to boost video object detectors. The difficulty arises because reliable object relations in a video should depend not only on the objects in the present frame but also on all the supportive objects extracted over a long-range span of the video. In this paper, we introduce a new design to capture the interactions across objects in spatio-temporal context. Specifically, we present Relation Distillation Networks (RDN), a new architecture that novelly aggregates and propagates object relations to augment object features for detection. Technically, object proposals are first generated via Region Proposal Networks (RPN). RDN then, on one hand, models object relations via multi-stage reasoning, and on the other, progressively distills relations by refining supportive object proposals with high objectness scores in a cascaded manner. The learnt relations prove effective both for improving object detection in each frame and for linking boxes across frames. Extensive experiments are conducted on the ImageNet VID dataset, and superior results are reported when compared to state-of-the-art methods. More remarkably, our RDN achieves 81.8% and 83.2% mAP with ResNet-101 and ResNeXt-101, respectively. When further equipped with linking and rescoring, we obtain the best mAP reported to date, 83.8% and 84.7%.

* ICCV 2019