
"Object Detection": models, code, and papers

TJU-DHD: A Diverse High-Resolution Dataset for Object Detection

Nov 18, 2020
Yanwei Pang, Jiale Cao, Yazhao Li, Jin Xie, Hanqing Sun, Jinfeng Gong

Vehicles, pedestrians, and riders are the most important and interesting objects for the perception modules of self-driving vehicles and video surveillance. However, the state-of-the-art performance in detecting such important objects (especially small objects) falls far short of the demands of practical systems. Large-scale, diverse, high-resolution datasets play an important role in developing better object detection methods to meet this demand. Existing public large-scale datasets such as MS COCO, collected from websites, do not focus on these specific scenarios. Moreover, popular datasets collected from specific scenarios (e.g., KITTI and CityPersons) are limited in the number of images and instances, in resolution, and in diversity. To address this problem, we build a diverse high-resolution dataset called TJU-DHD. The dataset contains 115,354 high-resolution images (52% with a resolution of 1,624$\times$1,200 pixels and 48% with a resolution of at least 2,560$\times$1,440 pixels) and 709,330 labeled objects in total, with large variance in scale and appearance. The dataset is also richly diverse in season, illumination, and weather. In addition, we build a new diverse pedestrian dataset. Using four different detectors (the one-stage RetinaNet, the anchor-free FCOS, the two-stage FPN, and Cascade R-CNN), we conduct experiments on object detection and pedestrian detection. We hope that the newly built dataset can help promote research on object detection and pedestrian detection in these two scenes. The dataset is available at

* IEEE Transactions on Image Processing, 2020 
* Keywords: object detection and pedestrian detection. Website:

UC-OWOD: Unknown-Classified Open World Object Detection

Jul 23, 2022
Zhiheng Wu, Yue Lu, Xingyu Chen, Zhengxing Wu, Liwen Kang, Junzhi Yu

Open World Object Detection (OWOD) is a challenging computer vision problem that requires detecting unknown objects and gradually learning the identified unknown classes. However, standard OWOD cannot distinguish unknown instances belonging to multiple unknown classes. In this work, we propose a novel OWOD problem called Unknown-Classified Open World Object Detection (UC-OWOD), which aims to detect unknown instances and classify them into different unknown classes. We formulate the problem and devise a two-stage object detector to solve UC-OWOD. First, an unknown label-aware proposal module and an unknown-discriminative classification head are used to detect known and unknown objects. Then, similarity-based unknown classification and unknown clustering refinement modules are constructed to distinguish multiple unknown classes. Moreover, two novel evaluation protocols are designed to assess unknown-class detection. Extensive experiments and visualizations demonstrate the effectiveness of the proposed method. Code is available at

* Accepted to ECCV 2022 
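The similarity-based unknown classification module is not specified in the abstract. As a purely illustrative sketch, unknown-instance features could be grouped greedily by cosine similarity; the threshold `tau`, the greedy assignment, and the fixed (non-updated) cluster seeds below are all assumptions, not the paper's actual method:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_unknowns(feats, tau=0.8):
    """Greedily group unknown-instance features: join the most similar
    existing cluster seed if similarity >= tau, else start a new cluster."""
    seeds, labels = [], []
    for f in feats:
        best, best_sim = -1, tau
        for j, c in enumerate(seeds):
            s = cosine(f, c)
            if s >= best_sim:
                best, best_sim = j, s
        if best < 0:
            seeds.append(list(f))
            labels.append(len(seeds) - 1)
        else:
            labels.append(best)
    return labels
```

A real system would refine these clusters (as the paper's clustering refinement module presumably does); this only illustrates the grouping idea.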

Evaluation of YOLO Models with Sliced Inference for Small Object Detection

Mar 09, 2022
Muhammed Can Keles, Batuhan Salmanoglu, Mehmet Serdar Guzel, Baran Gursoy, Gazi Erkan Bostanci

Small object detection has major applications in UAVs, surveillance, farming, and many other fields. In this work, we investigate the performance of state-of-the-art YOLO-based object detection models on the task of small object detection, as they are among the most popular and easiest-to-use object detection models. We evaluate the YOLOv5 and YOLOX models, and we also investigate the effects of slicing-aided inference and of fine-tuning the model for slicing-aided inference. We use the VisDrone2019Det dataset for training and evaluation; this dataset is challenging in that most objects are relatively small compared to the image sizes. This work aims to benchmark the YOLOv5 and YOLOX models for small object detection. We find that sliced inference increases the AP50 score in all experiments, and that this effect is greater for the YOLOv5 models than for the YOLOX models. Sliced fine-tuning and sliced inference combined produce substantial improvements for all models. The highest AP50 score, 48.8, was achieved by the YOLOv5-Large model on the VisDrone2019Det test-dev subset.

* 6 pages, 5 figures 
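The slicing-aided inference the paper evaluates can be sketched as: tile the image with overlap, run the detector on each tile, and shift the detections back into full-image coordinates. The tile size, overlap ratio, the `image` dict, and the `detector` callable below are placeholders, not the paper's actual implementation (which typically ends with NMS to merge duplicate boxes):

```python
def make_tiles(img_w, img_h, tile=640, overlap=0.2):
    """Offsets (x0, y0) of overlapping tiles covering the image."""
    step = max(1, int(tile * (1 - overlap)))
    def starts(size):
        s = list(range(0, max(size - tile, 0) + 1, step))
        if s[-1] + tile < size:   # ensure the last tile reaches the edge
            s.append(size - tile)
        return s
    return [(x, y) for y in starts(img_h) for x in starts(img_w)]

def sliced_detect(image, detector, tile=640, overlap=0.2):
    """Run `detector` per tile; map boxes back to full-image coordinates."""
    boxes = []
    for x0, y0 in make_tiles(image["w"], image["h"], tile, overlap):
        for (x1, y1, x2, y2, score, cls) in detector(image, x0, y0, tile):
            boxes.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, score, cls))
    return boxes  # in practice, follow with NMS to merge duplicates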

Dual-Cross-Polarized GPR Measurement Method for Detection and Orientation Estimation of Shallowly Buried Elongated Object

May 17, 2022
Hai-Han Sun, Yee Hui Lee, Wenhao Luo, Lai Fern Ow, Mohamed Lokman Mohd Yusof, Abdulkadir C. Yucel

Detecting a shallowly buried, elongated object and estimating its orientation with a commonly adopted co-polarized GPR system is challenging due to strong ground clutter that masks the target reflection. A cross-polarized configuration can be used to suppress ground clutter and reveal the object reflection, but it suffers from inconsistent detection capability that varies significantly with object orientation. To address this issue, we propose a dual-cross-polarized detection (DCPD) method that utilizes two cross-polarized antennas in a special arrangement to detect the object. The signals reflected by the object and collected by the two antennas are combined in a rotationally invariant manner to ensure both effective ground clutter suppression and consistent detection irrespective of the object orientation. In addition, we present a dual-cross-polarized orientation estimation (DCPOE) algorithm that estimates the object orientation from the two cross-polarized channels. The proposed DCPOE algorithm is less affected by environmental noise and performs robust and accurate azimuth angle estimation. The effectiveness of the proposed techniques in detection and orientation estimation, and their advantages over the existing method, are demonstrated using experimental data. In the demonstrated shallowly buried object cases, the maximum and average errors are 22.3° and 10.9° for the Alford rotation algorithm, versus 4.9° and 1.8° for the proposed DCPOE algorithm. The proposed techniques can be unified in a framework to facilitate the investigation and mapping of shallowly buried, elongated targets.
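The abstract does not give the combination formula. One common way such dual-channel schemes achieve rotation invariance (an assumption here, not the paper's exact formulation) is that the two rotated cross-polarized channels respond roughly as sin(2θ) and cos(2θ) of the target azimuth θ, so their root-sum-square is orientation-independent and θ can be recovered from their ratio:

```python
import math

def combined_response(s1, s2):
    # Root-sum-square of the two channels: under the assumed model
    # s1 = A*sin(2*theta), s2 = A*cos(2*theta), this equals A for any theta.
    return math.hypot(s1, s2)

def estimate_azimuth_deg(s1, s2):
    # Invert the assumed channel model; the estimate is ambiguous modulo 180 deg.
    return (0.5 * math.degrees(math.atan2(s1, s2))) % 180.0
```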


Human-Object Interaction Detection: A Quick Survey and Examination of Methods

Sep 27, 2020
Trevor Bergstrom, Humphrey Shi

Human-object interaction detection is a relatively new task in the world of computer vision and visual semantic information extraction. With the goal of machines identifying the interactions that humans perform on objects, the research in this field has many real-world use cases. To our knowledge, this is the first general survey of the state-of-the-art and milestone works in this field. We provide a basic survey of developments in human-object interaction detection. Many works in this field use multi-stream convolutional neural network architectures, which combine features from multiple sources in the input image: most commonly the human and object in question, as well as the spatial relation between the two. As far as we are aware, there have been no in-depth studies of the performance of each component individually. To provide insight for future researchers, we perform an individualized study that examines the performance of each component of a multi-stream convolutional neural network architecture for human-object interaction detection. Specifically, we examine the HORCNN architecture, a foundational work in the field. In addition, we provide an in-depth look at the HICO-DET dataset, a popular benchmark for human-object interaction detection. Code and papers can be found at

* Published at The 1st International Workshop On Human-Centric Multimedia Analysis, at ACM Multimedia Conference 2020 
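The multi-stream idea the survey examines can be sketched in one line: each stream scores every interaction class from its own input (human crop, object crop, spatial map), and the per-class scores are fused. Summation is shown below as one common late-fusion choice; whether HORCNN uses exactly this fusion is an assumption of this sketch:

```python
def fuse_streams(human_scores, object_scores, spatial_scores):
    """Late fusion of per-class scores from three parallel streams.
    Each argument is a list of scores, one per interaction class."""
    return [h + o + s for h, o, s in
            zip(human_scores, object_scores, spatial_scores)]
```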

Full Object Boundary Detection by Applying Scale Invariant Features in a Region Merging Segmentation Algorithm

Oct 26, 2012
Reza Oji, Farshad Tajeripour

Object detection is a fundamental task in computer vision with many applications in image processing. This paper proposes a new approach to object detection that applies the scale-invariant feature transform (SIFT) within an automatic segmentation algorithm. SIFT is invariant with respect to scale, translation, and rotation. Its features are highly distinctive and provide stable keypoints that can be used to match an object across different images. First, an object is trained from different aspects to find the best keypoints. The object can then be recognized in other images using the obtained keypoints. Next, a robust segmentation algorithm is used to detect the full boundary of the object based on the SIFT keypoints. In the segmentation algorithm, a merging rule is defined to merge regions in the image with the assistance of the keypoints. The results show that the proposed approach is reliable for object detection and extracts object boundaries well.

* International Journal of Artificial Intelligence & Applications (IJAIA) (2012) Volume 3, Number 5, pp: 41-50 
* 10 pages - 7 figures 
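The abstract does not spell out the merging rule. A hypothetical keypoint-assisted rule, for illustration only: segments containing a matched SIFT keypoint seed the object, and a neighbouring segment is merged in when it touches at least two seeds (the seeding and the two-neighbour criterion are assumptions of this sketch, not the paper's rule):

```python
def merge_object_regions(regions, keypoints, adjacency):
    """regions: {region_id: set of (x, y) pixels};
    keypoints: matched SIFT keypoint locations;
    adjacency: {region_id: set of neighbouring region ids}."""
    # Segments that contain a matched keypoint seed the object.
    seeds = {rid for rid, px in regions.items()
             if any(kp in px for kp in keypoints)}
    # Hypothetical merging rule: absorb a neighbour touching >= 2 seeds.
    merged = set(seeds)
    for rid, neigh in adjacency.items():
        if rid not in merged and len(neigh & seeds) >= 2:
            merged.add(rid)
    pixels = set().union(*(regions[r] for r in merged)) if merged else set()
    return merged, pixels
```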

Detect or Track: Towards Cost-Effective Video Object Detection/Tracking

Nov 13, 2018
Hao Luo, Wenxuan Xie, Xinggang Wang, Wenjun Zeng

State-of-the-art object detectors and trackers are developing fast. Trackers are in general more efficient than detectors but bear the risk of drifting. This raises a question: how can we improve the accuracy of video object detection/tracking by utilizing existing detectors and trackers within a given time budget? A baseline is frame skipping: detect on every N-th frame and track for the frames in between. This baseline, however, is suboptimal, since the detection frequency should depend on the tracking quality. To this end, we propose a scheduler network, which determines whether to detect or track at a given frame, as a generalization of Siamese trackers. Although lightweight and simple in structure, the scheduler network is more effective than the frame-skipping baselines and flow-based approaches, as validated on the ImageNet VID dataset for video object detection/tracking.

* Accepted to AAAI 2019 
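The frame-skipping baseline the abstract describes can be sketched directly; `detect` and `track` below are placeholder callables standing in for a real detector and tracker:

```python
def frame_skipping(frames, detect, track, n=10):
    """Baseline: run the (expensive) detector on every N-th frame and the
    (cheap) tracker on the frames in between."""
    results, state = [], None
    for i, frame in enumerate(frames):
        state = detect(frame) if i % n == 0 else track(frame, state)
        results.append(state)
    return results
```

The paper's scheduler network replaces the fixed `i % n == 0` rule with a learned decision that depends on the current tracking quality.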

VIN: Voxel-based Implicit Network for Joint 3D Object Detection and Segmentation for Lidars

Jul 07, 2021
Yuanxin Zhong, Minghan Zhu, Huei Peng

This paper presents a unified neural network structure for joint 3D object detection and point cloud segmentation. We leverage rich supervision from both detection and segmentation labels rather than using just one of them. In addition, we propose an extension to single-stage object detectors based on the implicit function widely used in 3D scene and object understanding. The extension branch takes the final feature map from the object detection module as input and produces an implicit function that generates a semantic distribution for each point around its corresponding voxel center. We demonstrate the performance of our structure on nuScenes-lidarseg, a large-scale outdoor dataset. Our solution achieves competitive results against state-of-the-art methods in both 3D object detection and point cloud segmentation, with little additional computational load compared with object-detection-only solutions. The capability of the proposed method for efficient weakly supervised semantic segmentation is also validated by experiments.
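The mechanism of such an implicit head can be illustrated minimally: the voxel's detection feature is concatenated with a query point's offset from the voxel center and mapped to per-class logits. The single linear layer below is a stand-in for the paper's (unspecified) implicit function network:

```python
def implicit_semantics(voxel_feat, point_offset, weights, bias):
    """Toy implicit-function head: concatenate a voxel feature with a
    point's offset from the voxel center, apply one linear layer to get
    per-class semantic logits. `weights` is a list of rows, one per class."""
    x = list(voxel_feat) + list(point_offset)
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]
```

Because the same weights are queried at every point offset, one voxel feature yields a continuous semantic field over the voxel, which is the appeal of the implicit formulation.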


SaLite: A Light-Weight Model for Salient Object Detection

Dec 08, 2019
Kitty Varghese, Sauradip Nag

Salient object detection is a prevalent computer vision task with applications ranging from abnormality detection to abnormality processing. Context modelling is an important criterion in saliency detection: a global context helps determine the salient object in an image by contrasting it against the other objects in the global view of the scene, while local context features detect the boundaries of the salient object with higher accuracy within a given region. To incorporate the best of both worlds, our proposed SaLite model uses both global and local contextual features. It is an encoder-decoder architecture in which the encoder uses a lightweight SqueezeNet and the decoder is modelled with convolution layers. Modern deep models for saliency detection use a large number of parameters, which makes them difficult to deploy on embedded systems. This paper attempts to solve this problem with SaLite, a lighter model for salient object detection that does not compromise on performance. Our approach is extensively evaluated on three publicly available datasets, namely DUTS, MSRA10K, and SOC. Experimental results show that our proposed SaLite achieves significant and consistent improvements over state-of-the-art methods.

* This was submitted to NCVPRIPG 2019 

Fast Efficient Object Detection Using Selective Attention

Nov 19, 2018
Shivanthan Yohanandan, Andy Song, Adrian G. Dyer, Angela Faragasso, Subhrajit Roy, Dacheng Tao

Deep learning object detectors achieve state-of-the-art accuracy at the expense of high computational overhead, impeding their use on embedded systems such as drones. A primary source of this overhead is the exhaustive classification of typically 10^4-10^5 regions per image. Given that most of these regions contain uninformative background, such detector designs are extremely superfluous and inefficient. In contrast, biological vision systems leverage selective attention for fast and efficient object detection. Recent neuroscientific findings that shed new light on the mechanism behind selective attention allowed us to formulate a new hypothesis of object detection efficiency and subsequently introduce a new object detection paradigm. To that end, we leverage this knowledge to design a novel region proposal network and empirically show that it achieves high object detection performance on the COCO dataset. Moreover, the model uses two to three orders of magnitude fewer computations than state-of-the-art models and consequently achieves inference speeds exceeding 500 frames/s, making it possible to achieve object detection on embedded systems.

* 10 pages, 9 figures
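The core attention idea, classify only the regions a cheap low-resolution saliency pass flags instead of all 10^4-10^5 candidates, can be sketched as follows. The grid-cell proposals, the threshold, and the cell size are placeholders, not the paper's actual region proposal network:

```python
def salient_proposals(saliency, thresh=0.5, cell=32):
    """saliency: a coarse low-resolution saliency map (rows of floats).
    Only grid cells above `thresh` become box proposals, so the expensive
    classifier never sees the uninformative background cells."""
    proposals = []
    for gy, row in enumerate(saliency):
        for gx, s in enumerate(row):
            if s > thresh:
                proposals.append((gx * cell, gy * cell,
                                  (gx + 1) * cell, (gy + 1) * cell, s))
    return proposals
```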