Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Object Detection": models, code, and papers

DOPS: Learning to Detect 3D Objects and Predict their 3D Shapes

Apr 02, 2020
Mahyar Najibi, Guangda Lai, Abhijit Kundu, Zhichao Lu, Vivek Rathod, Tom Funkhouser, Caroline Pantofaru, David Ross, Larry S. Davis, Alireza Fathi

We propose DOPS, a fast single-stage 3D object detection method for LIDAR data. Previous methods often make domain-specific design decisions, for example projecting points into a bird-eye view image in autonomous driving scenarios. In contrast, we propose a general-purpose method that works on both indoor and outdoor scenes. The core novelty of our method is a fast, single-pass architecture that both detects objects in 3D and estimates their shapes. 3D bounding box parameters are estimated in one pass for every point, aggregated through graph convolutions, and fed into a branch of the network that predicts latent codes representing the shape of each detected object. The latent shape space and shape decoder are learned on a synthetic dataset and then used as supervision for the end-to-end training of the 3D object detection pipeline. Thus our model is able to extract shapes without access to ground-truth shape information in the target dataset. During experiments, we find that our proposed method achieves state-of-the-art results by ~5% on object detection in ScanNet scenes, and it gets top results by 3.4% in the Waymo Open Dataset, while reproducing the shapes of detected cars.

* To appear in CVPR 2020 
Access Paper or Ask Questions

Neural Architecture Adaptation for Object Detection by Searching Channel Dimensions and Mapping Pre-trained Parameters

Jun 17, 2022
Harim Jung, Myeong-Seok Oh, Cheoljong Yang, Seong-Whan Lee

Most object detection frameworks use backbone architectures originally designed for image classification, conventionally with pre-trained parameters on ImageNet. However, image classification and object detection are essentially different tasks and there is no guarantee that the optimal backbone for classification is also optimal for object detection. Recent neural architecture search (NAS) research has demonstrated that automatically designing a backbone specifically for object detection helps improve the overall accuracy. In this paper, we introduce a neural architecture adaptation method that can optimize the given backbone for detection purposes, while still allowing the use of pre-trained parameters. We propose to adapt both the micro- and macro-architecture by searching for specific operations and the number of layers, in addition to the output channel dimensions of each block. It is important to find the optimal channel depth, as it greatly affects the feature representation capability and computation cost. We conduct experiments with our searched backbone for object detection and demonstrate that our backbone outperforms both manually designed and searched state-of-the-art backbones on the COCO dataset.

* Accepted to ICPR 2022 
Access Paper or Ask Questions

3D Object Detection on Point Clouds using Local Ground-aware and Adaptive Representation of scenes' surface

Feb 02, 2020
Arun CS Kumar, Disha Ahuja, Ashwath Aithal

A novel, adaptive ground-aware, and cost-effective 3D Object Detection pipeline is proposed. The ground surface representation introduced in this paper, in comparison to its uni-planar counterparts (methods that model the surface of a whole 3D scene using single plane), is far more accurate while being ~10x faster. The novelty of the ground representation lies both in the way in which the ground surface of the scene is represented in Lidar perception problems, as well as in the (cost-efficient) way in which it is computed. Furthermore, the proposed object detection pipeline builds on the traditional two-stage object detection models by incorporating the ability to dynamically reason the surface of the scene, ultimately achieving a new state-of-the-art 3D object detection performance among the two-stage Lidar Object Detection pipelines.

Access Paper or Ask Questions

Hashmod: A Hashing Method for Scalable 3D Object Detection

Jul 20, 2016
Wadim Kehl, Federico Tombari, Nassir Navab, Slobodan Ilic, Vincent Lepetit

We present a scalable method for detecting objects and estimating their 3D poses in RGB-D data. To this end, we rely on an efficient representation of object views and employ hashing techniques to match these views against the input frame in a scalable way. While a similar approach already exists for 2D detection, we show how to extend it to estimate the 3D pose of the detected objects. In particular, we explore different hashing strategies and identify the one which is more suitable to our problem. We show empirically that the complexity of our method is sublinear with the number of objects and we enable detection and pose estimation of many 3D objects with high accuracy while outperforming the state-of-the-art in terms of runtime.

* BMVC 2015 
Access Paper or Ask Questions

CenterNet++ for Object Detection

Apr 18, 2022
Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, Qi Tian

There are two mainstreams for object detection: top-down and bottom-up. The state-of-the-art approaches mostly belong to the first category. In this paper, we demonstrate that the bottom-up approaches are as competitive as the top-down and enjoy higher recall. Our approach, named CenterNet, detects each object as a triplet keypoints (top-left and bottom-right corners and the center keypoint). We firstly group the corners by some designed cues and further confirm the objects by the center keypoints. The corner keypoints equip the approach with the ability to detect objects of various scales and shapes and the center keypoint avoids the confusion brought by a large number of false-positive proposals. Our approach is a kind of anchor-free detector because it does not need to define explicit anchor boxes. We adapt our approach to the backbones with different structures, i.e., the 'hourglass' like networks and the the 'pyramid' like networks, which detect objects on a single-resolution feature map and multi-resolution feature maps, respectively. On the MS-COCO dataset, CenterNet with Res2Net-101 and Swin-Transformer achieves APs of 53.7% and 57.1%, respectively, outperforming all existing bottom-up detectors and achieving state-of-the-art. We also design a real-time CenterNet, which achieves a good trade-off between accuracy and speed with an AP of 43.6% at 30.5 FPS.

* 11 pages, 9 figures, 8 tables. arXiv admin note: substantial text overlap with arXiv:1904.08189 
Access Paper or Ask Questions

Detecting and Tracking Small and Dense Moving Objects in Satellite Videos: A Benchmark

Nov 25, 2021
Qian Yin, Qingyong Hu, Hao Liu, Feng Zhang, Yingqian Wang, Zaiping Lin, Wei An, Yulan Guo

Satellite video cameras can provide continuous observation for a large-scale area, which is important for many remote sensing applications. However, achieving moving object detection and tracking in satellite videos remains challenging due to the insufficient appearance information of objects and lack of high-quality datasets. In this paper, we first build a large-scale satellite video dataset with rich annotations for the task of moving object detection and tracking. This dataset is collected by the Jilin-1 satellite constellation and composed of 47 high-quality videos with 1,646,038 instances of interest for object detection and 3,711 trajectories for object tracking. We then introduce a motion modeling baseline to improve the detection rate and reduce false alarms based on accumulative multi-frame differencing and robust matrix completion. Finally, we establish the first public benchmark for moving object detection and tracking in satellite videos, and extensively evaluate the performance of several representative approaches on our dataset. Comprehensive experimental analyses and insightful conclusions are also provided. The dataset is available at

* This paper has been accepted by IEEE Transactions on Geoscience and Remote Sensing. Qian Yin and Qingyong Hu have equal contributions to this work and are co-first authors. The dataset is available at 
Access Paper or Ask Questions

Reliable Multi-Object Tracking in the Presence of Unreliable Detections

Dec 15, 2021
Travis Mandel, Mark Jimenez, Emily Risley, Taishi Nammoto, Rebekka Williams, Max Panoff, Meynard Ballesteros, Bobbie Suarez

Recent multi-object tracking (MOT) systems have leveraged highly accurate object detectors; however, training such detectors requires large amounts of labeled data. Although such data is widely available for humans and vehicles, it is significantly more scarce for other animal species. We present Robust Confidence Tracking (RCT), an algorithm designed to maintain robust performance even when detection quality is poor. In contrast to prior methods which discard detection confidence information, RCT takes a fundamentally different approach, relying on the exact detection confidence values to initialize tracks, extend tracks, and filter tracks. In particular, RCT is able to minimize identity switches by efficiently using low-confidence detections (along with a single object tracker) to keep continuous track of objects. To evaluate trackers in the presence of unreliable detections, we present a challenging real-world underwater fish tracking dataset, FISHTRAC. In an evaluation on FISHTRAC as well as the UA-DETRAC dataset, we find that RCT outperforms other algorithms when provided with imperfect detections, including state-of-the-art deep single and multi-object trackers as well as more classic approaches. Specifically, RCT has the best average HOTA across methods that successfully return results for all sequences, and has significantly less identity switches than other methods.

* 12 pages, 5 figures, 6 tables 
Access Paper or Ask Questions

Spatio-Temporal Action Detection with Multi-Object Interaction

Apr 01, 2020
Huijuan Xu, Lizhi Yang, Stan Sclaroff, Kate Saenko, Trevor Darrell

Spatio-temporal action detection in videos requires localizing the action both spatially and temporally in the form of an "action tube". Nowadays, most spatio-temporal action detection datasets (e.g. UCF101-24, AVA, DALY) are annotated with action tubes that contain a single person performing the action, thus the predominant action detection models simply employ a person detection and tracking pipeline for localization. However, when the action is defined as an interaction between multiple objects, such methods may fail since each bounding box in the action tube contains multiple objects instead of one person. In this paper, we study the spatio-temporal action detection problem with multi-object interaction. We introduce a new dataset that is annotated with action tubes containing multi-object interactions. Moreover, we propose an end-to-end spatio-temporal action detection model that performs both spatial and temporal regression simultaneously. Our spatial regression may enclose multiple objects participating in the action. During test time, we simply connect the regressed bounding boxes within the predicted temporal duration using a simple heuristic. We report the baseline results of our proposed model on this new dataset, and also show competitive results on the standard benchmark UCF101-24 using only RGB input.

Access Paper or Ask Questions

An Analysis of Pre-Training on Object Detection

Apr 11, 2019
Hengduo Li, Bharat Singh, Mahyar Najibi, Zuxuan Wu, Larry S. Davis

We provide a detailed analysis of convolutional neural networks which are pre-trained on the task of object detection. To this end, we train detectors on large datasets like OpenImagesV4, ImageNet Localization and COCO. We analyze how well their features generalize to tasks like image classification, semantic segmentation and object detection on small datasets like PASCAL-VOC, Caltech-256, SUN-397, Flowers-102 etc. Some important conclusions from our analysis are --- 1) Pre-training on large detection datasets is crucial for fine-tuning on small detection datasets, especially when precise localization is needed. For example, we obtain 81.1% mAP on the PASCAL-VOC dataset at 0.7 IoU after pre-training on OpenImagesV4, which is 7.6% better than the recently proposed DeformableConvNetsV2 which uses ImageNet pre-training. 2) Detection pre-training also benefits other localization tasks like semantic segmentation but adversely affects image classification. 3) Features for images (like avg. pooled Conv5) which are similar in the object detection feature space are likely to be similar in the image classification feature space but the converse is not true. 4) Visualization of features reveals that detection neurons have activations over an entire object, while activations for classification networks typically focus on parts. Therefore, detection networks are poor at classification when multiple instances are present in an image or when an instance only covers a small fraction of an image.

Access Paper or Ask Questions

Contextual Hypergraph Modelling for Salient Object Detection

Oct 22, 2013
Xi Li, Yao Li, Chunhua Shen, Anthony Dick, Anton van den Hengel

Salient object detection aims to locate objects that capture human attention within images. Previous approaches often pose this as a problem of image contrast analysis. In this work, we model an image as a hypergraph that utilizes a set of hyperedges to capture the contextual properties of image pixels or regions. As a result, the problem of salient object detection becomes one of finding salient vertices and hyperedges in the hypergraph. The main advantage of hypergraph modeling is that it takes into account each pixel's (or region's) affinity with its neighborhood as well as its separation from image background. Furthermore, we propose an alternative approach based on center-versus-surround contextual contrast analysis, which performs salient object detection by optimizing a cost-sensitive support vector machine (SVM) objective function. Experimental results on four challenging datasets demonstrate the effectiveness of the proposed approaches against the state-of-the-art approaches to salient object detection.

* Appearing in Proc. Int. Conf. Computer Vision 2013, Sydney, Australia 
Access Paper or Ask Questions