Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Object Detection": models, code, and papers

Learning Instance-Aware Object Detection Using Determinantal Point Processes

May 30, 2018
Nuri Kim, Donghoon Lee, Songhwai Oh

Recent object detectors find instances while categorizing candidate regions in an input image. As each region is evaluated independently, the number of candidate regions from a detector is usually larger than the number of objects. Since the final goal of detection is to assign a single detection to each object, an additional algorithm, such as non-maximum suppression (NMS), is used to select a single bounding box for an object. While simple heuristic algorithms, such as NMS, are effective for stand-alone objects, they can fail to detect overlapped objects. In this paper, we address this issue by training a network to distinguish different objects while localizing and categorizing them. We propose an instance-aware detection network (IDNet), which can learn to extract features from candidate regions and measures their similarities. Based on pairwise similarities and detection qualities, the IDNet selects an optimal subset of candidate bounding boxes using determinantal point processes (DPPs). Extensive experiments demonstrate that the proposed algorithm performs favorably compared to existing state-of-the-art detection methods particularly for overlapped objects on the PASCAL VOC and MS COCO datasets.


Understanding Object Detection Through An Adversarial Lens

Jul 11, 2020
Ka-Ho Chow, Ling Liu, Mehmet Emre Gursoy, Stacey Truex, Wenqi Wei, Yanzhao Wu

Deep neural networks based object detection models have revolutionized computer vision and fueled the development of a wide range of visual recognition applications. However, recent studies have revealed that deep object detectors can be compromised under adversarial attacks, causing a victim detector to detect no object, fake objects, or mislabeled objects. With object detection being used pervasively in many security-critical applications, such as autonomous vehicles and smart cities, we argue that a holistic approach for an in-depth understanding of adversarial attacks and vulnerabilities of deep object detection systems is of utmost importance for the research community to develop robust defense mechanisms. This paper presents a framework for analyzing and evaluating vulnerabilities of the state-of-the-art object detectors under an adversarial lens, aiming to analyze and demystify the attack strategies, adverse effects, and costs, as well as the cross-model and cross-resolution transferability of attacks. Using a set of quantitative metrics, extensive experiments are performed on six representative deep object detectors from three popular families (YOLOv3, SSD, and Faster R-CNN) with two benchmark datasets (PASCAL VOC and MS COCO). We demonstrate that the proposed framework can serve as a methodical benchmark for analyzing adversarial behaviors and risks in real-time object detection systems. We conjecture that this framework can also serve as a tool to assess the security risks and the adversarial robustness of deep object detectors to be deployed in real-world applications.


Task-Focused Few-Shot Object Detection for Robot Manipulation

Jan 28, 2022
Brent Griffin

This paper addresses the problem of mobile robot manipulation of novel objects via detection. Our approach uses vision and control as complementary functions that learn from real-world tasks. We develop a manipulation method based solely on detection then introduce task-focused few-shot object detection to learn new objects and settings. The current paradigm for few-shot object detection uses existing annotated examples. In contrast, we extend this paradigm by using active data collection and annotation selection that improves performance for specific downstream tasks (e.g., depth estimation and grasping). In experiments for our interactive approach to few-shot learning, we train a robot to manipulate objects directly from detection (ClickBot). ClickBot learns visual servo control from a single click of annotation, grasps novel objects in clutter and other settings, and achieves state-of-the-art results on an existing visual servo control and depth estimation benchmark. Finally, we establish a task-focused few-shot object detection benchmark to support future research:


A Survey of Self-Supervised and Few-Shot Object Detection

Nov 08, 2021
Gabriel Huang, Issam Laradji, David Vazquez, Simon Lacoste-Julien, Pau Rodriguez

Labeling data is often expensive and time-consuming, especially for tasks such as object detection and instance segmentation, which require dense labeling of the image. While few-shot object detection is about training a model on novel (unseen) object classes with little data, it still requires prior training on many labeled examples of base (seen) classes. On the other hand, self-supervised methods aim at learning representations from unlabeled data which transfer well to downstream tasks such as object detection. Combining few-shot and self-supervised object detection is a promising research direction. In this survey, we review and characterize the most recent approaches on few-shot and self-supervised object detection. Then, we give our main takeaways and discuss future research directions. Project page at

* Awesome Few-Shot Object Detection (Leaderboard) at 

Decoupled Self Attention for Accurate One Stage Object Detection

Dec 14, 2020
Kehe WUa, Zuge Chena, Qi MAb, Xiaoliang Zhanga, Wei Lia

As the scale of object detection dataset is smaller than that of image recognition dataset ImageNet, transfer learning has become a basic training method for deep learning object detection models, which will pretrain the backbone network of object detection model on ImageNet dataset to extract features for classification and localization subtasks. However, the classification task focuses on the salient region features of object, while the location task focuses on the edge features of object, so there is certain deviation between the features extracted by pretrained backbone network and the features used for localization task. In order to solve this problem, a decoupled self attention(DSA) module is proposed for one stage object detection models in this paper. DSA includes two decoupled self-attention branches, so it can extract appropriate features for different tasks. It is located between FPN and head networks of subtasks, so it is used to extract global features based on FPN fused features for different tasks independently. Although the network of DSA module is simple, but it can effectively improve the performance of object detection, also it can be easily embedded in many detection models. Our experiments are based on the representative one-stage detection model RetinaNet. In COCO dataset, when ResNet50 and ResNet101 are used as backbone networks, the detection performances can be increased by 0.4% AP and 0.5% AP respectively. When DSA module and object confidence task are applied in RetinaNet together, the detection performances based on ResNet50 and ResNet101 can be increased by 1.0% AP and 1.4% AP respectively. The experiment results show the effectiveness of DSA module.

* 15 pages, 5 figures 

Object Detection in Videos with Tubelet Proposal Networks

Apr 10, 2017
Kai Kang, Hongsheng Li, Tong Xiao, Wanli Ouyang, Junjie Yan, Xihui Liu, Xiaogang Wang

Object detection in videos has drawn increasing attention recently with the introduction of the large-scale ImageNet VID dataset. Different from object detection in static images, temporal information in videos is vital for object detection. To fully utilize temporal information, state-of-the-art methods are based on spatiotemporal tubelets, which are essentially sequences of associated bounding boxes across time. However, the existing methods have major limitations in generating tubelets in terms of quality and efficiency. Motion-based methods are able to obtain dense tubelets efficiently, but the lengths are generally only several frames, which is not optimal for incorporating long-term temporal information. Appearance-based methods, usually involving generic object tracking, could generate long tubelets, but are usually computationally expensive. In this work, we propose a framework for object detection in videos, which consists of a novel tubelet proposal network to efficiently generate spatiotemporal proposals, and a Long Short-term Memory (LSTM) network that incorporates temporal information from tubelet proposals for achieving high object detection accuracy in videos. Experiments on the large-scale ImageNet VID dataset demonstrate the effectiveness of the proposed framework for object detection in videos.

* CVPR 2017 

Group-Free 3D Object Detection via Transformers

Apr 23, 2021
Ze Liu, Zheng Zhang, Yue Cao, Han Hu, Xin Tong

Recently, directly detecting 3D objects from 3D point clouds has received increasing attention. To extract object representation from an irregular point cloud, existing methods usually take a point grouping step to assign the points to an object candidate so that a PointNet-like network could be used to derive object features from the grouped points. However, the inaccurate point assignments caused by the hand-crafted grouping scheme decrease the performance of 3D object detection. In this paper, we present a simple yet effective method for directly detecting 3D objects from the 3D point cloud. Instead of grouping local points to each object candidate, our method computes the feature of an object from all the points in the point cloud with the help of an attention mechanism in the Transformers \cite{vaswani2017attention}, where the contribution of each point is automatically learned in the network training. With an improved attention stacking scheme, our method fuses object features in different stages and generates more accurate object detection results. With few bells and whistles, the proposed method achieves state-of-the-art 3D object detection performance on two widely used benchmarks, ScanNet V2 and SUN RGB-D. The code and models are publicly available at \url{}


Focal Loss in 3D Object Detection

Sep 18, 2018
Peng Yun, Lei Tai, Yuan Wang, Ming Liu

3D object detection is still an open problem in autonomous driving scenes. Robots recognize and localize key objects from sparse inputs, and suffer from a larger continuous searching space as well as serious fore-background imbalance compared to the image-based detection. In this paper, we try to solve the fore-background imbalance in the 3D object detection task. Inspired by the recent improvement of focal loss on image-based detection which is seen as a hard-mining improvement of binary cross entropy, we extend it to point-cloud-based object detection and conduct experiments to show its performance based on two different type of 3D detectors: 3D-FCN and VoxelNet. The results show up to 11.2 AP gains from focal loss in a wide range of hyperparameters in 3D object detection. Our code is available at

* Our code is available at 

Attentive Contexts for Object Detection

Mar 24, 2016
Jianan Li, Yunchao Wei, Xiaodan Liang, Jian Dong, Tingfa Xu, Jiashi Feng, Shuicheng Yan

Modern deep neural network based object detection methods typically classify candidate proposals using their interior features. However, global and local surrounding contexts that are believed to be valuable for object detection are not fully exploited by existing methods yet. In this work, we take a step towards understanding what is a robust practice to extract and utilize contextual information to facilitate object detection in practice. Specifically, we consider the following two questions: "how to identify useful global contextual information for detecting a certain object?" and "how to exploit local context surrounding a proposal for better inferring its contents?". We provide preliminary answers to these questions through developing a novel Attention to Context Convolution Neural Network (AC-CNN) based object detection model. AC-CNN effectively incorporates global and local contextual information into the region-based CNN (e.g. Fast RCNN) detection model and provides better object detection performance. It consists of one attention-based global contextualized (AGC) sub-network and one multi-scale local contextualized (MLC) sub-network. To capture global context, the AGC sub-network recurrently generates an attention map for an input image to highlight useful global contextual locations, through multiple stacked Long Short-Term Memory (LSTM) layers. For capturing surrounding local context, the MLC sub-network exploits both the inside and outside contextual information of each specific proposal at multiple scales. The global and local context are then fused together for making the final decision for detection. Extensive experiments on PASCAL VOC 2007 and VOC 2012 well demonstrate the superiority of the proposed AC-CNN over well-established baselines. In particular, AC-CNN outperforms the popular Fast-RCNN by 2.0% and 2.2% on VOC 2007 and VOC 2012 in terms of mAP, respectively.