Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"Object Detection": models, code, and papers

Object Detection based on Region Decomposition and Assembly

Jan 24, 2019
Seung-Hwan Bae

Region-based object detection infers object regions for one or more categories in an image. Due to the recent advances in deep learning and region proposal methods, object detectors based on convolutional neural networks (CNNs) have been flourishing and provided the promising detection results. However, the detection accuracy is degraded often because of the low discriminability of object CNN features caused by occlusions and inaccurate region proposals. In this paper, we therefore propose a region decomposition and assembly detector (R-DAD) for more accurate object detection. In the proposed R-DAD, we first decompose an object region into multiple small regions. To capture an entire appearance and part details of the object jointly, we extract CNN features within the whole object region and decomposed regions. We then learn the semantic relations between the object and its parts by combining the multi-region features stage by stage with region assembly blocks, and use the combined and high-level semantic features for the object classification and localization. In addition, for more accurate region proposals, we propose a multi-scale proposal layer that can generate object proposals of various scales. We integrate the R-DAD into several feature extractors, and prove the distinct performance improvement on PASCAL07/12 and MSCOCO18 compared to the recent convolutional detectors.

* Accepted to 2019 AAAI Conference on Artificial Intelligence (AAAI) 

Robotic Interestingness via Human-Informed Few-Shot Object Detection

Aug 01, 2022
Seungchan Kim, Chen Wang, Bowen Li, Sebastian Scherer

Interestingness recognition is crucial for decision making in autonomous exploration for mobile robots. Previous methods proposed an unsupervised online learning approach that can adapt to environments and detect interesting scenes quickly, but lack the ability to adapt to human-informed interesting objects. To solve this problem, we introduce a human-interactive framework, AirInteraction, that can detect human-informed objects via few-shot online learning. To reduce the communication bandwidth, we first apply an online unsupervised learning algorithm on the unmanned vehicle for interestingness recognition and then only send the potential interesting scenes to a base-station for human inspection. The human operator is able to draw and provide bounding box annotations for particular interesting objects, which are sent back to the robot to detect similar objects via few-shot learning. Only using few human-labeled examples, the robot can learn novel interesting object categories during the mission and detect interesting scenes that contain the objects. We evaluate our method on various interesting scene recognition datasets. To the best of our knowledge, it is the first human-informed few-shot object detection framework for autonomous exploration.


3D Object Detection Combining Semantic and Geometric Features from Point Clouds

Oct 10, 2021
Hao Peng, Guofeng Tong, Zheng Li, Yaqi Wang, Yuyuan Shao

In this paper, we investigate the combination of voxel-based methods and point-based methods, and propose a novel end-to-end two-stage 3D object detector named SGNet for point clouds scenes. The voxel-based methods voxelize the scene to regular grids, which can be processed with the current advanced feature learning frameworks based on convolutional layers for semantic feature learning. Whereas the point-based methods can better extract the geometric feature of the point due to the coordinate reservations. The combination of the two is an effective solution for 3D object detection from point clouds. However, most current methods use a voxel-based detection head with anchors for final classification and localization. Although the preset anchors cover the entire scene, it is not suitable for point clouds detection tasks with larger scenes and multiple categories due to the limitation of voxel size. In this paper, we propose a voxel-to-point module (VTPM) that captures semantic and geometric features. The VTPM is a Voxel-Point-Based Module that finally implements 3D object detection in point space, which is more conducive to the detection of small-size objects and avoids the presets of anchors in inference stage. In addition, a Confidence Adjustment Module (CAM) with the center-boundary-aware confidence attention is proposed to solve the misalignment between the predicted confidence and proposals in the regions of the interest (RoI) selection. The SGNet proposed in this paper has achieved state-of-the-art results for 3D object detection in the KITTI dataset, especially in the detection of small-size objects such as cyclists. Actually, as of September 19, 2021, for KITTI dataset, SGNet ranked 1st in 3D and BEV detection on cyclists with easy difficulty level, and 2nd in the 3D detection of moderate cyclists.


Weakly Supervised Object Detection with Segmentation Collaboration

Apr 01, 2019
Xiaoyan Li, Meina Kan, Shiguang Shan, Xilin Chen

Weakly supervised object detection aims at learning precise object detectors, given image category labels. In recent prevailing works, this problem is generally formulated as a multiple instance learning module guided by an image classification loss. The object bounding box is assumed to be the one contributing most to the classification among all proposals. However, the region contributing most is also likely to be a crucial part or the supporting context of an object. To obtain a more accurate detector, in this work we propose a novel end-to-end weakly supervised detection approach, where a newly introduced generative adversarial segmentation module interacts with the conventional detection module in a collaborative loop. The collaboration mechanism takes full advantages of the complementary interpretations of the weakly supervised localization task, namely detection and segmentation tasks, forming a more comprehensive solution. Consequently, our method obtains more precise object bounding boxes, rather than parts or irrelevant surroundings. Expectedly, the proposed method achieves an accuracy of 51.0% on the PASCAL VOC 2007 dataset, outperforming the state-of-the-arts and demonstrating its superiority for weakly supervised object detection.


Improving Object Detection and Attribute Recognition by Feature Entanglement Reduction

Aug 25, 2021
Zhaoheng Zheng, Arka Sadhu, Ram Nevatia

We explore object detection with two attributes: color and material. The task aims to simultaneously detect objects and infer their color and material. A straight-forward approach is to add attribute heads at the very end of a usual object detection pipeline. However, we observe that the two goals are in conflict: Object detection should be attribute-independent and attributes be largely object-independent. Features computed by a standard detection network entangle the category and attribute features; we disentangle them by the use of a two-stream model where the category and attribute features are computed independently but the classification heads share Regions of Interest (RoIs). Compared with a traditional single-stream model, our model shows significant improvements over VG-20, a subset of Visual Genome, on both supervised and attribute transfer tasks.

* Camera-ready for ICIP 2021 

Hierarchical Structure and Joint Training for Large Scale Semi-supervised Object Detection

May 30, 2019
Ye Guo, Yali Li, Shengjin Wang

Generic object detection is one of the most fundamental and important problems in computer vision. When it comes to large scale object detection for thousands of categories, it is unpractical to provide all the bounding box labels for each category. In this paper, we propose a novel hierarchical structure and joint training framework for large scale semi-supervised object detection. First, we utilize the relationships among target categories to model a hierarchical network to further improve the performance of recognition. Second, we combine bounding-box-level labeled images and image-level labeled images together for joint training, and the proposed method can be easily applied in current two-stage object detection framework with excellent performance. Experimental results show that the proposed large scale semi-supervised object detection network obtains the state-of-the-art performance, with the mAP of 38.1% on the ImageNet detection validation dataset.


Oriented Object Detection in Aerial Images with Box Boundary-Aware Vectors

Aug 29, 2020
Jingru Yi, Pengxiang Wu, Bo Liu, Qiaoying Huang, Hui Qu, Dimitris Metaxas

Oriented object detection in aerial images is a challenging task as the objects in aerial images are displayed in arbitrary directions and are usually densely packed. Current oriented object detection methods mainly rely on two-stage anchor-based detectors. However, the anchor-based detectors typically suffer from a severe imbalance issue between the positive and negative anchor boxes. To address this issue, in this work we extend the horizontal keypoint-based object detector to the oriented object detection task. In particular, we first detect the center keypoints of the objects, based on which we then regress the box boundary-aware vectors (BBAVectors) to capture the oriented bounding boxes. The box boundary-aware vectors are distributed in the four quadrants of a Cartesian coordinate system for all arbitrarily oriented objects. To relieve the difficulty of learning the vectors in the corner cases, we further classify the oriented bounding boxes into horizontal and rotational bounding boxes. In the experiment, we show that learning the box boundary-aware vectors is superior to directly predicting the width, height, and angle of an oriented bounding box, as adopted in the baseline method. Besides, the proposed method competes favorably with state-of-the-art methods. Code is available at

* Accepted to WACV2021 

Visibility Guided NMS: Efficient Boosting of Amodal Object Detection in Crowded Traffic Scenes

Jun 15, 2020
Nils Gählert, Niklas Hanselmann, Uwe Franke, Joachim Denzler

Object detection is an important task in environment perception for autonomous driving. Modern 2D object detection frameworks such as Yolo, SSD or Faster R-CNN predict multiple bounding boxes per object that are refined using Non-Maximum-Suppression (NMS) to suppress all but one bounding box. While object detection itself is fully end-to-end learnable and does not require any manual parameter selection, standard NMS is parametrized by an overlap threshold that has to be chosen by hand. In practice, this often leads to an inability of standard NMS strategies to distinguish different objects in crowded scenes in the presence of high mutual occlusion, e.g. for parked cars or crowds of pedestrians. Our novel Visibility Guided NMS (vg-NMS) leverages both pixel-based as well as amodal object detection paradigms and improves the detection performance especially for highly occluded objects with little computational overhead. We evaluate vg-NMS using KITTI, VIPER as well as the Synscapes dataset and show that it outperforms current state-of-the-art NMS.

* Machine Learning for Autonomous Driving Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada 

Pseudo Mask Augmented Object Detection

Mar 27, 2018
Xiangyun Zhao, Shuang Liang, Yichen Wei

In this work, we present a novel and effective framework to facilitate object detection with the instance-level segmentation information that is only supervised by bounding box annotation. Starting from the joint object detection and instance segmentation network, we propose to recursively estimate the pseudo ground-truth object masks from the instance-level object segmentation network training, and then enhance the detection network with top-down segmentation feedbacks. The pseudo ground truth mask and network parameters are optimized alternatively to mutually benefit each other. To obtain the promising pseudo masks in each iteration, we embed a graphical inference that incorporates the low-level image appearance consistency and the bounding box annotations to refine the segmentation masks predicted by the segmentation network. Our approach progressively improves the object detection performance by incorporating the detailed pixel-wise information learned from the weakly-supervised segmentation network. Extensive evaluation on the detection task in PASCAL VOC 2007 and 2012 [12] verifies that the proposed approach is effective.