Object class detection has been a synonym for 2D bounding box localization for the longest time, fueled by the success of powerful statistical learning techniques, combined with robust image representations. Only recently, there has been a growing interest in revisiting the promise of computer vision from the early days: to precisely delineate the contents of a visual scene, object by object, in 3D. In this paper, we draw from recent advances in object detection and 2D-3D object lifting in order to design an object class detector that is particularly tailored towards 3D object class detection. Our 3D object class detection method consists of several stages gradually enriching the object detection output with object viewpoint, keypoints and 3D shape estimates. Following careful design, in each stage it constantly improves the performance and achieves state-ofthe-art performance in simultaneous 2D bounding box and viewpoint estimation on the challenging Pascal3D+ dataset.
In CNN-based object detection methods, region proposal becomes a bottleneck when objects exhibit significant scale variation, occlusion or truncation. In addition, these methods mainly focus on 2D object detection and cannot estimate detailed properties of objects. In this paper, we propose subcategory-aware CNNs for object detection. We introduce a novel region proposal network that uses subcategory information to guide the proposal generating process, and a new detection network for joint detection and subcategory classification. By using subcategories related to object pose, we achieve state-of-the-art performance on both detection and pose estimation on commonly used benchmarks.
Classical object detection frameworks lack of utilizing objects' surrounding information. In this article, we introduce the graph convolutional networks (GCN) into the object detection, and propose a new framework called OD-GCN (object detection with graph convolutional network). It utilizes the category relationship to improve the detection precision. We set up a knowledge graph to reflect the co-exist relationships among objects. GCN plays the role of post-processing to adjust the output of base object detection models. It is a flexible framework that any pre-trained object detection models can be used as the base model. In the experiments, we try several popular base detection models, OD-GCN always improve mAP by 1-5 pp in COCO dataset. In addition, visualized analysis reveals the benchmark improvement is quite logical in human's opinion.
Recent development of object detection mainly depends on deep learning with large-scale benchmarks. However, collecting such fully-annotated data is often difficult or expensive for real-world applications, which restricts the power of deep neural networks in practice. Alternatively, humans can detect new objects with little annotation burden, since humans often use the prior knowledge to identify new objects with few elaborately-annotated examples, and subsequently generalize this capacity by exploiting objects from wild images. Inspired by this procedure of learning to detect, we propose a novel Progressive Object Transfer Detection (POTD) framework. Specifically, we make three main contributions in this paper. First, POTD can leverage various object supervision of different domains effectively into a progressive detection procedure. Via such human-like learning, one can boost a target detection task with few annotations. Second, POTD consists of two delicate transfer stages, i.e., Low-Shot Transfer Detection (LSTD), and Weakly-Supervised Transfer Detection (WSTD). In LSTD, we distill the implicit object knowledge of source detector to enhance target detector with few annotations. It can effectively warm up WSTD later on. In WSTD, we design a recurrent object labelling mechanism for learning to annotate weakly-labeled images. More importantly, we exploit the reliable object supervision from LSTD, which can further enhance the robustness of target detector in the WSTD stage. Finally, we perform extensive experiments on a number of challenging detection benchmarks with different settings. The results demonstrate that, our POTD outperforms the recent state-of-the-art approaches.
There are many limitations applying object detection algorithm on various environments. Especially detecting small objects is still challenging because they have low resolution and limited information. We propose an object detection method using context for improving accuracy of detecting small objects. The proposed method uses additional features from different layers as context by concatenating multi-scale features. We also propose object detection with attention mechanism which can focus on the object in image, and it can include contextual information from target layer. Experimental results shows that proposed method also has higher accuracy than conventional SSD on detecting small objects. Also, for 300$\times$300 input, we achieved 78.1% Mean Average Precision (mAP) on the PASCAL VOC2007 test set.
In this work, we introduce a novel weakly supervised object detection (WSOD) paradigm to detect objects belonging to rare classes that have not many examples using transferable knowledge from human-object interactions (HOI). While WSOD shows lower performance than full supervision, we mainly focus on HOI as the main context which can strongly supervise complex semantics in images. Therefore, we propose a novel module called RRPN (relational region proposal network) which outputs an object-localizing attention map only with human poses and action verbs. In the source domain, we fully train an object detector and the RRPN with full supervision of HOI. With transferred knowledge about localization map from the trained RRPN, a new object detector can learn unseen objects with weak verbal supervision of HOI without bounding box annotations in the target domain. Because the RRPN is designed as an add-on type, we can apply it not only to the object detection but also to other domains such as semantic segmentation. The experimental results on HICO-DET dataset show the possibility that the proposed method can be a cheap alternative for the current supervised object detection paradigm. Moreover, qualitative results demonstrate that our model can properly localize unseen objects on HICO-DET and V-COCO datasets.
We introduce a new challenge for computer and robotic vision, the first ACRV Robotic Vision Challenge, Probabilistic Object Detection. Probabilistic object detection is a new variation on traditional object detection tasks, requiring estimates of spatial and semantic uncertainty. We extend the traditional bounding box format of object detection to express spatial uncertainty using gaussian distributions for the box corners. The challenge introduces a new test dataset of video sequences, which are designed to more closely resemble the kind of data available to a robotic system. We evaluate probabilistic detections using a new probability-based detection quality (PDQ) measure. The goal in creating this challenge is to draw the computer and robotic vision communities together, toward applying object detection solutions for practical robotics applications.
In this paper, we propose SparseDet for end-to-end 3D object detection from point cloud. Existing works on 3D object detection rely on dense object candidates over all locations in a 3D or 2D grid following the mainstream methods for object detection in 2D images. However, this dense paradigm requires expertise in data to fulfill the gap between label and detection. As a new detection paradigm, SparseDet maintains a fixed set of learnable proposals to represent latent candidates and directly perform classification and localization for 3D objects through stacked transformers. It demonstrates that effective 3D object detection can be achieved with none of post-processing such as redundant removal and non-maximum suppression. With a properly designed network, SparseDet achieves highly competitive detection accuracy while running with a more efficient speed of 34.5 FPS. We believe this end-to-end paradigm of SparseDet will inspire new thinking on the sparsity of 3D object detection.
We present Open Images V4, a dataset of 9.2M images with unified annotations for image classification, object detection and visual relationship detection. The images have a Creative Commons Attribution license that allows to share and adapt the material, and they have been collected from Flickr without a predefined list of class names or tags, leading to natural class statistics and avoiding an initial design bias. Open Images V4 offers large scale across several dimensions: 30.1M image-level labels for 19.8k concepts, 15.4M bounding boxes for 600 object classes, and 375k visual relationship annotations involving 57 classes. For object detection in particular, we provide 15x more bounding boxes than the next largest datasets (15.4M boxes on 1.9M images). The images often show complex scenes with several objects (8 annotated objects per image on average). We annotated visual relationships between them, which support visual relationship detection, an emerging task that requires structured reasoning. We provide in-depth comprehensive statistics about the dataset, we validate the quality of the annotations, and we study how the performance of many modern models evolves with increasing amounts of training data. We hope that the scale, quality, and variety of Open Images V4 will foster further research and innovation even beyond the areas of image classification, object detection, and visual relationship detection.
Object detection is a fundamental task in computer vision and image processing. Current deep learning based object detectors have been highly successful with abundant labeled data. But in real life, it is not guaranteed that each object category has enough labeled samples for training. These large object detectors are easy to overfit when the training data is limited. Therefore, it is necessary to introduce few-shot learning and zero-shot learning into object detection, which can be named low-shot object detection together. Low-Shot Object Detection (LSOD) aims to detect objects from a few or even zero labeled data, which can be categorized into few-shot object detection (FSOD) and zero-shot object detection (ZSD), respectively. This paper conducts a comprehensive survey for deep learning based FSOD and ZSD. First, this survey classifies methods for FSOD and ZSD into different categories and discusses the pros and cons of them. Second, this survey reviews dataset settings and evaluation metrics for FSOD and ZSD, then analyzes the performance of different methods on these benchmarks. Finally, this survey discusses future challenges and promising directions for FSOD and ZSD.