In this work, we present Detective - an attentive object detector that identifies objects in images in a sequential manner. Our network is based on an encoder-decoder architecture, where the encoder is a convolutional neural network, and the decoder is a convolutional recurrent neural network coupled with an attention mechanism. At each iteration, our decoder focuses on the relevant parts of the image using an attention mechanism, and then estimates the object's class and the bounding box coordinates. Current object detection models generate dense predictions and rely on post-processing to remove duplicate predictions. Detective is a sparse object detector that generates a single bounding box per object instance. However, training a sparse object detector is challenging, as it requires the model to reason at the instance level and not just at the class and spatial levels. We propose a training mechanism based on the Hungarian algorithm and a loss that balances the localization and classification tasks. This allows Detective to achieve promising results on the PASCAL VOC object detection dataset. Our experiments demonstrate that sparse object detection is possible and has a great potential for future developments in applications where the order of the objects to be predicted is of interest.
3D object detection often involves complicated training and testing pipelines, which require substantial domain knowledge about individual datasets. Inspired by recent non-maximum suppression-free 2D object detection models, we propose a 3D object detection architecture on point clouds. Our method models 3D object detection as message passing on a dynamic graph, generalizing the DGCNN framework to predict a set of objects. In our construction, we remove the necessity of post-processing via object confidence aggregation or non-maximum suppression. To facilitate object detection from sparse point clouds, we also propose a set-to-set distillation approach customized to 3D detection. This approach aligns the outputs of the teacher model and the student model in a permutation-invariant fashion, significantly simplifying knowledge distillation for the 3D detection task. Our method achieves state-of-the-art performance on autonomous driving benchmarks. We also provide abundant analysis of the detection model and distillation framework.
Traffic scene perception (TSP) aims to real-time extract accurate on-road environment information, which in- volves three phases: detection of objects of interest, recognition of detected objects, and tracking of objects in motion. Since recognition and tracking often rely on the results from detection, the ability to detect objects of interest effectively plays a crucial role in TSP. In this paper, we focus on three important classes of objects: traffic signs, cars, and cyclists. We propose to detect all the three important objects in a single learning based detection framework. The proposed framework consists of a dense feature extractor and detectors of three important classes. Once the dense features have been extracted, these features are shared with all detectors. The advantage of using one common framework is that the detection speed is much faster, since all dense features need only to be evaluated once in the testing phase. In contrast, most previous works have designed specific detectors using different features for each of these objects. To enhance the feature robustness to noises and image deformations, we introduce spatially pooled features as a part of aggregated channel features. In order to further improve the generalization performance, we propose an object subcategorization method as a means of capturing intra-class variation of objects. We experimentally demonstrate the effectiveness and efficiency of the proposed framework in three detection applications: traffic sign detection, car detection, and cyclist detection. The proposed framework achieves the competitive performance with state-of- the-art approaches on several benchmark datasets.
We propose a single-stage Human-Object Interaction (HOI) detection method that has outperformed all existing methods on HICO-DET dataset at 37 fps on a single Titan XP GPU. It is the first real-time HOI detection method. Conventional HOI detection methods are composed of two stages, i.e., human-object proposals generation, and proposals classification. Their effectiveness and efficiency are limited by the sequential and separate architecture. In this paper, we propose a Parallel Point Detection and Matching (PPDM) HOI detection framework. In PPDM, an HOI is defined as a point triplet < human point, interaction point, object point>. Human and object points are the center of the detection boxes, and the interaction point is the midpoint of the human and object points. PPDM contains two parallel branches, namely point detection branch and point matching branch. The point detection branch predicts three points. Simultaneously, the point matching branch predicts two displacements from the interaction point to its corresponding human and object points. The human point and the object point originated from the same interaction point are considered as matched pairs. In our novel parallel architecture, the interaction points implicitly provide context and regularization for human and object detection. The isolated detection boxes are unlikely to form meaning HOI triplets are suppressed, which increases the precision of HOI detection. Moreover, the matching between human and object detection boxes is only applied around limited numbers of filtered candidate interaction points, which saves much computational cost. Additionally, we build a new applicationoriented database named HOI-A, which severs as a good supplement to the existing datasets. The source code and the dataset will be made publicly available to facilitate the development of HOI detection.
Object detection plays an important role in various visual applications. However, the precision and speed of detector are usually contradictory. One main reason for fast detectors' precision reduction is that small objects are hard to be detected. To address this problem, we propose a multiple receptive field and small-object-focusing weakly-supervised segmentation network (MRFSWSnet) to achieve fast object detection. In MRFSWSnet, multiple receptive fields block (MRF) is used to pay attention to the object and its adjacent background's different spatial location with different weights to enhance the feature's discriminability. In addition, in order to improve the accuracy of small object detection, a small-object-focusing weakly-supervised segmentation module which only focuses on small object instead of all objects is integrated into the detection network for auxiliary training to improve the precision of small object detection. Extensive experiments show the effectiveness of our method on both PASCAL VOC and MS COCO detection datasets. In particular, with a lower resolution version of 300x300, MRFSWSnet achieves 80.9% mAP on VOC2007 test with an inference speed of 15 milliseconds per frame, which is the state-of-the-art detector among real-time detectors.
Accurate detection and tracking of objects is vital for effective video understanding. In previous work, the two tasks have been combined in a way that tracking is based heavily on detection, but the detection benefits marginally from the tracking. To increase synergy, we propose to more tightly integrate the tasks by conditioning the object detection in the current frame on tracklets computed in prior frames. With this approach, the object detection results not only have high detection responses, but also improved coherence with the existing tracklets. This greater coherence leads to estimated object trajectories that are smoother and more stable than the jittered paths obtained without tracklet-conditioned detection. Over extensive experiments, this approach is shown to achieve state-of-the-art performance in terms of both detection and tracking accuracy, as well as noticeable improvements in tracking stability.
Salient object detection is evaluated using binary ground truth with the labels being salient object class and background. In this paper, we corroborate based on three subjective experiments on a novel image dataset that objects in natural images are inherently perceived to have varying levels of importance. Our dataset, named SalMoN (saliency in multi-object natural images), has 588 images containing multiple objects. The subjective experiments performed record spontaneous attention and perception through eye fixation duration, point clicking and rectangle drawing. As object saliency in a multi-object image is inherently multi-level, we propose that salient object detection must be evaluated for the capability to detect all multi-level salient objects apart from the salient object class detection capability. For this purpose, we generate multi-level maps as ground truth corresponding to all the dataset images using the results of the subjective experiments, with the labels being multi-level salient objects and background. We then propose the use of mean absolute error, Kendall's rank correlation and average area under precision-recall curve to evaluate existing salient object detection methods on our multi-level saliency ground truth dataset. Approaches that represent saliency detection on images as local-global hierarchical processing of a graph perform well in our dataset.
While generic object detection has achieved large improvements with rich feature hierarchies from deep nets, detecting small objects with poor visual cues remains challenging. Motion cues from multiple frames may be more informative for detecting such hard-to-distinguish objects in each frame. However, how to encode discriminative motion patterns, such as deformations and pose changes that characterize objects, has remained an open question. To learn them and thereby realize small object detection, we present a neural model called the Recurrent Correlational Network, where detection and tracking are jointly performed over a multi-frame representation learned through a single, trainable, and end-to-end network. A convolutional long short-term memory network is utilized for learning informative appearance change for detection, while learned representation is shared in tracking for enhancing its performance. In experiments with datasets containing images of scenes with small flying objects, such as birds and unmanned aerial vehicles, the proposed method yielded consistent improvements in detection performance over deep single-frame detectors and existing motion-based detectors. Furthermore, our network performs as well as state-of-the-art generic object trackers when it was evaluated as a tracker on the bird dataset.
Video object detection is challenging because objects that are easily detected in one frame may be difficult to detect in another frame within the same clip. Recently, there have been major advances for doing object detection in a single image. These methods typically contain three phases: (i) object proposal generation (ii) object classification and (iii) post-processing. We propose a modification of the post-processing phase that uses high-scoring object detections from nearby frames to boost scores of weaker detections within the same clip. We show that our method obtains superior results to state-of-the-art single image object detection techniques. Our method placed 3rd in the video object detection (VID) task of the ImageNet Large Scale Visual Recognition Challenge 2015 (ILSVRC2015).
Object detection is a major challenge in computer vision, involving both object classification and object localization within a scene. While deep neural networks have been shown in recent years to yield very powerful techniques for tackling the challenge of object detection, one of the biggest challenges with enabling such object detection networks for widespread deployment on embedded devices is high computational and memory requirements. Recently, there has been an increasing focus in exploring small deep neural network architectures for object detection that are more suitable for embedded devices, such as Tiny YOLO and SqueezeDet. Inspired by the efficiency of the Fire microarchitecture introduced in SqueezeNet and the object detection performance of the single-shot detection macroarchitecture introduced in SSD, this paper introduces Tiny SSD, a single-shot detection deep convolutional neural network for real-time embedded object detection that is composed of a highly optimized, non-uniform Fire sub-network stack and a non-uniform sub-network stack of highly optimized SSD-based auxiliary convolutional feature layers designed specifically to minimize model size while maintaining object detection performance. The resulting Tiny SSD possess a model size of 2.3MB (~26X smaller than Tiny YOLO) while still achieving an mAP of 61.3% on VOC 2007 (~4.2% higher than Tiny YOLO). These experimental results show that very small deep neural network architectures can be designed for real-time object detection that are well-suited for embedded scenarios.
The transformer networks, which originate from machine translation, are particularly good at modeling long-range dependencies within a long sequence. Currently, the transformer networks are making revolutionary progress in various vision tasks ranging from high-level classification tasks to low-level dense prediction tasks. In this paper, we conduct research on applying the transformer networks for salient object detection (SOD). Specifically, we adopt the dense transformer backbone for fully supervised RGB image based SOD, RGB-D image pair based SOD, and weakly supervised SOD via scribble supervision. As an extension, we also apply our fully supervised model to the task of camouflaged object detection (COD) for camouflaged object segmentation. For the fully supervised models, we define the dense transformer backbone as feature encoder, and design a very simple decoder to produce a one channel saliency map (or camouflage map for the COD task). For the weakly supervised model, as there exists no structure information in the scribble annotation, we first adopt the recent proposed Gated-CRF loss to effectively model the pair-wise relationships for accurate model prediction. Then, we introduce self-supervised learning strategy to push the model to produce scale-invariant predictions, which is proven effective for weakly supervised models and models trained on small training datasets. Extensive experimental results on various SOD and COD tasks (fully supervised RGB image based SOD, fully supervised RGB-D image pair based SOD, weakly supervised SOD via scribble supervision, and fully supervised RGB image based COD) illustrate that transformer networks can transform salient object detection and camouflaged object detection, leading to new benchmarks for each related task.
It has become apparent that a Gaussian center bias can serve as an important prior for visual saliency detection, which has been demonstrated for predicting human eye fixations and salient object detection. Tseng et al. have shown that the photographer's tendency to place interesting objects in the center is a likely cause for the center bias of eye fixations. We investigate the influence of the photographer's center bias on salient object detection, extending our previous work. We show that the centroid locations of salient objects in photographs of Achanta and Liu's data set in fact correlate strongly with a Gaussian model. This is an important insight, because it provides an empirical motivation and justification for the integration of such a center bias in salient object detection algorithms and helps to understand why Gaussian models are so effective. To assess the influence of the center bias on salient object detection, we integrate an explicit Gaussian center bias model into two state-of-the-art salient object detection algorithms. This way, first, we quantify the influence of the Gaussian center bias on pixel- and segment-based salient object detection. Second, we improve the performance in terms of F1 score, Fb score, area under the recall-precision curve, area under the receiver operating characteristic curve, and hit-rate on the well-known data set by Achanta and Liu. Third, by debiasing Cheng et al.'s region contrast model, we exemplarily demonstrate that implicit center biases are partially responsible for the outstanding performance of state-of-the-art algorithms. Last but not least, as a result of debiasing Cheng et al.'s algorithm, we introduce a non-biased salient object detection method, which is of interest for applications in which the image data is not likely to have a photographer's center bias (e.g., image data of surveillance cameras or autonomous robots).
The detection of object states in images (State Detection - SD) is a problem of both theoretical and practical importance and it is tightly interwoven with other important computer vision problems, such as action recognition and affordance detection. It is also highly relevant to any entity that needs to reason and act in dynamic domains, such as robotic systems and intelligent agents. Despite its importance, up to now, the research on this problem has been limited. In this paper, we attempt a systematic study of the SD problem. First, we introduce the Object State Detection Dataset (OSDD), a new publicly available dataset consisting of more than 19,000 annotations for 18 object categories and 9 state classes. Second, using a standard deep learning framework used for Object Detection (OD), we conduct a number of appropriately designed experiments, towards an in-depth study of the behavior of the SD problem. This study enables the setup of a baseline on the performance of SD, as well as its relative performance in comparison to OD, in a variety of scenarios. Overall, the experimental outcomes confirm that SD is harder than OD and that tailored SD methods need to be developed for addressing effectively this significant problem.
Weakly-supervised object detection (WOD) is a challenging problems in computer vision. The key problem is to simultaneously infer the exact object locations in the training images and train the object detectors, given only the training images with weak image-level labels. Intuitively, by simulating the selective attention mechanism of human visual system, saliency detection technique can select attractive objects in scenes and thus is a potential way to provide useful priors for WOD. However, the way to adopt saliency detection in WOD is not trivial since the detected saliency region might be possibly highly ambiguous in complex cases. To this end, this paper first comprehensively analyzes the challenges in applying saliency detection to WOD. Then, we make one of the earliest efforts to bridge saliency detection to WOD via the self-paced curriculum learning, which can guide the learning procedure to gradually achieve faithful knowledge of multi-class objects from easy to hard. The experimental results demonstrate that the proposed approach can successfully bridge saliency detection and WOD tasks and achieve the state-of-the-art object detection results under the weak supervision.
Visual salient object detection (SOD) aims at finding the salient object(s) that attract human attention, while camouflaged object detection (COD) on the contrary intends to discover the camouflaged object(s) that hidden in the surrounding. In this paper, we propose a paradigm of leveraging the contradictory information to enhance the detection ability of both salient object detection and camouflaged object detection. We start by exploiting the easy positive samples in the COD dataset to serve as hard positive samples in the SOD task to improve the robustness of the SOD model. Then, we introduce a similarity measure module to explicitly model the contradicting attributes of these two tasks. Furthermore, considering the uncertainty of labeling in both tasks' datasets, we propose an adversarial learning network to achieve both higher order similarity measure and network confidence estimation. Experimental results on benchmark datasets demonstrate that our solution leads to state-of-the-art (SOTA) performance for both tasks.
Object proposals for detecting moving or static video objects need to address issues such as speed, memory complexity and temporal consistency. We propose an efficient Video Object Proposal (VOP) generation method and show its efficacy in learning a better video object detector. A deep-learning based video object detector learned using the proposed VOP achieves state-of-the-art detection performance on the Youtube-Objects dataset. We further propose a clustering of VOPs which can efficiently be used for detecting objects in video in a streaming fashion. As opposed to applying per-frame convolutional neural network (CNN) based object detection, our proposed method called Objects in Video Enabler thRough LAbel Propagation (OVERLAP) needs to classify only a small fraction of all candidate proposals in every video frame through streaming clustering of object proposals and class-label propagation. Source code will be made available soon.
Weakly supervised object detection (WSOD) using only image-level annotations has attracted growing attention over the past few years. Existing approaches using multiple instance learning easily fall into local optima, because such mechanism tends to learn from the most discriminative object in an image for each category. Therefore, these methods suffer from missing object instances which degrade the performance of WSOD. To address this problem, this paper introduces an end-to-end object instance mining (OIM) framework for weakly supervised object detection. OIM attempts to detect all possible object instances existing in each image by introducing information propagation on the spatial and appearance graphs, without any additional annotations. During the iterative learning process, the less discriminative object instances from the same class can be gradually detected and utilized for training. In addition, we design an object instance reweighted loss to learn larger portion of each object instance to further improve the performance. The experimental results on two publicly available databases, VOC 2007 and 2012, demonstrate the efficacy of proposed approach.
Salient object detection or salient region detection models, diverging from fixation prediction models, have traditionally been dealing with locating and segmenting the most salient object or region in a scene. While the notion of most salient object is sensible when multiple objects exist in a scene, current datasets for evaluation of saliency detection approaches often have scenes with only one single object. We introduce three main contributions in this paper: First, we take an indepth look at the problem of salient object detection by studying the relationship between where people look in scenes and what they choose as the most salient object when they are explicitly asked. Based on the agreement between fixations and saliency judgments, we then suggest that the most salient object is the one that attracts the highest fraction of fixations. Second, we provide two new less biased benchmark datasets containing scenes with multiple objects that challenge existing saliency models. Indeed, we observed a severe drop in performance of 8 state-of-the-art models on our datasets (40% to 70%). Third, we propose a very simple yet powerful model based on superpixels to be used as a baseline for model evaluation and comparison. While on par with the best models on MSRA-5K dataset, our model wins over other models on our data highlighting a serious drawback of existing models, which is convoluting the processes of locating the most salient object and its segmentation. We also provide a review and statistical analysis of some labeled scene datasets that can be used for evaluating salient object detection models. We believe that our work can greatly help remedy the over-fitting of models to existing biased datasets and opens new venues for future research in this fast-evolving field.
The rapid growth of real-time huge data capturing has pushed the deep learning and data analytic computing to the edge systems. Real-time object recognition on the edge is one of the representative deep neural network (DNN) powered edge systems for real-world mission-critical applications, such as autonomous driving and augmented reality. While DNN powered object detection edge systems celebrate many life-enriching opportunities, they also open doors for misuse and abuse. This paper presents three Targeted adversarial Objectness Gradient attacks, coined as TOG, which can cause the state-of-the-art deep object detection networks to suffer from object-vanishing, object-fabrication, and object-mislabeling attacks. We also present a universal objectness gradient attack to use adversarial transferability for black-box attacks, which is effective on any inputs with negligible attack time cost, low human perceptibility, and particularly detrimental to object detection edge systems. We report our experimental measurements using two benchmark datasets (PASCAL VOC and MS COCO) on two state-of-the-art detection algorithms (YOLO and SSD). The results demonstrate serious adversarial vulnerabilities and the compelling need for developing robust object detection systems.