Salient object detection or salient region detection models, diverging from fixation prediction models, have traditionally been dealing with locating and segmenting the most salient object or region in a scene. While the notion of most salient object is sensible when multiple objects exist in a scene, current datasets for evaluation of saliency detection approaches often have scenes with only one single object. We introduce three main contributions in this paper: First, we take an indepth look at the problem of salient object detection by studying the relationship between where people look in scenes and what they choose as the most salient object when they are explicitly asked. Based on the agreement between fixations and saliency judgments, we then suggest that the most salient object is the one that attracts the highest fraction of fixations. Second, we provide two new less biased benchmark datasets containing scenes with multiple objects that challenge existing saliency models. Indeed, we observed a severe drop in performance of 8 state-of-the-art models on our datasets (40% to 70%). Third, we propose a very simple yet powerful model based on superpixels to be used as a baseline for model evaluation and comparison. While on par with the best models on MSRA-5K dataset, our model wins over other models on our data highlighting a serious drawback of existing models, which is convoluting the processes of locating the most salient object and its segmentation. We also provide a review and statistical analysis of some labeled scene datasets that can be used for evaluating salient object detection models. We believe that our work can greatly help remedy the over-fitting of models to existing biased datasets and opens new venues for future research in this fast-evolving field.
This paper presents an approach to detect out-of-context (OOC) objects in an image. Given an image with a set of objects, our goal is to determine if an object is inconsistent with the scene context and detect the OOC object with a bounding box. In this work, we consider commonly explored contextual relations such as co-occurrence relations, the relative size of an object with respect to other objects, and the position of the object in the scene. We posit that contextual cues are useful to determine object labels for in-context objects and inconsistent context cues are detrimental to determining object labels for out-of-context objects. To realize this hypothesis, we propose a graph contextual reasoning network (GCRN) to detect OOC objects. GCRN consists of two separate graphs to predict object labels based on the contextual cues in the image: 1) a representation graph to learn object features based on the neighboring objects and 2) a context graph to explicitly capture contextual cues from the neighboring objects. GCRN explicitly captures the contextual cues to improve the detection of in-context objects and identify objects that violate contextual relations. In order to evaluate our approach, we create a large-scale dataset by adding OOC object instances to the COCO images. We also evaluate on recent OCD benchmark. Our results show that GCRN outperforms competitive baselines in detecting OOC objects and correctly detecting in-context objects.
The rapid growth of real-time huge data capturing has pushed the deep learning and data analytic computing to the edge systems. Real-time object recognition on the edge is one of the representative deep neural network (DNN) powered edge systems for real-world mission-critical applications, such as autonomous driving and augmented reality. While DNN powered object detection edge systems celebrate many life-enriching opportunities, they also open doors for misuse and abuse. This paper presents three Targeted adversarial Objectness Gradient attacks, coined as TOG, which can cause the state-of-the-art deep object detection networks to suffer from object-vanishing, object-fabrication, and object-mislabeling attacks. We also present a universal objectness gradient attack to use adversarial transferability for black-box attacks, which is effective on any inputs with negligible attack time cost, low human perceptibility, and particularly detrimental to object detection edge systems. We report our experimental measurements using two benchmark datasets (PASCAL VOC and MS COCO) on two state-of-the-art detection algorithms (YOLO and SSD). The results demonstrate serious adversarial vulnerabilities and the compelling need for developing robust object detection systems.
Object detection models perform well at localizing and classifying objects that they are shown during training. However, due to the difficulty and cost associated with creating and annotating detection datasets, trained models detect a limited number of object types with unknown objects treated as background content. This hinders the adoption of conventional detectors in real-world applications like large-scale object matching, visual grounding, visual relation prediction, obstacle detection (where it is more important to determine the presence and location of objects than to find specific types), etc. We propose class-agnostic object detection as a new problem that focuses on detecting objects irrespective of their object-classes. Specifically, the goal is to predict bounding boxes for all objects in an image but not their object-classes. The predicted boxes can then be consumed by another system to perform application-specific classification, retrieval, etc. We propose training and evaluation protocols for benchmarking class-agnostic detectors to advance future research in this domain. Finally, we propose (1) baseline methods and (2) a new adversarial learning framework for class-agnostic detection that forces the model to exclude class-specific information from features used for predictions. Experimental results show that adversarial learning improves class-agnostic detection efficacy.
Object detection is a fundamental step for automated video analysis in many vision applications. Object detection in a video is usually performed by object detectors or background subtraction techniques. Often, an object detector requires manually labeled examples to train a binary classifier, while background subtraction needs a training sequence that contains no objects to build a background model. To automate the analysis, object detection without a separate training phase becomes a critical task. People have tried to tackle this task by using motion information. But existing motion-based methods are usually limited when coping with complex scenarios such as nonrigid motion and dynamic background. In this paper, we show that above challenges can be addressed in a unified framework named DEtecting Contiguous Outliers in the LOw-rank Representation (DECOLOR). This formulation integrates object detection and background learning into a single process of optimization, which can be solved by an alternating algorithm efficiently. We explain the relations between DECOLOR and other sparsity-based methods. Experiments on both simulated data and real sequences demonstrate that DECOLOR outperforms the state-of-the-art approaches and it can work effectively on a wide range of complex scenarios.
Many recent studies have shown that deep neural models are vulnerable to adversarial samples: images with imperceptible perturbations, for example, can fool image classifiers. In this paper, we generate adversarial examples for object detection, which entails detecting bounding boxes around multiple objects present in the image and classifying them at the same time, making it a harder task than against image classification. We specifically aim to attack the widely used Faster R-CNN by changing the predicted label for a particular object in an image: where prior work has targeted one specific object (a stop sign), we generalise to arbitrary objects, with the key challenge being the need to change the labels of all bounding boxes for all instances of that object type. To do so, we propose a novel method, named Pick-Object-Attack. Pick-Object-Attack successfully adds perturbations only to bounding boxes for the targeted object, preserving the labels of other detected objects in the image. In terms of perceptibility, the perturbations induced by the method are very small. Furthermore, for the first time, we examine the effect of adversarial attacks on object detection in terms of a downstream task, image captioning; we show that where a method that can modify all object types leads to very obvious changes in captions, the changes from our constrained attack are much less apparent.
In contrast to object recognition models, humans do not blindly trust their perception when building representations of the world, instead recruiting metacognition to detect percepts that are unreliable or false, such as when we realize that we mistook one object for another. We propose METAGEN, an unsupervised model that enhances object recognition models through a metacognition. Given noisy output from an object-detection model, METAGEN learns a meta-representation of how its perceptual system works and uses it to infer the objects in the world responsible for the detections. METAGEN achieves this by conditioning its inference on basic principles of objects that even human infants understand (known as Spelke principles: object permanence, cohesion, and spatiotemporal continuity). We test METAGEN on a variety of state-of-the-art object detection neural networks. We find that METAGEN quickly learns an accurate metacognitive representation of the neural network, and that this improves detection accuracy by filling in objects that the detection model missed and removing hallucinated objects. This approach enables generalization to out-of-sample data and outperforms comparison models that lack a metacognition.
Object detection in videos is an important task in computer vision for various applications such as object tracking, video summarization and video search. Although great progress has been made in improving the accuracy of object detection in recent years due to improved techniques for training and deploying deep neural networks, they are computationally very intensive. For example, processing a video at 300x300 resolution using the SSD300 (Single Shot Detector) object detection network with VGG16 as backbone at 30 fps requires 1.87 trillion FLOPS/s. In order to address this challenge, we make two important observations in the context of videos. In some scenarios, most of the regions in a video frame are background and the salient objects occupy only a small fraction of the area in the frame. Further, in a video, there is a strong temporal correlation between consecutive frames. Based on these observations, we propose Pack and Detect (PaD) to reduce the computational requirements for the task of object detection in videos using neural networks. In PaD, the input video frame is processed at full size in selected frames called anchor frames. In the frames between the anchor frames, namely inter-anchor frames, the regions of interest(ROI) are identified based on the detections in the previous frame. We propose an algorithm to pack the ROI's of each inter-anchor frame together in a lower sized frame. In order to maintain the accuracy of object detection, the proposed algorithm expands the ROI's greedily to provide more background information to the detector. The computational requirements are reduced due to the lower size of the input. This method can potentially reduce the number of FLOPS required for a frame by 4x. Tuning the algorithm parameters can provide a 1.3x increase in throughput with only a 2.5% drop in accuracy.
This paper proposes a novel approach to create an automated visual surveillance system which is very efficient in detecting and tracking moving objects in a video captured by moving camera without any apriori information about the captured scene. Separating foreground from the background is challenging job in videos captured by moving camera as both foreground and background information change in every consecutive frames of the image sequence; thus a pseudo-motion is perceptive in background. In the proposed algorithm, the pseudo-motion in background is estimated and compensated using phase correlation of consecutive frames based on the principle of Fourier shift theorem. Then a method is proposed to model an acting background from recent history of commonality of the current frame and the foreground is detected by the differences between the background model and the current frame. Further exploiting the recent history of dissimilarities of the current frame, actual moving objects are detected in the foreground. Next, a two-stepped morphological operation is proposed to refine the object region for an optimum object size. Each object is attributed by its centroid, dimension and three highest peaks of its gray value histogram. Finally, each object is tracked using Kalman filter based on its attributes. The major advantage of this algorithm over most of the existing object detection and tracking algorithms is that, it does not require initialization of object position in the first frame or training on sample data to perform. Performance of the algorithm is tested on benchmark videos containing variable background and very satisfiable results is achieved. The performance of the algorithm is also comparable with some of the state-of-the-art algorithms for object detection and tracking.
Object detection is a very important function of visual perception systems. Since the early days of classical object detection based on HOG to modern deep learning based detectors, object detection has improved in accuracy. Two stage detectors usually have higher accuracy than single stage ones. Both types of detectors use some form of quantization of the search space of rectangular regions of image. There are far more of the quantized elements than true objects. The way these bounding boxes are filtered out possibly results in the false positive and false negatives. This empirical experimental study explores ways of reducing false positives and negatives with labelled data.. In the process also discovered insufficient labelling in Openimage 2019 Object Detection dataset.