With the human pursuit of knowledge, open-set object detection (OSOD) has been designed to identify unknown objects in a dynamic world. However, an issue with the current setting is that all the predicted unknown objects share the same category as "unknown", which require incremental learning via a human-in-the-loop approach to label novel classes. In order to address this problem, we present a new task, namely Open-Set Object Detection and Discovery (OSODD). This new task aims to extend the ability of open-set object detectors to further discover the categories of unknown objects based on their visual appearance without human effort. We propose a two-stage method that first uses an open-set object detector to predict both known and unknown objects. Then, we study the representation of predicted objects in an unsupervised manner and discover new categories from the set of unknown objects. With this method, a detector is able to detect objects belonging to known classes and define novel categories for objects of unknown classes with minimal supervision. We show the performance of our model on the MS-COCO dataset under a thorough evaluation protocol. We hope that our work will promote further research towards a more robust real-world detection system.
Salient object detection in complex scenes and environments is a challenging research topic. Most works focus on RGB-based salient object detection, which limits its performance of real-life applications when confronted with adverse conditions such as dark environments and complex backgrounds. Taking advantage of RGB and thermal infrared images becomes a new research direction for detecting salient object in complex scenes recently, as thermal infrared spectrum imaging provides the complementary information and has been applied to many computer vision tasks. However, current research for RGBT salient object detection is limited by the lack of a large-scale dataset and comprehensive benchmark. This work contributes such a RGBT image dataset named VT5000, including 5000 spatially aligned RGBT image pairs with ground truth annotations. VT5000 has 11 challenges collected in different scenes and environments for exploring the robustness of algorithms. With this dataset, we propose a powerful baseline approach, which extracts multi-level features within each modality and aggregates these features of all modalities with the attention mechanism, for accurate RGBT salient object detection. Extensive experiments show that the proposed baseline approach outperforms the state-of-the-art methods on VT5000 dataset and other two public datasets. In addition, we carry out a comprehensive analysis of different algorithms of RGBT salient object detection on VT5000 dataset, and then make several valuable conclusions and provide some potential research directions for RGBT salient object detection.
Existing CNNs-based salient object detection (SOD) heavily depends on the large-scale pixel-level annotations, which is labor-intensive, time-consuming, and expensive. By contrast, the sparse annotations become appealing to the salient object detection community. However, few efforts are devoted to learning salient object detection from sparse annotations, especially in the remote sensing field. In addition, the sparse annotation usually contains scanty information, which makes it challenging to train a well-performing model, resulting in its performance largely lagging behind the fully-supervised models. Although some SOD methods adopt some prior cues to improve the detection performance, they usually lack targeted discrimination of object boundaries and thus provide saliency maps with poor boundary localization. To this end, in this paper, we propose a novel weakly-supervised salient object detection framework to predict the saliency of remote sensing images from sparse scribble annotations. To implement it, we first construct the scribble-based remote sensing saliency dataset by relabelling an existing large-scale SOD dataset with scribbles, namely S-EOR dataset. After that, we present a novel scribble-based boundary-aware network (SBA-Net) for remote sensing salient object detection. Specifically, we design a boundary-aware module (BAM) to explore the object boundary semantics, which is explicitly supervised by the high-confidence object boundary (pseudo) labels generated by the boundary label generation (BLG) module, forcing the model to learn features that highlight the object structure and thus boosting the boundary localization of objects. Then, the boundary semantics are integrated with high-level features to guide the salient object detection under the supervision of scribble labels.
Object detection and recognition are important problems in computer vision. Since these problems are meta-heuristic, despite a lot of research, practically usable, intelligent, real-time, and dynamic object detection/recognition methods are still unavailable. We propose a new object detection/recognition method, which improves over the existing methods in every stage of the object detection/recognition process. In addition to the usual features, we propose to use geometric shapes, like linear cues, ellipses and quadrangles, as additional features. The full potential of geometric cues is exploited by using them to extract other features in a robust, computationally efficient, and less meta-heuristic manner. We also propose a new hierarchical codebook, which provides good generalization and discriminative properties. The codebook enables fast multi-path inference mechanisms based on propagation of conditional likelihoods, that make it robust to occlusion and noise. It has the capability of dynamic learning. We also propose a new learning method that has generative and discriminative learning capabilities, does not need large and fully supervised training dataset, and is capable of online learning. The preliminary work of detecting geometric shapes in real images has been completed. This preliminary work is the focus of this report. Future path for realizing the proposed object detection/recognition method is also discussed in brief.
Recent work in adversarial machine learning started to focus on the visual perception in autonomous driving and studied Adversarial Examples (AEs) for object detection models. However, in such visual perception pipeline the detected objects must also be tracked, in a process called Multiple Object Tracking (MOT), to build the moving trajectories of surrounding obstacles. Since MOT is designed to be robust against errors in object detection, it poses a general challenge to existing attack techniques that blindly target objection detection: we find that a success rate of over 98% is needed for them to actually affect the tracking results, a requirement that no existing attack technique can satisfy. In this paper, we are the first to study adversarial machine learning attacks against the complete visual perception pipeline in autonomous driving, and discover a novel attack technique, tracker hijacking, that can effectively fool MOT using AEs on object detection. Using our technique, successful AEs on as few as one single frame can move an existing object in to or out of the headway of an autonomous vehicle to cause potential safety hazards. We perform evaluation using the Berkeley Deep Drive dataset and find that on average when 3 frames are attacked, our attack can have a nearly 100% success rate while attacks that blindly target object detection only have up to 25%.
Salient object detection has seen remarkable progress driven by deep learning techniques. However, most of deep learning based salient object detection methods are black-box in nature and lacking in interpretability. This paper proposes the first self-explanatory saliency detection network that explicitly exploits low- and high-level features for salient object detection. We demonstrate that such supportive clues not only significantly enhances performance of salient object detection but also gives better justified detection results. More specifically, we develop a multi-stage saliency encoder to extract multi-scale features which contain both low- and high-level saliency context. Dense short- and long-range connections are introduced to reuse these features iteratively. Benefiting from the direct access to low- and high-level features, the proposed saliency encoder can not only model the object context but also preserve the boundary. Furthermore, a self-explanatory generator is proposed to interpret how the proposed saliency encoder or other deep saliency models making decisions. The generator simulates the absence of interesting features by preventing these features from contributing to the saliency classifier and estimates the corresponding saliency prediction without these features. A comparison function, saliency explanation, is defined to measure the prediction changes between deep saliency models and corresponding generator. Through visualizing the differences, we can interpret the capability of different deep neural networks based saliency detection models and demonstrate that our proposed model indeed uses more reasonable structure for salient object detection. Extensive experiments on five popular benchmark datasets and the visualized saliency explanation demonstrate that the proposed method provides new state-of-the-art.
Object detection has been a challenging task in computer vision. Although significant progress has been made in object detection with deep neural networks, the attention mechanism is far from development. In this paper, we propose the hybrid attention mechanism for single-stage object detection. First, we present the modules of spatial attention, channel attention and aligned attention for single-stage object detection. In particular, stacked dilated convolution layers with symmetrically fixed rates are constructed to learn spatial attention. The channel attention is proposed with the cross-level group normalization and squeeze-and-excitation module. Aligned attention is constructed with organized deformable filters. Second, the three kinds of attention are unified to construct the hybrid attention mechanism. We then embed the hybrid attention into Retina-Net and propose the efficient single-stage HAR-Net for object detection. The attention modules and the proposed HAR-Net are evaluated on the COCO detection dataset. Experiments demonstrate that hybrid attention can significantly improve the detection accuracy and the HAR-Net can achieve the state-of-the-art 45.8\% mAP, outperform existing single-stage object detectors.
Tracking players in sports videos is commonly done in a tracking-by-detection framework, first detecting players in each frame, and then performing association over time. While for some sports tracking players is sufficient for game analysis, sports like hockey, tennis and polo may require additional detections, that include the object the player is holding (e.g. racket, stick). The baseline solution for this problem involves detecting these objects as separate classes, and matching them to player detections based on the intersection over union (IoU). This approach, however, leads to poor matching performance in crowded situations, as it does not model the relationship between players and objects. In this paper, we propose a simple yet efficient way to detect and match players and related objects at once without extra cost, by considering an implicit association for prediction of multiple objects through the same proposal box. We evaluate the method on a dataset of broadcast ice hockey videos, and also a new public dataset we introduce called COCO +Torso. On the ice hockey dataset, the proposed method boosts matching performance from 57.1% to 81.4%, while also improving the meanAP of player+stick detections from 68.4% to 88.3%. On the COCO +Torso dataset, we see matching improving from 47.9% to 65.2%. The COCO +Torso dataset, code and pre-trained models will be released at https://github.com/foreverYoungGitHub/detect-and-match-related-objects.
Accurate and robust detection of multi-class objects in optical remote sensing images is essential to many real-world applications such as urban planning, traffic control, searching and rescuing, etc. However, state-of-the-art object detection techniques designed for images captured using ground-level sensors usually experience a sharp performance drop when directly applied to remote sensing images, largely due to the object appearance differences in remote sensing images in term of sparse texture, low contrast, arbitrary orientations, large scale variations, etc. This paper presents a novel object detection network (CAD-Net) that exploits attention-modulated features as well as global and local contexts to address the new challenges in detecting objects from remote sensing images. The proposed CAD-Net learns global and local contexts of objects by capturing their correlations with the global scene (at scene-level) and the local neighboring objects or features (at object-level), respectively. In addition, it designs a spatial-and-scale-aware attention module that guides the network to focus on more informative regions and features as well as more appropriate feature scales. Experiments over two publicly available object detection datasets for remote sensing images demonstrate that the proposed CAD-Net achieves superior detection performance. The implementation codes will be made publicly available for facilitating future researches.