Active learning is a commonly used approach that reduces the labeling effort required to train deep neural networks. However, the effectiveness of current active learning methods is limited by their closed-world assumptions, which assume that all data in the unlabeled pool comes from a set of predefined known classes. This assumption is often not valid in practical situations, as there may be unknown classes in the unlabeled data, leading to the active open-set annotation problem. The presence of unknown classes in the data can significantly impact the performance of existing active learning methods due to the uncertainty they introduce. To address this issue, we propose a novel data-centric active learning method called NEAT that actively annotates open-set data. NEAT is designed to label known classes data from a pool of both known and unknown classes unlabeled data. It utilizes the clusterability of labels to identify the known classes from the unlabeled pool and selects informative samples from those classes based on a consistency criterion that measures inconsistencies between model predictions and local feature distribution. Unlike the recently proposed learning-centric method for the same problem, NEAT is much more computationally efficient and is a data-centric active open-set annotation method. Our experiments demonstrate that NEAT achieves significantly better performance than state-of-the-art active learning methods for active open-set annotation.
Semantic segmentation models, while effective for in-distribution categories, face challenges in real-world deployment due to encountering out-of-distribution (OoD) objects. Detecting these OoD objects is crucial for safety-critical applications. Existing methods rely on anomaly scores, but choosing a suitable threshold for generating masks presents difficulties and can lead to fragmentation and inaccuracy. This paper introduces a method to convert anomaly Score To segmentation Mask, called S2M, a simple and effective framework for OoD detection in semantic segmentation. Unlike assigning anomaly scores to pixels, S2M directly segments the entire OoD object. By transforming anomaly scores into prompts for a promptable segmentation model, S2M eliminates the need for threshold selection. Extensive experiments demonstrate that S2M outperforms the state-of-the-art by approximately 10\% in IoU and 30\% in mean F1 score, on average, across various benchmarks including Fishyscapes, Segment-Me-If-You-Can, and RoadAnomaly datasets.
Two prevalent types of distributional shifts in machine learning are the covariate shift (as observed across different domains) and the semantic shift (as seen across different classes). Traditional OOD detection techniques typically address only one of these shifts. However, real-world testing environments often present a combination of both covariate and semantic shifts. In this study, we introduce a novel problem, semantic OOD detection across domains, which simultaneously addresses both distributional shifts. To this end, we introduce two regularization strategies: domain generalization regularization, which ensures semantic invariance across domains to counteract the covariate shift, and OOD detection regularization, designed to enhance OOD detection capabilities against the semantic shift through energy bounding. Through rigorous testing on three standard domain generalization benchmarks, our proposed framework showcases its superiority over conventional domain generalization approaches in terms of OOD detection performance. Moreover, it holds its ground by maintaining comparable InD classification accuracy.
Human affect recognition has been a significant topic in psychophysics and computer vision. However, the currently published datasets have many limitations. For example, most datasets contain frames that contain only information about facial expressions. Due to the limitations of previous datasets, it is very hard to either understand the mechanisms for affect recognition of humans or generalize well on common cases for computer vision models trained on those datasets. In this work, we introduce a brand new large dataset, the Video-based Emotion and Affect Tracking in Context Dataset (VEATIC), that can conquer the limitations of the previous datasets. VEATIC has 124 video clips from Hollywood movies, documentaries, and home videos with continuous valence and arousal ratings of each frame via real-time annotation. Along with the dataset, we propose a new computer vision task to infer the affect of the selected character via both context and character information in each video frame. Additionally, we propose a simple model to benchmark this new computer vision task. We also compare the performance of the pretrained model using our dataset with other similar datasets. Experiments show the competing results of our pretrained model via VEATIC, indicating the generalizability of VEATIC. Our dataset is available at https://veatic.github.io.
In this paper, we introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition. We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as incremental step grows. Furthermore, we observe that audio-visual correlations learned in previous tasks can be forgotten as incremental steps progress, leading to poor performance. To overcome these challenges, we propose AV-CIL, which incorporates Dual-Audio-Visual Similarity Constraint (D-AVSC) to maintain both instance-aware and class-aware semantic similarity between audio-visual modalities and Visual Attention Distillation (VAD) to retain previously learned audio-guided visual attentive ability. We create three audio-visual class-incremental datasets, AVE-Class-Incremental (AVE-CI), Kinetics-Sounds-Class-Incremental (K-S-CI), and VGGSound100-Class-Incremental (VS100-CI) based on the AVE, Kinetics-Sounds, and VGGSound datasets, respectively. Our experiments on AVE-CI, K-S-CI, and VS100-CI demonstrate that AV-CIL significantly outperforms existing class-incremental learning methods in audio-visual class-incremental learning. Code and data are available at: https://github.com/weiguoPian/AV-CIL_ICCV2023.
Given an image set without any labels, our goal is to train a model that maps each image to a point in a feature space such that, not only proximity indicates visual similarity, but where it is located directly encodes how prototypical the image is according to the dataset. Our key insight is to perform unsupervised feature learning in hyperbolic instead of Euclidean space, where the distance between points still reflect image similarity, and yet we gain additional capacity for representing prototypicality with the location of the point: The closer it is to the origin, the more prototypical it is. The latter property is simply emergent from optimizing the usual metric learning objective: The image similar to many training instances is best placed at the center of corresponding points in Euclidean space, but closer to the origin in hyperbolic space. We propose an unsupervised feature learning algorithm in Hyperbolic space with sphere pACKing. HACK first generates uniformly packed particles in the Poincar\'e ball of hyperbolic space and then assigns each image uniquely to each particle. Images after congealing are regarded more typical of the dataset it belongs to. With our feature mapper simply trained to spread out training instances in hyperbolic space, we observe that images move closer to the origin with congealing, validating our idea of unsupervised prototypicality discovery. We demonstrate that our data-driven prototypicality provides an easy and superior unsupervised instance selection to reduce sample complexity, increase model generalization with atypical instances and robustness with typical ones.
We introduce a novel robotic system for improving unseen object instance segmentation in the real world by leveraging long-term robot interaction with objects. Previous approaches either grasp or push an object and then obtain the segmentation mask of the grasped or pushed object after one action. Instead, our system defers the decision on segmenting objects after a sequence of robot pushing actions. By applying multi-object tracking and video object segmentation on the images collected via robot pushing, our system can generate segmentation masks of all the objects in these images in a self-supervised way. These include images where objects are very close to each other, and segmentation errors usually occur on these images for existing object segmentation networks. We demonstrate the usefulness of our system by fine-tuning segmentation networks trained on synthetic data with real-world data collected by our system. We show that, after fine-tuning, the segmentation accuracy of the networks is significantly improved both in the same domain and across different domains. In addition, we verify that the fine-tuned networks improve top-down robotic grasping of unseen objects in the real world.
Unsupervised lifelong learning refers to the ability to learn over time while memorizing previous patterns without supervision. Previous works assumed strong prior knowledge about the incoming data (e.g., knowing the class boundaries) which can be impossible to obtain in complex and unpredictable environments. In this paper, motivated by real-world scenarios, we formally define the online unsupervised lifelong learning problem with class-incremental streaming data, which is non-iid and single-pass. The problem is more challenging than existing lifelong learning problems due to the absence of labels and prior knowledge. To address the issue, we propose Self-Supervised ContrAstive Lifelong LEarning (SCALE) which extracts and memorizes knowledge on-the-fly. SCALE is designed around three major components: a pseudo-supervised contrastive loss, a self-supervised forgetting loss, and an online memory update for uniform subset selection. All three components are designed to work collaboratively to maximize learning performance. Our loss functions leverage pairwise similarity thus remove the dependency on supervision or prior knowledge. We perform comprehensive experiments of SCALE under iid and four non-iid data streams. SCALE outperforms the best state-of-the-art algorithm on all settings with improvements of up to 6.43%, 5.23% and 5.86% kNN accuracy on CIFAR-10, CIFAR-100 and SubImageNet datasets.