To alleviate the heavy annotation burden for training a reliable crowd counting model and thus make the model more practicable and accurate by being able to benefit from more data, this paper presents a new semi-supervised method based on the mean teacher framework. When there is a scarcity of labeled data available, the model is prone to overfit local patches. Within such contexts, the conventional approach of solely improving the accuracy of local patch predictions through unlabeled data proves inadequate. Consequently, we propose a more nuanced approach: fostering the model's intrinsic 'subitizing' capability. This ability allows the model to accurately estimate the count in regions by leveraging its understanding of the crowd scenes, mirroring the human cognitive process. To achieve this goal, we apply masking on unlabeled data, guiding the model to make predictions for these masked patches based on the holistic cues. Furthermore, to help with feature learning, herein we incorporate a fine-grained density classification task. Our method is general and applicable to most existing crowd counting methods as it doesn't have strict structural or loss constraints. In addition, we observe that the model trained with our framework exhibits a 'subitizing'-like behavior. It accurately predicts low-density regions with only a 'glance', while incorporating local details to predict high-density regions. Our method achieves the state-of-the-art performance, surpassing previous approaches by a large margin on challenging benchmarks such as ShanghaiTech A and UCF-QNRF. The code is available at: https://github.com/cha15yq/MRC-Crowd.
Meta AI recently released the Segment Anything model (SAM), which has garnered attention due to its impressive performance in class-agnostic segmenting. In this study, we explore the use of SAM for the challenging task of few-shot object counting, which involves counting objects of an unseen category by providing a few bounding boxes of examples. We compare SAM's performance with other few-shot counting methods and find that it is currently unsatisfactory without further fine-tuning, particularly for small and crowded objects. Code can be found at \url{https://github.com/Vision-Intelligence-and-Robots-Group/count-anything}.
Task allocation plays a vital role in multi-robot autonomous cleaning systems, where multiple robots work together to clean a large area. However, most current studies mainly focus on deterministic, single-task allocation for cleaning robots, without considering hybrid tasks in uncertain working environments. Moreover, there is a lack of datasets and benchmarks for relevant research. In this paper, to address these problems, we formulate multi-robot hybrid-task allocation under the uncertain cleaning environment as a robust optimization problem. Firstly, we propose a novel robust mixed-integer linear programming model with practical constraints including the task order constraint for different tasks and the ability constraints of hybrid robots. Secondly, we establish a dataset of \emph{100} instances made from floor plans, each of which has 2D manually-labeled images and a 3D model. Thirdly, we provide comprehensive results on the collected dataset using three traditional optimization approaches and a deep reinforcement learning-based solver. The evaluation results show that our solution meets the needs of multi-robot cleaning task allocation and the robust solver can protect the system from worst-case scenarios with little additional cost. The benchmark will be available at {https://github.com/iamwangyabin/Multi-robot-Cleaning-Task-Allocation}.
Although data-free incremental learning methods are memory-friendly, accurately estimating and counteracting representation shifts is challenging in the absence of historical data. This paper addresses this thorny problem by proposing a novel incremental learning method inspired by human analogy capabilities. Specifically, we design an analogy-making mechanism to remap the new data into the old class by prompt tuning. It mimics the feature distribution of the target old class on the old model using only samples of new classes. The learnt prompts are further used to estimate and counteract the representation shift caused by fine-tuning for the historical prototypes. The proposed method sets up new state-of-the-art performance on four incremental learning benchmarks under both the class and domain incremental learning settings. It consistently outperforms data-replay methods by only saving feature prototypes for each class. It has almost hit the empirical upper bound by joint training on the Core50 benchmark. The code will be released at \url{https://github.com/ZhihengCV/A-Prompts}.
Deepfake technologies have been blurring the boundaries between the real and unreal, likely resulting in malicious events. By leveraging newly emerged deepfake technologies, deepfake researchers have been making a great upending to create deepfake artworks (deeparts), which are further closing the gap between reality and fantasy. To address potentially appeared ethics questions, this paper establishes a deepart detection database (DDDB) that consists of a set of high-quality conventional art images (conarts) and five sets of deepart images generated by five state-of-the-art deepfake models. This database enables us to explore once-for-all deepart detection and continual deepart detection. For the two new problems, we suggest four benchmark evaluations and four families of solutions on the constructed DDDB. The comprehensive study demonstrates the effectiveness of the proposed solutions on the established benchmark dataset, which is capable of paving a way to more interesting directions of deepart detection. The constructed benchmark dataset and the source code will be made publicly available.
This paper focuses on the prevalent performance imbalance in the stages of incremental learning. To avoid obvious stage learning bottlenecks, we propose a brand-new stage-isolation based incremental learning framework, which leverages a series of stage-isolated classifiers to perform the learning task of each stage without the interference of others. To be concrete, to aggregate multiple stage classifiers as a uniform one impartially, we first introduce a temperature-controlled energy metric for indicating the confidence score levels of the stage classifiers. We then propose an anchor-based energy self-normalization strategy to ensure the stage classifiers work at the same energy level. Finally, we design a voting-based inference augmentation strategy for robust inference. The proposed method is rehearsal free and can work for almost all continual learning scenarios. We evaluate the proposed method on four large benchmarks. Extensive results demonstrate the superiority of the proposed method in setting up new state-of-the-art overall performance. \emph{Code is available at} \url{https://github.com/iamwangyabin/ESN}.
In this paper, we propose a new agency-guided semi-supervised counting approach. First, we build a learnable auxiliary structure, namely the density agency to bring the recognized foreground regional features close to corresponding density sub-classes (agents) and push away background ones. Second, we propose a density-guided contrastive learning loss to consolidate the backbone feature extractor. Third, we build a regression head by using a transformer structure to refine the foreground features further. Finally, an efficient noise depression loss is provided to minimize the negative influence of annotation noises. Extensive experiments on four challenging crowd counting datasets demonstrate that our method achieves superior performance to the state-of-the-art semi-supervised counting methods by a large margin. Code is available.
State-of-the-art deep neural networks are still struggling to address the catastrophic forgetting problem in continual learning. In this paper, we propose one simple paradigm (named as S-Prompting) and two concrete approaches to highly reduce the forgetting degree in one of the most typical continual learning scenarios, i.e., domain increment learning (DIL). The key idea of the paradigm is to learn prompts independently across domains with pre-trained transformers, avoiding the use of exemplars that commonly appear in conventional methods. This results in a win-win game where the prompting can achieve the best for each domain. The independent prompting across domains only requests one single cross-entropy loss for training and one simple K-NN operation as a domain identifier for inference. The learning paradigm derives an image prompt learning approach and a brand-new language-image prompt learning approach. Owning an excellent scalability (0.03% parameter increase per domain), the best of our approaches achieves a remarkable relative improvement (an average of about 30%) over the best of the state-of-the-art exemplar-free methods for three standard DIL tasks, and even surpasses the best of them relatively by about 6% in average when they use exemplars.