Text-to-image person re-identification (TIReID) aims to retrieve the target person from an image gallery via a textual description query. Recently, pre-trained vision-language models like CLIP have attracted significant attention and have been widely utilized for this task due to their robust capacity for semantic concept learning and rich multi-modal knowledge. However, recent CLIP-based TIReID methods commonly rely on direct fine-tuning of the entire network to adapt the CLIP model for the TIReID task. Although these methods show competitive performance on this topic, they are suboptimal as they necessitate simultaneous domain adaptation and task adaptation. To address this issue, we attempt to decouple these two processes during the training stage. Specifically, we introduce the prompt tuning strategy to enable domain adaptation and propose a two-stage training approach to disentangle domain adaptation from task adaptation. In the first stage, we freeze the two encoders from CLIP and solely focus on optimizing the prompts to alleviate domain gap between the original training data of CLIP and downstream tasks. In the second stage, we maintain the fixed prompts and fine-tune the CLIP model to prioritize capturing fine-grained information, which is more suitable for TIReID task. Finally, we evaluate the effectiveness of our method on three widely used datasets. Compared to the directly fine-tuned approach, our method achieves significant improvements.
This paper proposes a spatiotemporal clustering algorithm and its implementation in the R package spotoroo. This work is motivated by the catastrophic bushfires in Australia throughout the summer of 2019-2020 and made possible by the availability of satellite hotspot data. The algorithm is inspired by two existing spatiotemporal clustering algorithms but makes enhancements to cluster points spatially in conjunction with their movement across consecutive time periods. It also allows for the adjustment of key parameters, if required, for different locations and satellite data sources. Bushfire data from Victoria, Australia, is used to illustrate the algorithm and its use within the package.
Hyperspectral Imaging (HSI) provides detailed spectral information and has been utilised in many real-world applications. This work introduces an HSI dataset of building facades in a light industry environment with the aim of classifying different building materials in a scene. The dataset is called the Light Industrial Building HSI (LIB-HSI) dataset. This dataset consists of nine categories and 44 classes. In this study, we investigated deep learning based semantic segmentation algorithms on RGB and hyperspectral images to classify various building materials, such as timber, brick and concrete.
Point cloud analysis is receiving increasing attention, however, most existing point cloud models lack the practical ability to deal with the unavoidable presence of unknown objects. This paper mainly discusses point cloud analysis under open-set settings, where we train the model without data from unknown classes and identify them in the inference stage. Basically, we propose to solve open-set point cloud analysis using a novel Point Cut-and-Mix mechanism consisting of Unknown-Point Simulator and Unknown-Point Estimator modules. Specifically, we use the Unknown-Point Simulator to simulate unknown data in the training stage by manipulating the geometric context of partial known data. Based on this, the Unknown-Point Estimator module learns to exploit the point cloud's feature context for discriminating the known and unknown data. Extensive experiments show the plausibility of open-set point cloud analysis and the effectiveness of our proposed solutions. Our code is available at \url{https://github.com/ShiQiu0419/pointcam}.
Learning a latent embedding to understand the underlying nature of data distribution is often formulated in Euclidean spaces with zero curvature. However, the success of the geometry constraints, posed in the embedding space, indicates that curved spaces might encode more structural information, leading to better discriminative power and hence richer representations. In this work, we investigate benefits of the curved space for analyzing anomalies or out-of-distribution objects in data. This is achieved by considering embeddings via three geometry constraints, namely, spherical geometry (with positive curvature), hyperbolic geometry (with negative curvature) or mixed geometry (with both positive and negative curvatures). Three geometric constraints can be chosen interchangeably in a unified design given the task at hand. Tailored for the embeddings in the curved space, we also formulate functions to compute the anomaly score. Two types of geometric modules (i.e., Geometric-in-One and Geometric-in-Two models) are proposed to plug in the original Euclidean classifier, and anomaly scores are computed from the curved embeddings. We evaluate the resulting designs under a diverse set of visual recognition scenarios, including image detection (multi-class OOD detection and one-class anomaly detection) and segmentation (multi-class anomaly segmentation and one-class anomaly segmentation). The empirical results show the effectiveness of our proposal through the consistent improvement over various scenarios.
With the human pursuit of knowledge, open-set object detection (OSOD) has been designed to identify unknown objects in a dynamic world. However, an issue with the current setting is that all the predicted unknown objects share the same category as "unknown", which require incremental learning via a human-in-the-loop approach to label novel classes. In order to address this problem, we present a new task, namely Open-Set Object Detection and Discovery (OSODD). This new task aims to extend the ability of open-set object detectors to further discover the categories of unknown objects based on their visual appearance without human effort. We propose a two-stage method that first uses an open-set object detector to predict both known and unknown objects. Then, we study the representation of predicted objects in an unsupervised manner and discover new categories from the set of unknown objects. With this method, a detector is able to detect objects belonging to known classes and define novel categories for objects of unknown classes with minimal supervision. We show the performance of our model on the MS-COCO dataset under a thorough evaluation protocol. We hope that our work will promote further research towards a more robust real-world detection system.
In this paper, we present and study a new image segmentation task, called Generalized Open-set Semantic Segmentation (GOSS). Previously, with the well-known open-set semantic segmentation (OSS), the intelligent agent only detects the unknown regions without further processing, limiting their perception of the environment. It stands to reason that a further analysis of the detected unknown pixels would be beneficial. Therefore, we propose GOSS, which unifies the abilities of two well-defined segmentation tasks, OSS and generic segmentation (GS), in a holistic way. Specifically, GOSS classifies pixels as belonging to known classes, and clusters (or groups) of pixels of unknown class are labelled as such. To evaluate this new expanded task, we further propose a metric which balances the pixel classification and clustering aspects. Moreover, we build benchmark tests on top of existing datasets and propose a simple neural architecture as a baseline, which jointly predicts pixel classification and clustering under open-set settings. Our experiments on multiple benchmarks demonstrate the effectiveness of our baseline. We believe our new GOSS task can produce an expressive image understanding for future research. Code will be made available.
We present You Only Cut Once (YOCO) for performing data augmentations. YOCO cuts one image into two pieces and performs data augmentations individually within each piece. Applying YOCO improves the diversity of the augmentation per sample and encourages neural networks to recognize objects from partial information. YOCO enjoys the properties of parameter-free, easy usage, and boosting almost all augmentations for free. Thorough experiments are conducted to evaluate its effectiveness. We first demonstrate that YOCO can be seamlessly applied to varying data augmentations, neural network architectures, and brings performance gains on CIFAR and ImageNet classification tasks, sometimes surpassing conventional image-level augmentation by large margins. Moreover, we show YOCO benefits contrastive pre-training toward a more powerful representation that can be better transferred to multiple downstream tasks. Finally, we study a number of variants of YOCO and empirically analyze the performance for respective settings. Code is available at GitHub.
We present and study a novel task named Blind Image Decomposition (BID), which requires separating a superimposed image into constituent underlying images in a blind setting, that is, both the source components involved in mixing as well as the mixing mechanism are unknown. For example, rain may consist of multiple components, such as rain streaks, raindrops, snow, and haze. Rainy images can be treated as an arbitrary combination of these components, some of them or all of them. How to decompose superimposed images, like rainy images, into distinct source components is a crucial step towards real-world vision systems. To facilitate research on this new task, we construct three benchmark datasets, including mixed image decomposition across multiple domains, real-scenario deraining, and joint shadow/reflection/watermark removal. Moreover, we propose a simple yet general Blind Image Decomposition Network (BIDeN) to serve as a strong baseline for future work. Experimental results demonstrate the tenability of our benchmarks and the effectiveness of BIDeN. Code and project page are available.