This paper explores the potential of Large Language Models(LLMs) in zero-shot anomaly detection for safe visual navigation. With the assistance of the state-of-the-art real-time open-world object detection model Yolo-World and specialized prompts, the proposed framework can identify anomalies within camera-captured frames that include any possible obstacles, then generate concise, audio-delivered descriptions emphasizing abnormalities, assist in safe visual navigation in complex circumstances. Moreover, our proposed framework leverages the advantages of LLMs and the open-vocabulary object detection model to achieve the dynamic scenario switch, which allows users to transition smoothly from scene to scene, which addresses the limitation of traditional visual navigation. Furthermore, this paper explored the performance contribution of different prompt components, provided the vision for future improvement in visual accessibility, and paved the way for LLMs in video anomaly detection and vision-language understanding.
The rise of machine learning in recent years has brought benefits to various research fields such as wide fire detection. Nevertheless, small object detection and rare object detection remain a challenge. To address this problem, we present a dataset automata that can generate ground truth paired datasets using diffusion models. Specifically, we introduce a mask-guided diffusion framework that can fusion the wildfire into the existing images while the flame position and size can be precisely controlled. In advance, to fill the gap that the dataset of wildfire images in specific scenarios is missing, we vary the background of synthesized images by controlling both the text prompt and input image. Furthermore, to solve the color tint problem or the well-known domain shift issue, we apply the CLIP model to filter the generated massive dataset to preserve quality. Thus, our proposed framework can generate a massive dataset of that images are high-quality and ground truth-paired, which well addresses the needs of the annotated datasets in specific tasks.
Decreased myocardial capillary density has been reported as an important histopathological feature associated with various heart disorders. Quantitative assessment of cardiac capillarization typically involves double immunostaining of cardiomyocytes (CMs) and capillaries in myocardial slices. In contrast, single immunostaining of basement membrane components is a straightforward approach to simultaneously label CMs and capillaries, presenting fewer challenges in background staining. However, subsequent image analysis always requires manual work in identifying and segmenting CMs and capillaries. Here, we developed an image analysis tool, AutoQC, to automatically identify and segment CMs and capillaries in immunofluorescence images of collagen type IV, a predominant basement membrane protein within the myocardium. In addition, commonly used capillarization-related measurements can be derived from segmentation masks. AutoQC features a weakly supervised instance segmentation algorithm by leveraging the power of a pre-trained segmentation model via prompt engineering. AutoQC outperformed YOLOv8-Seg, a state-of-the-art instance segmentation model, in both instance segmentation and capillarization assessment. Furthermore, the training of AutoQC required only a small dataset with bounding box annotations instead of pixel-wise annotations, leading to a reduced workload during network training. AutoQC provides an automated solution for quantifying cardiac capillarization in basement-membrane-immunostained myocardial slices, eliminating the need for manual image analysis once it is trained.
Medical time series data are indispensable in healthcare, providing critical insights for disease diagnosis, treatment planning, and patient management. The exponential growth in data complexity, driven by advanced sensor technologies, has presented challenges related to data labeling. Self-supervised learning (SSL) has emerged as a transformative approach to address these challenges, eliminating the need for extensive human annotation. In this study, we introduce a novel framework for Medical Time Series Representation Learning, known as MTS-LOF. MTS-LOF leverages the strengths of contrastive learning and Masked Autoencoder (MAE) methods, offering a unique approach to representation learning for medical time series data. By combining these techniques, MTS-LOF enhances the potential of healthcare applications by providing more sophisticated, context-rich representations. Additionally, MTS-LOF employs a multi-masking strategy to facilitate occlusion-invariant feature learning. This approach allows the model to create multiple views of the data by masking portions of it. By minimizing the discrepancy between the representations of these masked patches and the fully visible patches, MTS-LOF learns to capture rich contextual information within medical time series datasets. The results of experiments conducted on diverse medical time series datasets demonstrate the superiority of MTS-LOF over other methods. These findings hold promise for significantly enhancing healthcare applications by improving representation learning. Furthermore, our work delves into the integration of joint-embedding SSL and MAE techniques, shedding light on the intricate interplay between temporal and structural dependencies in healthcare data. This understanding is crucial, as it allows us to grasp the complexities of healthcare data analysis.
With more than 32% of the global energy used by commercial and residential buildings, there is an urgent need to revisit traditional approaches to Building Energy Management (BEM). With HVAC systems accounting for about 40% of the total energy cost in the commercial sector, we propose a low-complexity DRL-based model with multi-input multi-output architecture for the HVAC energy optimization of open-plan offices, which uses only a handful of controllable and accessible factors. The efficacy of our solution is evaluated through extensive analysis of the overall energy consumption and thermal comfort levels compared to a baseline system based on the existing HVAC schedule in a real building. This comparison shows that our method achieves 37% savings in energy consumption with minimum violation (<1%) of the desired temperature range during work hours. It takes only a total of 40 minutes for 5 epochs (about 7.75 minutes per epoch) to train a network with superior performance and covering diverse conditions for its low-complexity architecture; therefore, it easily adapts to changes in the building setups, weather conditions, occupancy rate, etc. Moreover, by enforcing smoothness on the control strategy, we suppress the frequent and unpleasant on/off transitions on HVAC units to avoid occupant discomfort and potential damage to the system. The generalizability of our model is verified by applying it to different building models and under various weather conditions.
This paper proposes a distributed version of Determinant Point Processing (DPP) inference to enhance multi-source data diversification under limited communication bandwidth. DPP is a popular probabilistic approach that improves data diversity by enforcing the repulsion of elements in the selected subsets. The well-studied Maximum A Posteriori (MAP) inference in DPP aims to identify the subset with the highest diversity quantified by DPP. However, this approach is limited by the presumption that all data samples are available at one point, which hinders its applicability to real-world applications such as traffic datasets where data samples are distributed across sources and communication between them is band-limited. Inspired by the techniques used in Multiple-Input Multiple-Output (MIMO) communication systems, we propose a strategy for performing MAP inference among distributed sources. Specifically, we show that a lower bound of the diversity-maximized distributed sample selection problem can be treated as a power allocation problem in MIMO systems. A determinant-preserved sparse representation of selected samples is used to perform sample precoding in local sources to be processed by DPP. Our method does not require raw data exchange among sources, but rather a band-limited feedback channel to send lightweight diversity measures, analogous to the CSI message in MIMO systems, from the center to data sources. The experiments show that our scalable approach can outperform baseline methods, including random selection, uninformed individual DPP with no feedback, and DPP with SVD-based feedback, in both i.i.d and non-i.i.d setups. Specifically, it achieves 1 to 6 log-difference diversity gain in the latent representation of CIFAR-10, CIFAR-100, StanfordCars, and GTSRB datasets.
Knowledge distillation is a powerful technique to compress large neural networks into smaller, more efficient networks. Softmax regression representation learning is a popular approach that uses a pre-trained teacher network to guide the learning of a smaller student network. While several studies explored the effectiveness of softmax regression representation learning, the underlying mechanism that provides knowledge transfer is not well understood. This paper presents Ideal Joint Classifier Knowledge Distillation (IJCKD), a unified framework that provides a clear and comprehensive understanding of the existing knowledge distillation methods and a theoretical foundation for future research. Using mathematical techniques derived from a theory of domain adaptation, we provide a detailed analysis of the student network's error bound as a function of the teacher. Our framework enables efficient knowledge transfer between teacher and student networks and can be applied to various applications.
In some practical learning tasks, such as traffic video analysis, the number of available training samples is restricted by different factors, such as limited communication bandwidth and computation power; therefore, it is imperative to select diverse data samples that contribute the most to the quality of the learning system. One popular approach to selecting diverse samples is Determinantal Point Process (DPP). However, it suffers from a few known drawbacks, such as restriction of the number of samples to the rank of the similarity matrix, and not being customizable for specific learning tasks (e.g., multi-level classification tasks). In this paper, we propose a new way of measuring task-oriented diversity based on the Rate-Distortion (RD) theory, appropriate for multi-level classification. To this end, we establish a fundamental relationship between DPP and RD theory, which led to designing RD-DPP, an RD-based value function to evaluate the diversity gain of data samples. We also observe that the upper bound of the diversity of data selected by DPP has a universal trend of phase transition that quickly approaches its maximum point, then slowly converges to its final limits, meaning that DPP is beneficial only at the beginning of sample accumulation. We use this fact to design a bi-modal approach for sequential data selection.
This paper offers a new authentication algorithm based on image matching of nano-resolution visual identifiers with tree-shaped patterns. The algorithm includes image-to-tree conversion by greedy extraction of the fractal pattern skeleton along with a custom-built graph matching algorithm that is robust against imaging artifacts such as scaling, rotation, scratch, and illumination change. The proposed algorithm is applicable to a variety of tree-structured image matching, but our focus is on dendrites, recently-developed visual identifiers. Dendrites are entropy rich and unclonable with existing 2D and 3D printers due to their natural randomness, nano-resolution granularity, and 3D facets, making them an appropriate choice for security applications such as supply chain trace and tracking. The proposed algorithm improves upon graph matching with standard image descriptors. For instance, image inconsistency due to the camera sensor noise may cause unexpected feature extraction leading to inaccurate tree conversion and authentication failure. Also, previous tree extraction algorithms are prohibitively slow hindering their scalability to large systems. In this paper, we fix the current issues of [1] and accelerate the key points extraction up to 10-times faster by implementing a new skeleton extraction method, a new key points searching algorithm, as well as an optimized key point matching algorithm. Using minimum enclosing circle and center points, make the algorithm robust to the choice of pattern shape. In contrast to [1] our algorithm handles general graphs with loop connections, therefore is applicable to a wider range of applications such as transportation map analysis, fingerprints, and retina vessel imaging.