Abstract:Zero-shot 3D anomaly detection aims to identify anomalies without access to training data from target categories. However, existing methods mainly project 3D observations into multi-view representations, which primarily capture geometric cues rather than realistic visual semantics, and then process them with vision encoders pretrained on RGB data. This leads to a significant domain gap between the encoder and the projected representations. To address this issue, we propose Align3D-AD, a unified two-stage framework that leverages the RGB modality from auxiliary categories as cross-modal guidance for zero-shot 3D anomaly detection. First, we introduce a cross-modal feature alignment paradigm that maps rendering features into the RGB semantic space. Unlike prior works that implicitly rely on pretrained encoders, our method enables direct semantic transfer from RGB observations. A semantic consistency reweighting strategy further refines feature alignment by reweighting local regions according to holistic semantic consistency. Second, we propose a modality-aware prompt learning framework with dual-prompt contrastive alignment. By assigning independent prompts to RGB-aligned and rendering features, our method captures complementary semantics across modalities, while the contrastive alignment further enhances prompt representations to improve discriminability. Extensive experiments on MVTec3D-AD, Eyecandies, and Real3D-AD demonstrate that Align3D-AD consistently outperforms existing zero-shot methods under both one-vs-rest and cross-dataset settings, highlighting its generalization capability and robustness. Code and the dataset will be made available once our paper is accepted.
Abstract:Recent advances in multi-modal detection have significantly improved detection accuracy in challenging environments (e.g., low light, overexposure). By integrating RGB with modalities such as thermal and depth, multi-modal fusion increases data redundancy and system robustness. However, significant challenges remain in effectively extracting task-relevant information both within and across modalities, as well as in achieving precise cross-modal alignment. While CNNs excel at feature extraction, they are limited by constrained receptive fields, strong inductive biases, and difficulty in capturing long-range dependencies. Transformer-based models offer global context but suffer from quadratic computational complexity and are confined to pairwise correlation modeling. Mamba and other State Space Models (SSMs), on the other hand, are hindered by their sequential scanning mechanism, which flattens 2D spatial structures into 1D sequences, disrupting topological relationships and limiting the modeling of complex higher-order dependencies. To address these issues, we propose M2I2HA, a multi-modal perception network based on hypergraph theory. Our architecture includes an Intra-Hypergraph Enhancement module to capture global many-to-many high-order relationships within each modality, and an Inter-Hypergraph Fusion module to align, enhance, and fuse cross-modal features by bridging configuration and spatial gaps between data sources. We further introduce an M2-FullPAD module that enables adaptive multi-level fusion of the enhanced multi-modal features within the network while also improving data distribution and flow across the architecture. Extensive object detection experiments against baselines on multiple public datasets demonstrate that M2I2HA achieves state-of-the-art performance in multi-modal object detection tasks.
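
To make the many-to-many relationships that a hypergraph can express more concrete, the following is a minimal sketch of one round of generic hypergraph message passing (node to hyperedge to node, using mean aggregation). It is not the M2I2HA implementation: the function name `hypergraph_aggregate`, the binary incidence matrix `H`, and the toy features are all assumptions made for illustration.

```python
import numpy as np

def hypergraph_aggregate(X, H):
    """One node -> hyperedge -> node mean-aggregation step (illustrative only).

    X: (num_nodes, feat_dim) node features.
    H: (num_nodes, num_edges) binary incidence matrix; H[i, e] = 1 if node i
       belongs to hyperedge e. Assumes no empty hyperedge and no isolated node.
    """
    edge_deg = H.sum(axis=0)                      # number of nodes in each hyperedge
    node_deg = H.sum(axis=1)                      # number of hyperedges per node
    edge_feat = (H.T @ X) / edge_deg[:, None]     # mean of member nodes per hyperedge
    return (H @ edge_feat) / node_deg[:, None]    # mean of incident hyperedges per node

# Toy example (hypothetical): hyperedge 0 = {0, 1, 2}, hyperedge 1 = {2, 3}.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]], dtype=float)
out = hypergraph_aggregate(X, H)
```

Because a single hyperedge can join any number of nodes, node 2 here receives information from nodes 0, 1, and 3 in one step, which a pairwise edge cannot express directly.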




Abstract:Industrial quality inspection plays a critical role in modern manufacturing by identifying defective products during production. While single-modality approaches using either 3D point clouds or 2D RGB images suffer from information incompleteness, multimodal anomaly detection offers promise through the complementary fusion of cross-modal data. However, existing methods face challenges in effectively integrating unimodal results and improving discriminative power. To address these limitations, we first reinterpret memory bank-based anomaly scores in single modalities as isotropic Euclidean distances in local feature spaces. Starting from these Euclidean metrics, we propose a novel Geometry-Guided Score Fusion (G$^{2}$SF) framework that progressively learns an anisotropic local distance metric as a unified score for the fusion task. Through a geometric encoding operator, a novel Local Scale Prediction Network (LSPN) is proposed to predict direction-aware scaling factors that characterize first-order local feature distributions, thereby enhancing discrimination between normal and anomalous patterns. Additionally, we develop specialized loss functions and a score aggregation strategy from geometric priors to ensure both metric generalization and efficacy. Comprehensive evaluations on the MVTec-3D AD dataset demonstrate the state-of-the-art detection performance of our method, with a low false positive rate and better recall, both of which are essential in industrial applications. A detailed ablation analysis validates each component's contribution.
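
The contrast between an isotropic memory-bank score and an anisotropic local metric can be sketched as follows. This is a simplified illustration, not the G$^{2}$SF method: the per-item `scales` array is a hand-set, axis-aligned stand-in for the direction-aware factors that the abstract attributes to LSPN, and the function names and toy data are assumptions.

```python
import numpy as np

def isotropic_score(query, bank):
    """Memory-bank score: Euclidean distance to the nearest stored normal feature."""
    dists = np.linalg.norm(bank - query, axis=1)
    idx = int(np.argmin(dists))
    return float(dists[idx]), idx

def anisotropic_score(query, bank, scales):
    """Nearest-neighbor distance after dividing each axis by a per-item scale.

    scales[i] is a hypothetical, hand-set stand-in for predicted scaling factors:
    a large value means normal features vary a lot along that axis, so
    deviations there are penalized less.
    """
    diffs = (bank - query) / scales
    dists = np.linalg.norm(diffs, axis=1)
    idx = int(np.argmin(dists))
    return float(dists[idx]), idx

# Toy memory bank: normal features spread along x and stay tight along y.
bank = np.array([[0.0, 0.0], [4.0, 0.0]])
scales = np.array([[2.0, 0.5], [2.0, 0.5]])

q_along = np.array([1.0, 0.0])    # deviates along the direction of normal variation
q_across = np.array([0.0, 1.0])   # deviates across it

iso_along, _ = isotropic_score(q_along, bank)
iso_across, _ = isotropic_score(q_across, bank)
aniso_along, _ = anisotropic_score(q_along, bank, scales)
aniso_across, _ = anisotropic_score(q_across, bank, scales)
```

In this toy setup both queries get the same isotropic score (1.0), while the anisotropic score separates them (0.5 versus 2.0), which is the kind of added discrimination the abstract describes.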




Abstract:Surface anomaly classification is critical for manufacturing system fault diagnosis and quality control. However, the following challenges hinder accurate anomaly classification in practice: (i) Anomaly patterns exhibit intra-class variation and inter-class similarity, which makes it difficult to classify each sample accurately. (ii) Despite the predefined classes, new types of anomalies can occur during production and need to be detected accurately. (iii) Anomalous data is rare in manufacturing processes, leading to limited data for model learning. To tackle the above challenges simultaneously, this paper proposes a novel deep subspace learning-based 3D anomaly classification model. Specifically, starting from a lightweight encoder to extract the latent representations, we model each class as a subspace to account for the intra-class variation, while promoting distinct subspaces of different classes to tackle the inter-class similarity. Moreover, the explicit modeling of subspaces makes it possible to detect out-of-distribution samples, i.e., new types of anomalies, and provides a regularization effect, since the proposed subspace classifier has far fewer learnable parameters than the popular Multi-Layer Perceptrons (MLPs). Extensive numerical experiments demonstrate that our method achieves better anomaly classification results than benchmark methods and can effectively identify new types of anomalies.
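
The idea of modeling each class as a subspace and flagging samples that fit no subspace can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's model: it operates directly on toy latent vectors rather than on an encoder's output, fits each subspace with a plain SVD (no learned training objective), uses linear subspaces through the origin, and the names `fit_subspace`, `residual`, `classify`, and `ood_threshold` are hypothetical.

```python
import numpy as np

def fit_subspace(samples, dim):
    """Orthonormal basis (as columns) spanning the top-`dim` directions of a class."""
    _, _, vt = np.linalg.svd(samples, full_matrices=False)
    return vt[:dim].T

def residual(x, basis):
    """Norm of the component of x that the subspace does not explain."""
    return float(np.linalg.norm(x - basis @ (basis.T @ x)))

def classify(x, bases, ood_threshold):
    """Return the index of the best-fitting subspace, or -1 if none fits well."""
    res = [residual(x, b) for b in bases]
    best = int(np.argmin(res))
    return best if res[best] <= ood_threshold else -1

# Toy latent vectors (hypothetical): class 0 lies along x, class 1 along y.
class0 = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
class1 = np.array([[0.0, 1.0, 0.0], [0.0, -3.0, 0.0], [0.0, 2.0, 0.0]])
bases = [fit_subspace(class0, dim=1), fit_subspace(class1, dim=1)]

pred_a = classify(np.array([3.0, 0.1, 0.0]), bases, ood_threshold=0.5)
pred_b = classify(np.array([0.05, -2.0, 0.0]), bases, ood_threshold=0.5)
pred_new = classify(np.array([0.0, 0.0, 5.0]), bases, ood_threshold=0.5)
```

Samples of very different magnitude still fall in the same class as long as they stay close to its subspace (intra-class variation), while the third query has a large residual to every subspace and is reported as `-1`, standing in for a new anomaly type.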




Abstract:The surface quality inspection of manufacturing parts based on 3D point cloud data has attracted increasing attention in recent years, because a 3D point cloud can capture the entire surface of a part, unlike previous practices that focus on a few key product characteristics. However, achieving accurate 3D anomaly detection is challenging due to the complex surfaces of manufacturing parts and the difficulty of collecting sufficient anomaly samples. To address these challenges, we propose a novel untrained anomaly detection method based on 3D point cloud data for complex manufacturing parts, which can achieve accurate anomaly detection in a single sample without training data. In the proposed framework, we transform an input sample into two sets of profiles along different directions. Based on one set of profiles, a novel segmentation module is devised to segment the complex surface into multiple basic and simple components. In each component, the other set of profiles, which share similar shapes, can be modeled as a low-rank matrix. Thus, accurate 3D anomaly detection can be achieved by applying Robust Principal Component Analysis (RPCA) to these low-rank matrices. Extensive numerical experiments on different types of parts show that our method achieves promising results compared with the benchmark methods.
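
The final step, applying RPCA to a matrix of similarly shaped profiles, can be illustrated with a minimal sketch. It uses a standard principal component pursuit solver (alternating singular-value thresholding and soft thresholding with commonly used default parameters), which is an assumption here since the abstract does not say which RPCA solver is used. The synthetic sine profiles and the single injected defect are invented for illustration, and the sketch assumes anomalies end up in the sparse component.

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def rpca(M, max_iter=500, tol=1e-7):
    """Split M into low-rank L plus sparse S via principal component pursuit."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))            # common default sparsity weight
    mu = m * n / (4.0 * np.abs(M).sum())      # common default penalty parameter
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                      # Lagrange multiplier
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        gap = M - L - S
        Y = Y + mu * gap
        if np.linalg.norm(gap) <= tol * np.linalg.norm(M):
            break
    return L, S

# Synthetic component (hypothetical): 20 identical profiles, so the matrix is rank 1.
profile = np.sin(np.linspace(0.0, 2.0 * np.pi, 30))
M = np.tile(profile, (20, 1))
M[7, 12] += 5.0                               # one injected surface defect

L, S = rpca(M)
row, col = np.unravel_index(np.argmax(np.abs(S)), S.shape)
```

Here `L` captures the shared profile shape and the largest entry of `S` marks the injected defect, so thresholding `|S|` gives an anomaly map for the component without any training data.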