Qiuhui Chen

ZSG-IAD: A Multimodal Framework for Zero-Shot Grounded Industrial Anomaly Detection

Apr 20, 2026

Qiuhui Chen, Jiaxiang Song, Shuai Tan, Weimin Zhong

Abstract:Deep learning-based industrial anomaly detectors often behave as black boxes, making it hard to justify decisions with physically meaningful defect evidence. We propose ZSG-IAD, a multimodal vision-language framework for zero-shot grounded industrial anomaly detection. Given RGB images, sensor images, and 3D point clouds, ZSG-IAD generates structured anomaly reports and pixel-level anomaly masks. ZSG-IAD introduces a language-guided two-hop grounding module: (1) anomaly-related sentences select evidence-like latent slots distilled from multimodal features, yielding coarse spatial support; (2) selected slots modulate feature maps via channel-spatial gating and a lightweight decoder to produce fine-grained masks. To improve reliability, we further apply Executable-Rule GRPO with verifiable rewards to promote structured outputs, anomaly-region consistency, and reasoning-conclusion coherence. Experiments across multiple industrial anomaly benchmarks show strong zero-shot performance and more transparent, physically grounded explanations than prior methods. We will release code and annotations to support future research on trustworthy industrial anomaly detection systems.
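
The abstract does not spell out how Executable-Rule GRPO turns rule-based rewards into a training signal. As a rough illustration only, the sketch below shows the group-relative advantage normalization commonly associated with GRPO: several responses are sampled for one input, each is scored by a verifiable reward, and the scores are standardized within the group. The function name, the `eps` constant, and the example rewards are assumptions for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Assumption: population standard deviation is used here; actual
    implementations may use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for four sampled reports on one input
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separate learned value model is needed to form the baseline.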

AD-Reasoning: Multimodal Guideline-Guided Reasoning for Alzheimer's Disease Diagnosis

Mar 25, 2026

Qiuhui Chen, Yushan Deng, Xuancheng Yao, Yi Hong

Abstract:Alzheimer's disease (AD) diagnosis requires integrating neuroimaging with heterogeneous clinical evidence and reasoning under established criteria, yet most multimodal models remain opaque and weakly guideline-aligned. We present AD-Reasoning, a multimodal framework that couples structural MRI with six clinical modalities and a rule-based verifier to generate structured, NIA-AA-consistent diagnoses. AD-Reasoning combines modality-specific encoders, bidirectional cross-attention fusion, and reinforcement fine-tuning with verifiable rewards that enforce output format, guideline evidence coverage, and reasoning-decision consistency. We also release AD-MultiSense, a 10,378-visit multimodal QA dataset with guideline-validated rationales built from ADNI/AIBL. On AD-MultiSense, AD-Reasoning achieves state-of-the-art diagnostic accuracy and produces structured rationales that improve transparency over recent baselines.
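
To make the three reward components named in the abstract concrete (output format, guideline evidence coverage, and reasoning-decision consistency), here is a minimal sketch of a rule-based verifier. The `<think>`/`<answer>` schema, the label set `CN`/`MCI`/`AD`, the `Conclusion:` marker, and the equal weighting are all hypothetical choices made for this example; the abstract does not specify the actual rules.

```python
import re

# Hypothetical output schema: <think>reasoning</think><answer>LABEL</answer>
PATTERN = re.compile(r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>$", re.DOTALL)
LABELS = {"CN", "MCI", "AD"}  # assumed label set, not taken from the paper


def verifiable_reward(output, required_evidence):
    """Score one model output with three executable checks, averaged equally."""
    match = PATTERN.match(output.strip())
    if not match:
        return 0.0  # format check failed: nothing else can be verified
    reasoning, answer = match.group(1), match.group(2).strip()
    if answer not in LABELS:
        return 0.0
    format_score = 1.0

    # Coverage: fraction of required evidence terms mentioned in the reasoning
    lowered = reasoning.lower()
    hits = sum(1 for term in required_evidence if term.lower() in lowered)
    coverage = hits / len(required_evidence) if required_evidence else 1.0

    # Consistency: the conclusion stated in the reasoning must match the answer
    stated = re.search(r"conclusion:\s*(\w+)", reasoning, re.IGNORECASE)
    consistency = 1.0 if stated and stated.group(1).upper() == answer else 0.0

    return (format_score + coverage + consistency) / 3.0


evidence = ["hippocampal atrophy", "MMSE"]
good = "<think>Hippocampal atrophy on MRI and low MMSE. Conclusion: AD</think><answer>AD</answer>"
mixed = "<think>Low MMSE. Conclusion: MCI</think><answer>AD</answer>"
```

Because every check is a deterministic rule, the same output always receives the same score, which is what makes the reward verifiable.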

* ICME 2026

EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease

Feb 22, 2026

Qiuhui Chen, Xuancheng Yao, Zhenglei Zhou, Xinyue Hu, Yi Hong

Abstract:Deep learning models for medical image analysis often act as black boxes, seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports in which each claim is explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods. We will release code and grounding annotations to support future research in trustworthy medical vision-language models.

* Accepted by CVPR 2026

Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models

Sep 11, 2025

Qiuhui Chen, Xuancheng Yao, Huping Ye, Yi Hong

Abstract:Understanding 3D medical image volumes is critical in the medical field, yet existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. To leverage these 2D MLLMs for improved 3D medical image understanding, we propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module. Additionally, our model employs a partial optimal transport based alignment, demonstrating greater tolerance to potential noise in LLM-generated content. Med3DInsight introduces a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations. Extensive experiments demonstrate our state-of-the-art performance on two downstream tasks, i.e., segmentation and classification, across various public datasets with CT and MRI modalities, outperforming current SSL methods. Med3DInsight can be seamlessly integrated into existing 3D medical image understanding networks, potentially enhancing their performance. Our source code, generated datasets, and pre-trained models will be available at https://github.com/Qybc/Med3DInsight.

* Accepted by IEEE Journal of Biomedical and Health Informatics (JBHI)

HoloDx: Knowledge- and Data-Driven Multimodal Diagnosis of Alzheimer's Disease

Apr 27, 2025

Qiuhui Chen, Jintao Wang, Gang Wang, Yi Hong

Abstract:Accurate diagnosis of Alzheimer's disease (AD) requires effectively integrating multimodal data and clinical expertise. However, existing methods often struggle to fully utilize multimodal information and lack structured mechanisms to incorporate dynamic domain knowledge. To address these limitations, we propose HoloDx, a knowledge- and data-driven framework that enhances AD diagnosis by aligning domain knowledge with multimodal clinical data. HoloDx incorporates a knowledge injection module with a knowledge-aware gated cross-attention, allowing the model to dynamically integrate domain-specific insights from both large language models (LLMs) and clinical expertise. Also, a memory injection module with a designed prototypical memory attention enables the model to retain and retrieve subject-specific information, ensuring consistency in decision-making. By jointly leveraging these mechanisms, HoloDx enhances interpretability, improves robustness, and effectively aligns prior knowledge with current subject data. Evaluations on five AD datasets demonstrate that HoloDx outperforms state-of-the-art methods, achieving superior diagnostic accuracy and strong generalization across diverse cohorts. The source code will be released upon publication acceptance.

Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models

Mar 08, 2024

Qiuhui Chen, Huping Ye, Yi Hong

Abstract:Understanding 3D medical image volumes is a critical task in the medical domain. However, existing 3D convolution and transformer-based methods have limited semantic understanding of an image volume and also need a large set of volumes for training. Recent advances in multi-modal large language models (MLLMs) provide a new and promising way to understand images with the help of text descriptions. However, most current MLLMs are designed for 2D natural images. To enhance the 3D medical image understanding with 2D MLLMs, we propose a novel pre-training framework called Med3DInsight, which marries existing 3D image encoders with 2D MLLMs and bridges them via a designed Plane-Slice-Aware Transformer (PSAT) module. Extensive experiments demonstrate our SOTA performance on two downstream segmentation and classification tasks, including three public datasets with CT and MRI modalities and comparison to more than ten baselines. Med3DInsight can be easily integrated into any current 3D medical image understanding network and improves its performance by a good margin.

AliFuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis

Jan 07, 2024

Qiuhui Chen, Yi Hong

Abstract:Medical data collected for making a diagnostic decision are typically multi-modal and provide complementary perspectives of a subject. A computer-aided diagnosis system welcomes multi-modal inputs; however, how to effectively fuse such multi-modal data is a challenging task and attracts a lot of attention in the medical research field. In this paper, we propose a transformer-based framework, called Alifuse, for aligning and fusing multi-modal medical data. Specifically, we convert images and unstructured and structured texts into vision and language tokens, and use intramodal and intermodal attention mechanisms to learn holistic representations of all imaging and non-imaging data for classification. We apply Alifuse to classify Alzheimer's disease and obtain state-of-the-art performance on five public datasets, by outperforming eight baselines. The source code will be available online later.

Volumetric Medical Image Segmentation via Scribble Annotations and Shape Priors

Oct 12, 2023

Qiuhui Chen, Haiying Lyu, Xinyue Hu, Yong Lu, Yi Hong

Abstract:Recently, weakly-supervised image segmentation using weak annotations like scribbles has gained great attention in computer vision and medical image analysis, since such annotations are much easier to obtain compared to time-consuming and labor-intensive labeling at the pixel/voxel level. However, due to a lack of structure supervision on regions of interest (ROIs), existing scribble-based methods suffer from poor boundary localization. Furthermore, most current methods are designed for 2D image segmentation and do not fully leverage volumetric information when applied directly to each image slice. In this paper, we propose a scribble-based volumetric image segmentation method, Scribble2D5, which tackles 3D anisotropic image segmentation and aims to improve its boundary prediction. To achieve this, we augment a 2.5D attention UNet with a proposed label propagation module to extend semantic information from scribbles and use a combination of static and active boundary prediction to learn the ROI's boundary and regularize its shape. Also, we propose an optional add-on component, which incorporates the shape prior information from unpaired segmentation masks to further improve model accuracy. Extensive experiments on three public datasets and one private dataset demonstrate that Scribble2D5 achieves state-of-the-art performance on volumetric image segmentation using scribbles and, if available, shape priors.

* arXiv admin note: text overlap with arXiv:2205.06779

MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts

May 18, 2023

Qiuhui Chen, Xinyue Hu, Zirui Wang, Yi Hong

Abstract:Vision-language pre-training (VLP) models have been demonstrated to be effective in many computer vision applications. In this paper, we consider developing a VLP model in the medical domain for making computer-aided diagnoses (CAD) based on image scans and text descriptions in electronic health records, as done in practice. To achieve our goal, we present a lightweight CAD system, MedBLIP, a new paradigm for bootstrapping VLP from off-the-shelf frozen pre-trained image encoders and frozen large language models. We design a MedQFormer module to bridge the gap between 3D medical images and 2D pre-trained image encoders and language models. To evaluate the effectiveness of MedBLIP, we collect more than 30,000 image volumes from five public Alzheimer's disease (AD) datasets, i.e., ADNI, NACC, OASIS, AIBL, and MIRIAD. On this dataset, the largest AD dataset we know of, our model achieves SOTA performance on the zero-shot classification of healthy, mild cognitive impairment (MCI), and AD subjects, and demonstrates its capability for medical visual question answering (VQA). The code and pre-trained models are available online: https://github.com/Qybc/MedBLIP.

* 11 pages, 3 figures

Longformer: Longitudinal Transformer for Alzheimer's Disease Classification with Structural MRIs

Feb 02, 2023

Qiuhui Chen, Yi Hong

Abstract:Structural magnetic resonance imaging (sMRI) is widely used for diagnosing neurological brain diseases, and longitudinal MRIs are often collected to monitor and capture disease progression, as is done clinically when diagnosing Alzheimer's disease (AD). However, most current methods neglect AD's progressive nature and only take a single sMRI for recognizing AD. In this paper, we consider the problem of leveraging the longitudinal MRIs of a subject for AD identification. To capture longitudinal changes in sMRIs, we propose a novel model, Longformer, a spatiotemporal transformer network that performs attention mechanisms spatially on sMRIs at each time point and integrates brain region features over time to obtain longitudinal embeddings for classification. Our Longformer achieves state-of-the-art performance on two binary classification tasks of separating different stages of AD using the ADNI dataset. Our source code is available at https://github.com/Qybc/LongFormer.
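
The two-stage idea in the abstract (spatial attention within each time point, then integration over time) can be sketched in a few lines. The version below is only a toy: it uses single-query scaled dot-product attention over per-region feature vectors and a plain mean over visits. The query vector, the mean pooling, and all function names are assumptions for illustration and do not reflect the architecture in the linked repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention over region features."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def longitudinal_embedding(visits, query):
    """Attend over regions within each visit, then average across visits."""
    per_visit = [attend(query, regions, regions) for regions in visits]
    n = len(per_visit)
    return [sum(v[j] for v in per_visit) / n for j in range(len(per_visit[0]))]


# Two visits, two brain regions each, 2-dimensional toy features
visits = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
embedding = longitudinal_embedding(visits, query=[1.0, 0.0])
```

In this toy input the regions within a visit are identical, so each visit embedding equals its region feature and the final embedding is the average of the two visits.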

* 9 pages, 5 figures
