Parameter-efficient fine-tuning (PEFT) is proposed as a cost-effective way to transfer pre-trained models to downstream tasks, avoiding the high cost of updating entire large-scale pre-trained models (LPMs). In this work, we present Fine-grained Prompt Tuning (FPT), a novel PEFT method for medical image classification. FPT significantly reduces memory consumption compared to other PEFT methods, especially in high-resolution contexts. To achieve this, we first freeze the weights of the LPM and construct a learnable lightweight side network. The frozen LPM takes high-resolution images as input to extract fine-grained features, while the side network is fed low-resolution images to reduce memory usage. To allow the side network to access pre-trained knowledge, we introduce fine-grained prompts that summarize information from the LPM through a fusion module. Important tokens selection and preloading techniques are employed to further reduce training cost and memory requirements. We evaluate FPT on four medical datasets with varying sizes, modalities, and complexities. Experimental results demonstrate that FPT achieves comparable performance to fine-tuning the entire LPM while using only 1.8% of the learnable parameters and 13% of the memory costs of an encoder ViT-B model with a 512 x 512 input resolution.
Contrastive learning based vision-language joint pre-training has emerged as a successful representation learning strategy. In this paper, we present a prototype representation learning framework incorporating both global and local alignment between medical images and reports. In contrast to standard global multi-modality alignment methods, we employ a local alignment module for fine-grained representation. Furthermore, a cross-modality conditional reconstruction module is designed to interchange information across modalities in the training phase by reconstructing masked images and reports. For reconstructing long reports, a sentence-wise prototype memory bank is constructed, enabling the network to focus on low-level localized visual and high-level clinical linguistic features. Additionally, a non-auto-regressive generation paradigm is proposed for reconstructing non-sequential reports. Experimental results on five downstream tasks, including supervised classification, zero-shot classification, image-to-text retrieval, semantic segmentation, and object detection, show the proposed method outperforms other state-of-the-art methods across multiple datasets and under different dataset size settings. The code is available at https://github.com/QtacierP/PRIOR.
The quality of a fundus image can be compromised by numerous factors, many of which are challenging to be appropriately and mathematically modeled. In this paper, we introduce a novel diffusion model based framework, named Learning Enhancement from Degradation (LED), for enhancing fundus images. Specifically, we first adopt a data-driven degradation framework to learn degradation mappings from unpaired high-quality to low-quality images. We then apply a conditional diffusion model to learn the inverse enhancement process in a paired manner. The proposed LED is able to output enhancement results that maintain clinically important features with better clarity. Moreover, in the inference phase, LED can be easily and effectively integrated with any existing fundus image enhancement framework. We evaluate the proposed LED on several downstream tasks with respect to various clinically-relevant metrics, successfully demonstrating its superiority over existing state-of-the-art methods both quantitatively and qualitatively. The source code is available at https://github.com/QtacierP/LED.
Medical image quality assessment (MIQA) is a vital prerequisite in various medical image analysis applications. Most existing MIQA algorithms are fully supervised that request a large amount of annotated data. However, annotating medical images is time-consuming and labor-intensive. In this paper, we propose an unsupervised anomaly-aware framework with test-time clustering for optical coherence tomography angiography (OCTA) image quality assessment in a setting wherein only a set of high-quality samples are accessible in the training phase. Specifically, a feature-embedding-based low-quality representation module is proposed to quantify the quality of OCTA images and then to discriminate between outstanding quality and non-outstanding quality. Within the non-outstanding quality class, to further distinguish gradable images from ungradable ones, we perform dimension reduction and clustering of multi-scale image features extracted by the trained OCTA quality representation network. Extensive experiments are conducted on one publicly accessible dataset sOCTA-3*3-10k, with superiority of our proposed framework being successfully established.
Convolutional neural networks have been widely applied to medical image segmentation and have achieved considerable performance. However, the performance may be significantly affected by the domain gap between training data (source domain) and testing data (target domain). To address this issue, we propose a data manipulation based domain generalization method, called Automated Augmentation for Domain Generalization (AADG). Our AADG framework can effectively sample data augmentation policies that generate novel domains and diversify the training set from an appropriate search space. Specifically, we introduce a novel proxy task maximizing the diversity among multiple augmented novel domains as measured by the Sinkhorn distance in a unit sphere space, making automated augmentation tractable. Adversarial training and deep reinforcement learning are employed to efficiently search the objectives. Quantitative and qualitative experiments on 11 publicly-accessible fundus image datasets (four for retinal vessel segmentation, four for optic disc and cup (OD/OC) segmentation and three for retinal lesion segmentation) are comprehensively performed. Two OCTA datasets for retinal vasculature segmentation are further involved to validate cross-modality generalization. Our proposed AADG exhibits state-of-the-art generalization performance and outperforms existing approaches by considerable margins on retinal vessel, OD/OC and lesion segmentation tasks. The learned policies are empirically validated to be model-agnostic and can transfer well to other models. The source code is available at https://github.com/CRazorback/AADG.
Due to the high cost of manually annotating medical images, especially for large-scale datasets, anomaly detection has been explored through training models with only normal data. Lacking prior knowledge of true anomalies is the main reason for the limited application of previous anomaly detection methods, especially in the medical image analysis realm. In this work, we propose a one-shot anomaly detection framework, namely LesionPaste, that utilizes true anomalies from a single annotated sample and synthesizes artificial anomalous samples for anomaly detection. First, a lesion bank is constructed by applying augmentation to randomly selected lesion patches. Then, MixUp is adopted to paste patches from the lesion bank at random positions in normal images to synthesize anomalous samples for training. Finally, a classification network is trained using the synthetic abnormal samples and the true normal data. Extensive experiments are conducted on two publicly-available medical image datasets with different types of abnormalities. On both datasets, our proposed LesionPaste largely outperforms several state-of-the-art unsupervised and semi-supervised anomaly detection methods, and is on a par with the fully-supervised counterpart. To note, LesionPaste is even better than the fully-supervised method in detecting early-stage diabetic retinopathy.
Although deep learning based diabetic retinopathy (DR) classification methods typically benefit from well-designed architectures of convolutional neural networks, the training setting also has a non-negligible impact on the prediction performance. The training setting includes various interdependent components, such as objective function, data sampling strategy and data augmentation approach. To identify the key components in a standard deep learning framework (ResNet-50) for DR grading, we systematically analyze the impact of several major components. Extensive experiments are conducted on a publicly-available dataset EyePACS. We demonstrate that (1) the ResNet-50 framework for DR grading is sensitive to input resolution, objective function, and composition of data augmentation, (2) using mean square error as the loss function can effectively improve the performance with respect to a task-specific evaluation metric, namely the quadratically-weighted Kappa, (3) utilizing eye pairs boosts the performance of DR grading and (4) using data resampling to address the problem of imbalanced data distribution in EyePACS hurts the performance. Based on these observations and an optimal combination of the investigated components, our framework, without any specialized network design, achieves the state-of-the-art result (0.8631 for Kappa) on the EyePACS test set (a total of 42670 fundus images) with only image-level labels. Our codes and pre-trained model are available at https://github.com/YijinHuang/pytorch-classification
Manually annotating medical images is extremely expensive, especially for large-scale datasets. Self-supervised contrastive learning has been explored to learn feature representations from unlabeled images. However, unlike natural images, the application of contrastive learning to medical images is relatively limited. In this work, we propose a self-supervised framework, namely lesion-based contrastive learning for automated diabetic retinopathy (DR) grading. Instead of taking entire images as the input in the common contrastive learning scheme, lesion patches are employed to encourage the feature extractor to learn representations that are highly discriminative for DR grading. We also investigate different data augmentation operations in defining our contrastive prediction task. Extensive experiments are conducted on the publicly-accessible dataset EyePACS, demonstrating that our proposed framework performs outstandingly on DR grading in terms of both linear evaluation and transfer capacity evaluation.
Optical coherence tomography angiography (OCTA) is a novel non-invasive imaging technique that allows visualizations of vasculature and foveal avascular zone (FAZ) across retinal layers. Clinical researches suggest that the morphology and contour irregularity of FAZ are important biomarkers of various ocular pathologies. Therefore, precise segmentation of FAZ has great clinical interest. Also, there is no existing research reporting that FAZ features can improve the performance of deep diagnostic classification networks. In this paper, we propose a novel multi-level boundary shape and distance aware joint learning framework, named BSDA-Net, for FAZ segmentation and diagnostic classification from OCTA images. Two auxiliary branches, namely boundary heatmap regression and signed distance map reconstruction branches, are constructed in addition to the segmentation branch to improve the segmentation performance, resulting in more accurate FAZ contours and fewer outliers. Moreover, both low-level and high-level features from the aforementioned three branches, including shape, size, boundary, and signed directional distance map of FAZ, are fused hierarchically with features from the diagnostic classifier. Through extensive experiments, the proposed BSDA-Net is found to yield state-of-the-art segmentation and classification results on the OCTA-500, OCTAGON, and FAZID datasets.