Interpretability of deep learning (DL) systems is gaining attention in medical imaging to increase experts' trust in the obtained predictions and facilitate their integration in clinical settings. We propose a deep visualization method to generate interpretability of DL classification tasks in medical imaging by means of visual evidence augmentation. The proposed method iteratively unveils abnormalities based on the prediction of a classifier trained only with image-level labels. For each image, initial visual evidence of the prediction is extracted with a given visual attribution technique. This provides localization of abnormalities that are then removed through selective inpainting. We iteratively apply this procedure until the system considers the image as normal. This yields augmented visual evidence, including less discriminative lesions which were not detected at first but should be considered for final diagnosis. We apply the method to grading of two retinal diseases in color fundus images: diabetic retinopathy (DR) and age-related macular degeneration (AMD). We evaluate the generated visual evidence and the performance of weakly-supervised localization of different types of DR and AMD abnormalities, both qualitatively and quantitatively. We show that the augmented visual evidence of the predictions highlights the biomarkers considered by the experts for diagnosis and improves the final localization performance. It results in a relative increase of 11.2$\pm$2.0% per image regarding average sensitivity per average 10 false positives, when applied to different classification tasks, visual attribution techniques and network architectures. This makes the proposed method a useful tool for exhaustive visual support of DL classifiers in medical imaging.
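The iterative procedure described above (attribute, inpaint, re-classify until the image is considered normal) can be illustrated with a minimal sketch. The function `augment_visual_evidence` and the three toy stand-ins below are hypothetical names introduced here, not the authors' implementation: in the actual method the classifier is a trained network, the attribution comes from a visual attribution technique, and the removal step is selective inpainting. Here a 1-D list of intensities plays the role of the image.

```python
# Minimal sketch of iterative visual evidence augmentation.
# All names and the toy components are illustrative assumptions.

BACKGROUND = 0.1  # assumed "healthy" intensity used by the toy inpainting


def augment_visual_evidence(image, is_abnormal, attribute, inpaint, max_iters=10):
    """Accumulate attributed regions until the classifier calls the image normal."""
    evidence = set()
    current = list(image)
    for _ in range(max_iters):
        if not is_abnormal(current):
            break  # the system now considers the image normal
        region = attribute(current)  # evidence for the current prediction
        if not region:
            break  # nothing left to attribute; avoid looping forever
        evidence |= region
        current = inpaint(current, region)  # remove the detected abnormality
    return evidence, current


def toy_is_abnormal(image):
    # Toy classifier: any pixel above 0.5 makes the image abnormal.
    return max(image) > 0.5


def toy_attribute(image):
    # Toy attribution: only the most discriminative (brightest) pixels.
    peak = max(image)
    return {i for i, value in enumerate(image) if value == peak}


def toy_inpaint(image, region):
    # Toy inpainting: replace the attributed pixels with background.
    return [BACKGROUND if i in region else value for i, value in enumerate(image)]


image = [0.1, 0.9, 0.1, 0.7, 0.1, 0.6]
evidence, final_image = augment_visual_evidence(
    image, toy_is_abnormal, toy_attribute, toy_inpaint
)
print(sorted(evidence))  # [1, 3, 5]
```

A single attribution pass on this toy image would return only pixel 1; the loop also recovers the less discriminative pixels 3 and 5, which mirrors the idea of augmented visual evidence.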
The number of biomedical image analysis challenges organized per year is steadily increasing. These international competitions have the purpose of benchmarking algorithms on common data sets, typically to identify the best method for a given problem. Recent research, however, revealed that common practice related to challenge reporting does not allow for adequate interpretation and reproducibility of results. To address the discrepancy between the impact of challenges and the quality (control), the Biomedical Image Analysis ChallengeS (BIAS) initiative developed a set of recommendations for the reporting of challenges. The BIAS statement aims to improve the transparency of the reporting of a biomedical image analysis challenge regardless of field of application, image modality or task category assessed. This article describes how the BIAS statement was developed and presents a checklist which authors of biomedical image analysis challenges are encouraged to include in their submission when submitting a paper on a challenge for review. The purpose of the checklist is to standardize and facilitate the review process and raise interpretability and reproducibility of challenge results by making relevant information explicit.
Chronic Obstructive Pulmonary Disease (COPD) is a leading cause of morbidity and mortality. While COPD diagnosis is based on lung function tests, early stages and progression of different aspects of the disease can be visible and quantitatively assessed on computed tomography (CT) scans. Many studies have been published that quantify imaging biomarkers related to COPD. In this paper we present a convolutional neural network that directly computes visual emphysema scores and predicts the outcome of lung function tests for 195 CT scans from the COPDGene study. Contrary to previous work, the proposed method does not encode any specific prior knowledge about what to quantify, but it is trained end-to-end with a set of 1424 CT scans for which the output parameters were available. The network provided state-of-the-art results for these tasks: Visual emphysema scores are comparable to those assessed by trained human observers; COPD diagnosis from estimated lung function reaches an area under the ROC curve of 0.94, outperforming prior art. The method is easily generalizable to other situations where information from whole scans needs to be summarized in single quantities.
We present a novel multilevel approach for deep learning based image registration. Recently published deep learning based registration methods have shown promising results for a wide range of tasks. However, these algorithms are still limited to relatively small deformations. Our method addresses this shortcoming by introducing a multilevel framework, which computes deformation fields on different scales, similar to conventional methods. Thereby, a coarse-level alignment is obtained first, which is subsequently improved on finer levels. We demonstrate our method on the complex task of inhale-to-exhale lung registration. We show that the use of a deep learning multilevel approach leads to significantly better registration results.
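The coarse-to-fine idea can be sketched on a deliberately simplified problem. The example below is an assumption-laden illustration, not the authors' method: it estimates a single 1-D translation instead of a deformation field, and it swaps the deep network for an exhaustive search with a small radius at each level. The helper names (`downsample`, `best_shift`, `multilevel_shift`) are hypothetical.

```python
# Coarse-to-fine (multilevel) alignment sketch on 1-D signals.
# A small exhaustive search stands in for the per-level registration network.


def downsample(signal, factor):
    """Average non-overlapping blocks of `factor` samples."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal) - factor + 1, factor)]


def best_shift(fixed, moving, center, radius):
    """Return the shift s in [center - radius, center + radius] that minimizes
    the mean squared difference between fixed[i] and moving[i + s]."""
    n = len(fixed)

    def cost(shift):
        pairs = [(fixed[i], moving[i + shift])
                 for i in range(n) if 0 <= i + shift < n]
        if not pairs:
            return float("inf")
        return sum((f - m) ** 2 for f, m in pairs) / len(pairs)

    return min(range(center - radius, center + radius + 1), key=cost)


def multilevel_shift(fixed, moving, levels=3, radius=3):
    """Estimate the shift on the coarsest level first, then refine it."""
    shift = 0
    for level in reversed(range(levels)):
        factor = 2 ** level
        shift = best_shift(downsample(fixed, factor),
                           downsample(moving, factor), shift, radius)
        if level > 0:
            shift *= 2  # propagate the estimate to the next finer level
    return shift


# Triangular bump, displaced by 12 samples in the moving signal.
fixed = [max(0, 10 - abs(i - 40)) for i in range(128)]
moving = [max(0, 10 - abs(i - 52)) for i in range(128)]

print(multilevel_shift(fixed, moving))  # 12
```

With the same search radius of 3, a single full-resolution search cannot reach a displacement of 12 samples, whereas the three-level scheme recovers it because the displacement is only 3 samples on the coarsest level.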
Purpose: To develop and validate a deep learning model for automatic segmentation of geographic atrophy (GA) in color fundus images (CFIs) and its application to study growth rate of GA. Participants: 409 CFIs of 238 eyes with GA from the Rotterdam Study (RS) and the Blue Mountain Eye Study (BMES) for model development, and 5,379 CFIs of 625 eyes from the Age-Related Eye Disease Study (AREDS) for analysis of GA growth rate. Methods: A deep learning model based on an ensemble of encoder-decoder architectures was implemented and optimized for the segmentation of GA in CFIs. Four experienced graders delineated GA in CFIs from RS and BMES. These manual delineations were used to evaluate the segmentation model using 5-fold cross-validation. The model was further applied to CFIs from the AREDS to study the growth rate of GA. Linear regression analysis was used to study associations between structural biomarkers at baseline and GA growth rate. A general estimate of the progression of GA area over time was made by combining growth rates of all eyes with GA from the AREDS set. Results: The model obtained an average Dice coefficient of 0.72 $\pm$ 0.26 on the BMES and RS. An intraclass correlation coefficient of 0.83 was reached between the automatically estimated GA area and the graders' consensus measures. Eight automatically calculated structural biomarkers (area, filled area, convex area, convex solidity, eccentricity, roundness, foveal involvement and perimeter) were significantly associated with growth rate. Combining all growth rates indicated that GA area grows quadratically up to an area of around 12 mm$^{2}$, after which growth rate stabilizes or decreases. Conclusion: The presented deep learning model allowed for fully automatic and robust segmentation of GA in CFIs. These segmentations can be used to extract structural characteristics of GA that predict its growth rate.
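For reference, the Dice coefficient used to evaluate the segmentations is $2|A \cap B| / (|A| + |B|)$ for a predicted mask $A$ and a reference mask $B$. The sketch below computes it on flattened binary masks; the function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions made for this illustration.

```python
# Dice coefficient between two flattened binary masks (lists of 0/1).


def dice_coefficient(prediction, reference):
    if len(prediction) != len(reference):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, r in zip(prediction, reference) if p and r)
    total = sum(1 for p in prediction if p) + sum(1 for r in reference if r)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 0, 0]
print(dice_coefficient(predicted_mask, reference_mask))  # 2 * 1 / (2 + 1) = 0.666...
```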
The Gleason score is the most important prognostic marker for prostate cancer patients but suffers from significant inter-observer variability. We developed a fully automated deep learning system to grade prostate biopsies. The system was developed using 5834 biopsies from 1243 patients. A semi-automatic labeling technique was used to circumvent the need for full manual annotation by pathologists. The developed system achieved a high agreement with the reference standard. In a separate observer experiment, the deep learning system outperformed 10 out of 15 pathologists. The system has the potential to improve prostate cancer prognostics by acting as a first or second reader.
In this work, we propose a method to reject out-of-distribution samples which can be adapted to any network architecture and requires no additional training data. Publicly available chest x-ray data (38,353 images) are used to train a standard ResNet-50 model to detect emphysema. Feature activations of intermediate layers are used as descriptors defining the training data distribution. A novel metric, FRODO, is computed from the Mahalanobis distance of a new test sample to the training data distribution. The method is tested using a held-out test dataset of 21,176 chest x-rays (in-distribution) and a set of 14,821 out-of-distribution x-ray images of incorrect orientation or anatomy. In classifying test samples as in- or out-of-distribution, our method achieves an AUC score of 0.99.
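The distance-based rejection idea can be sketched as follows. This is not the FRODO metric itself: the example assumes a diagonal covariance (a simplification; the full Mahalanobis distance uses the inverse of the full covariance matrix), uses made-up two-dimensional feature vectors in place of intermediate-layer activations, and introduces the hypothetical helpers `fit_diagonal_gaussian` and `mahalanobis_diagonal`.

```python
import math

# Sketch of distance-based out-of-distribution scoring.
# Assumption: diagonal covariance, toy 2-D "features".


def fit_diagonal_gaussian(features):
    """Per-dimension mean and unbiased variance of the training features."""
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in features) / (n - 1)
                 for d in range(dims)]
    return means, variances


def mahalanobis_diagonal(x, means, variances, eps=1e-12):
    """Mahalanobis distance under a diagonal covariance assumption."""
    return math.sqrt(sum((xi - m) ** 2 / (v + eps)
                         for xi, m, v in zip(x, means, variances)))


train_features = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]]
means, variances = fit_diagonal_gaussian(train_features)

in_distribution_score = mahalanobis_diagonal([0.5, 0.5], means, variances)
out_of_distribution_score = mahalanobis_diagonal([5.0, 5.0], means, variances)
print(in_distribution_score < out_of_distribution_score)  # True
```

A sample would then be rejected when its distance exceeds a threshold chosen on held-out data; the threshold choice is outside the scope of this sketch.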
Purpose: To validate the performance of a commercially available, CE-certified deep learning (DL) system, RetCAD v.1.3.0 (Thirona, Nijmegen, The Netherlands), for the joint automatic detection of diabetic retinopathy (DR) and age-related macular degeneration (AMD) in color fundus (CF) images on a dataset with mixed presence of eye diseases. Methods: Evaluation of joint detection of referable DR and AMD was performed on a DR-AMD dataset with 600 images acquired during routine clinical practice, containing referable and non-referable cases of both diseases. Each image was graded for DR and AMD by an experienced ophthalmologist to establish the reference standard (RS), and by four independent observers for comparison with human performance. Validation was further assessed on Messidor (1200 images) for individual identification of referable DR, and the Age-Related Eye Disease Study (AREDS) dataset (133821 images) for referable AMD, against the corresponding RS. Results: Regarding joint validation on the DR-AMD dataset, the system achieved an area under the ROC curve (AUC) of 95.1% for detection of referable DR (SE=90.1%, SP=90.6%). For referable AMD, the AUC was 94.9% (SE=91.8%, SP=87.5%). Average human performance for DR was SE=61.5% and SP=97.8%; for AMD, SE=76.5% and SP=96.1%. Regarding detection of referable DR in Messidor, AUC was 97.5% (SE=92.0%, SP=92.1%); for referable AMD in AREDS, AUC was 92.7% (SE=85.8%, SP=86.0%). Conclusions: The validated system performs comparably to human experts at simultaneous detection of DR and AMD. This shows that DL systems can facilitate access to joint screening of eye diseases and become a quick and reliable support for ophthalmological experts.
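The sensitivity (SE) and specificity (SP) figures reported above follow from comparing binary referable/non-referable decisions against the reference standard. A minimal sketch, using made-up labels rather than study data and a hypothetical function name `sensitivity_specificity`:

```python
# Sensitivity and specificity from binary decisions (1 = referable).


def sensitivity_specificity(predictions, references):
    tp = sum(1 for p, r in zip(predictions, references) if p == 1 and r == 1)
    fn = sum(1 for p, r in zip(predictions, references) if p == 0 and r == 1)
    tn = sum(1 for p, r in zip(predictions, references) if p == 0 and r == 0)
    fp = sum(1 for p, r in zip(predictions, references) if p == 1 and r == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference")
    sensitivity = tp / (tp + fn)  # fraction of referable cases detected
    specificity = tn / (tn + fp)  # fraction of non-referable cases cleared
    return sensitivity, specificity


# Toy labels for illustration only.
reference_labels = [1, 1, 1, 0, 0, 0, 0, 1]
system_decisions = [1, 1, 0, 0, 0, 1, 0, 1]
print(sensitivity_specificity(system_decisions, reference_labels))  # (0.75, 0.75)
```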
There is a growing interest in the automated analysis of chest X-ray (CXR) images as a sensitive and inexpensive means of screening susceptible populations for pulmonary tuberculosis. In this work we evaluate the latest version of CAD4TB, a software platform designed for this purpose. Version 6 of CAD4TB was released in 2018 and is here tested on an independent dataset of 5565 CXR images with GeneXpert (Xpert) sputum test results available (854 Xpert positive subjects). A subset of 500 subjects (50% Xpert positive) was reviewed and annotated by 5 expert observers independently to obtain a radiological reference standard. The latest version of CAD4TB is found to outperform all previous versions in terms of area under the receiver operating characteristic (ROC) curve with respect to both Xpert and radiological reference standards. Improvements with respect to Xpert are most apparent at high sensitivity levels, with a specificity of 76% obtained at 90% sensitivity. When compared with the radiological reference standard, CAD4TB v6 also outperformed previous versions by a considerable margin and achieved 98% specificity at 90% sensitivity. No substantial difference was found between the performance of CAD4TB v6 and any of the various expert observers against the Xpert reference standard. A cost and efficiency analysis on this dataset demonstrates that in a standard clinical situation, operating at 90% sensitivity, users of CAD4TB v6 can process 132 subjects per day at an average cost per screen of \$5.95 per subject, while users of version 3 process only 85 subjects per day at a cost of \$8.41 per subject. At all tested operating points version 6 is shown to be more efficient and cost effective than any other version.
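Reporting "specificity at 90% sensitivity" amounts to choosing an operating point on the ROC curve. The sketch below shows one plausible way to do this, assuming continuous abnormality scores where higher means more suspicious: it picks the highest threshold whose sensitivity reaches the target and reports the specificity there. The function name and the toy scores are assumptions for illustration, not CAD4TB output.

```python
# Specificity at a fixed sensitivity target, from scores and binary labels.


def specificity_at_sensitivity(scores, labels, target=0.9):
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    # Lower the threshold step by step until enough positives are flagged.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        if tp / positives >= target:
            tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
            return threshold, tn / negatives
    raise ValueError("target sensitivity not reachable")


# Toy scores for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(specificity_at_sensitivity(toy_scores, toy_labels, target=0.9))  # (0.4, 0.75)
```

With only four positive cases in the toy data, reaching 90% sensitivity requires detecting all four, which forces the threshold down to 0.4 and leaves three of the four negatives correctly cleared.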
Semantic segmentation of medical images aims to associate each pixel in a medical image with a label, without human initialization. The success of semantic segmentation algorithms is contingent on the availability of high-quality imaging data with corresponding labels provided by experts. We sought to create a large collection of annotated medical image datasets of various clinically relevant anatomies available under an open source license to facilitate the development of semantic segmentation algorithms. Such a resource would allow: 1) objective assessment of general-purpose segmentation methods through comprehensive benchmarking and 2) open and free access to medical image data for any researcher interested in the problem domain. Through a multi-institutional effort, we generated a large, curated dataset representative of several highly variable segmentation tasks that was used in a crowd-sourced challenge, the Medical Segmentation Decathlon, held during the 2018 Medical Image Computing and Computer Aided Interventions Conference in Granada, Spain. Here, we describe these ten labeled image datasets so that these data may be effectively reused by the research community.