Mobarakol Islam

SAM Meets Robotic Surgery: An Empirical Study on Generalization, Robustness and Adaptation

Aug 14, 2023
An Wang, Mobarakol Islam, Mengya Xu, Yang Zhang, Hongliang Ren

The Segment Anything Model (SAM) serves as a foundation model for semantic segmentation and demonstrates remarkable generalization across a wide range of downstream scenarios. In this empirical study, we examine SAM's robustness and zero-shot generalizability in the field of robotic surgery. We comprehensively explore different scenarios, including prompted and unprompted settings, bounding-box and point-based prompts, and generalization under corruptions and perturbations at five severity levels. Additionally, we compare the performance of SAM with state-of-the-art supervised models. We conduct all experiments on two well-known robotic instrument segmentation datasets from the MICCAI EndoVis 2017 and 2018 challenges. Our extensive evaluation reveals that although SAM shows remarkable zero-shot generalization with bounding-box prompts, it struggles to segment whole instruments with point-based prompts and in unprompted settings. Furthermore, our qualitative figures show that the model either fails to predict certain parts of the instrument mask (e.g., jaws, wrist) or assigns parts of the instrument to the wrong class when instruments overlap within the same bounding box or under point-based prompts. SAM also struggles to identify instruments in complex surgical scenes characterized by blood, reflection, blur, and shade, and it is not robust enough to maintain high performance under various forms of data corruption. We also attempt to fine-tune SAM using Low-rank Adaptation (LoRA) and propose SurgicalSAM, which is capable of class-wise mask prediction without prompts. Therefore, we argue that, without further domain-specific fine-tuning, SAM is not ready for downstream surgical tasks.
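
As context for the bounding-box prompting setup described above, here is a minimal sketch using the public segment-anything package; the checkpoint path, frame path, and box coordinates are placeholders, and this is an illustrative zero-shot call rather than the paper's evaluation pipeline.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Load one surgical frame as an HxWx3 uint8 RGB array (path is a placeholder).
image = np.array(Image.open("endovis_frame.png").convert("RGB"))
predictor.set_image(image)

# Zero-shot segmentation with a bounding-box prompt around one instrument.
box = np.array([100, 150, 480, 620])  # [x0, y0, x1, y1], placeholder coordinates
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
instrument_mask = masks[0]  # boolean HxW mask for the prompted instrument
```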

* Accepted as Oral Presentation at MedAGI Workshop - MICCAI 2023 1st International Workshop on Foundation Models for General Medical AI. arXiv admin note: substantial text overlap with arXiv:2304.14674 

Robustness Stress Testing in Medical Image Classification

Aug 14, 2023
Mobarakol Islam, Zeju Li, Ben Glocker

Deep neural networks have shown impressive performance for image-based disease detection. Performance is commonly evaluated through clinical validation on independent test sets to demonstrate clinically acceptable accuracy. Reporting good performance metrics on test sets, however, is not always a sufficient indication of the generalizability and robustness of an algorithm. In particular, when the test data is drawn from the same distribution as the training data, the i.i.d. test set performance can be an unreliable estimate of the accuracy on new data. In this paper, we employ stress testing to assess model robustness and subgroup performance disparities in disease detection models. We design progressive stress testing using five different bidirectional and unidirectional image perturbations with six different severity levels. As a use case, we apply stress tests to measure the robustness of disease detection models for chest X-ray and skin lesion images, and demonstrate the importance of studying class- and domain-specific model behaviour. Our experiments indicate that some models may yield more robust and equitable performance than others. We also find that pretraining characteristics play an important role in downstream robustness. We conclude that progressive stress testing is a viable and important tool that should become standard practice in the clinical validation of image-based disease detection models.
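
As a rough illustration of the progressive stress-testing idea (not the paper's exact perturbation set or severity mapping), the sketch below sweeps a single synthetic perturbation over six severity levels and records accuracy at each level; all names and the severity-to-noise mapping are assumptions.

```python
import numpy as np

def gaussian_noise(images, severity):
    """Add zero-mean Gaussian noise whose std grows with severity (1..6)."""
    sigma = 0.02 * severity  # assumed severity-to-std mapping
    noisy = images + np.random.normal(0.0, sigma, images.shape)
    return np.clip(noisy, 0.0, 1.0)

def stress_test(model_predict, images, labels, severities=range(1, 7)):
    """Evaluate a classifier under progressively stronger perturbations."""
    results = {}
    for s in severities:
        preds = model_predict(gaussian_noise(images, s))
        results[s] = float((preds == labels).mean())  # accuracy at severity s
    return results
```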

* 11 pages 

Landmark Detection using Transformer Toward Robot-assisted Nasal Airway Intubation

Aug 05, 2023
Tianhang Liu, Hechen Li, Long Bai, Yanan Wu, An Wang, Mobarakol Islam, Hongliang Ren

Robot-assisted airway intubation requires high accuracy in locating targets and organs. Two vital landmarks, the nostrils and the glottis, can be detected during intubation to guide the stages of nasal intubation. Automated landmark detection can provide accurate localization and quantitative evaluation. The Detection Transformer (DETR) has led object detectors to a new paradigm with long-range dependency modeling. However, DETR requires many iterations to converge and does not perform well at detecting small objects. This paper proposes a transformer-based landmark detection solution built on Deformable DETR with a semantic-aligned matching module for detecting landmarks in robot-assisted intubation. The semantics aligner effectively aligns the semantics of object queries and image features in the same embedding space using the most discriminative features. To evaluate the performance of our solution, we utilize a publicly accessible glottis dataset and automatically annotate a nostril detection dataset. The experimental results demonstrate our competitive detection accuracy. Our code is publicly accessible.
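
The query/feature alignment idea can be sketched as a small attention module that projects object queries and flattened image features into a shared embedding space; this is an illustrative PyTorch module under assumed dimensions, not the paper's semantic-aligned matching implementation.

```python
import torch
import torch.nn as nn

class SemanticsAligner(nn.Module):
    """Illustrative aligner: project object queries and image features into a
    shared space and refresh each query with attention-weighted features."""
    def __init__(self, d_model=256):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.f_proj = nn.Linear(d_model, d_model)

    def forward(self, queries, features):
        # queries: (B, N, d_model) object queries; features: (B, HW, d_model)
        q = self.q_proj(queries)
        f = self.f_proj(features)
        attn = torch.softmax(q @ f.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return queries + attn @ f  # residual update with aligned semantics
```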

* ICBIR 2023 (Best Student Paper Award). Code availability: https://github.com/ConorLTH/airway_intubation_landmarks_detection 

Revisiting Distillation for Continual Learning on Visual Question Localized-Answering in Robotic Surgery

Jul 22, 2023
Long Bai, Mobarakol Islam, Hongliang Ren

The visual-question localized-answering (VQLA) system can serve as a knowledgeable assistant in surgical education. Beyond providing text-based answers, a VQLA system can highlight the region of interest for better surgical scene understanding. However, deep neural networks (DNNs) suffer from catastrophic forgetting when learning new knowledge: when DNNs learn incremental classes or tasks, their performance on old tasks drops dramatically. Furthermore, due to medical data privacy and licensing issues, it is often difficult to access old data when updating continual learning (CL) models. Therefore, we develop a non-exemplar continual surgical VQLA framework to explore and balance the rigidity-plasticity trade-off of DNNs in a sequential learning paradigm. We revisit the distillation loss in CL tasks and propose rigidity-plasticity-aware distillation (RP-Dist) and self-calibrated heterogeneous distillation (SH-Dist) to preserve old knowledge. The weight aligning (WA) technique is also integrated to adjust the weight bias between old and new tasks. We further establish a CL framework on three public surgical datasets, in surgical settings with overlapping classes between old and new VQLA tasks. With extensive experiments, we demonstrate that our proposed method reconciles learning and forgetting on continual surgical VQLA better than conventional CL methods. Our code is publicly accessible.
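
The distillation component can be illustrated generically as a temperature-scaled KL term between the frozen old model's logits and the new model's logits; the temperature and weighting below are placeholders, and this generic form is not the RP-Dist or SH-Dist formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(new_logits, old_logits, temperature=2.0):
    """Generic knowledge-distillation term between new and frozen old model."""
    t = temperature
    log_p_new = F.log_softmax(new_logits / t, dim=-1)
    p_old = F.softmax(old_logits.detach() / t, dim=-1)  # old model is frozen
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * t * t

# Total objective: task loss on new data plus the distillation regularizer, e.g.
# loss = ce_loss + lambda_kd * distillation_loss(new_logits, old_logits)
```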

* To appear in MICCAI 2023. Code availability: https://github.com/longbai1006/CS-VQLA 

LLCaps: Learning to Illuminate Low-Light Capsule Endoscopy with Curved Wavelet Attention and Reverse Diffusion

Jul 22, 2023
Long Bai, Tong Chen, Yanan Wu, An Wang, Mobarakol Islam, Hongliang Ren

Wireless capsule endoscopy (WCE) is a painless and non-invasive diagnostic tool for gastrointestinal (GI) diseases. However, due to GI anatomical constraints and hardware manufacturing limitations, WCE images may suffer from insufficient illumination, which complicates screening and examination. Deep learning-based low-light image enhancement (LLIE) is gradually attracting attention in the medical field. Given the rapid development of denoising diffusion probabilistic models (DDPMs) in computer vision, we introduce a WCE LLIE framework based on a multi-scale convolutional neural network (CNN) and the reverse diffusion process. The multi-scale design allows the model to preserve high-resolution representations together with contextual information from low-resolution features, while the proposed curved wavelet attention (CWA) block learns high-frequency and local features. Furthermore, we apply the reverse diffusion procedure to refine the shallow output and generate more realistic images. The proposed method is compared with ten state-of-the-art (SOTA) LLIE methods and significantly outperforms them both quantitatively and qualitatively. The superior performance on GI disease segmentation further demonstrates the clinical potential of our proposed model. Our code is publicly accessible.
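
The reverse diffusion procedure mentioned above builds on the standard DDPM sampling update; below is a minimal sketch of one generic reverse step (unconditional, with the sigma_t^2 = beta_t variance choice), not the paper's conditioned refinement.

```python
import torch

@torch.no_grad()
def ddpm_reverse_step(x_t, t, eps_model, betas):
    """One standard DDPM reverse step x_t -> x_{t-1} (t is an int timestep)."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    beta_t, alpha_t, abar_t = betas[t], alphas[t], alpha_bar[t]

    eps = eps_model(x_t, t)  # predicted noise at timestep t
    mean = (x_t - beta_t / torch.sqrt(1.0 - abar_t) * eps) / torch.sqrt(alpha_t)
    if t == 0:
        return mean
    noise = torch.randn_like(x_t)
    return mean + torch.sqrt(beta_t) * noise  # add sampling noise for t > 0
```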

* To appear in MICCAI 2023. Code availability: https://github.com/longbai1006/LLCaps 

CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

Jul 22, 2023
Long Bai, Mobarakol Islam, Hongliang Ren

Medical students and junior surgeons often rely on senior surgeons and specialists to answer their questions when learning surgery. However, experts are often busy with clinical and academic work and have little time to give guidance. Meanwhile, existing deep learning (DL)-based surgical Visual Question Answering (VQA) systems can only provide simple answers without locating where the answers apply. In addition, vision-language (ViL) embedding remains under-explored for these tasks. Therefore, a surgical Visual Question Localized-Answering (VQLA) system would help medical students and junior surgeons learn from recorded surgical videos. We propose an end-to-end Transformer with Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios, which does not require feature extraction through detection models. The CAT-ViL embedding module is designed to fuse multimodal features from visual and textual sources. The fused embedding is fed into a standard Data-Efficient Image Transformer (DeiT) module before parallel classification and detection heads for joint prediction. We conduct experimental validation on public surgical videos from the MICCAI EndoVis 2017 and 2018 challenges. The results highlight the superior performance and robustness of our proposed model compared to state-of-the-art approaches. Ablation studies further confirm the contribution of each proposed component. The proposed method provides a promising solution for surgical scene understanding and takes a first step toward Artificial Intelligence (AI)-based VQLA systems for surgical training. Our code is publicly available.
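
A hedged sketch of a gated vision-language fusion step feeding fused tokens to a DeiT-style backbone; the gating form, pooling, and dimensions are illustrative assumptions, not the paper's CAT-ViL module.

```python
import torch
import torch.nn as nn

class GatedVisionLanguageFusion(nn.Module):
    """Illustrative gated fusion of visual and textual token embeddings."""
    def __init__(self, d_model=768):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, d); txt_tokens: (B, Nt, d)
        txt_ctx = txt_tokens.mean(dim=1, keepdim=True).expand_as(vis_tokens)
        g = torch.sigmoid(self.gate(torch.cat([vis_tokens, txt_ctx], dim=-1)))
        fused_vis = g * vis_tokens + (1.0 - g) * txt_ctx  # gated mixing
        return torch.cat([fused_vis, txt_tokens], dim=1)  # tokens for DeiT
```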

* To appear in MICCAI 2023. Code availability: https://github.com/longbai1006/CAT-ViL 

SimCol3D -- 3D Reconstruction during Colonoscopy Challenge

Jul 20, 2023
Anita Rau, Sophia Bano, Yueming Jin, Pablo Azagra, Javier Morlana, Edward Sanderson, Bogdan J. Matuszewski, Jae Young Lee, Dong-Jae Lee, Erez Posner, Netanel Frank, Varshini Elangovan, Sista Raviteja, Zhengwen Li, Jiquan Liu, Seenivasan Lalithkumar, Mobarakol Islam, Hongliang Ren, José M. M. Montiel, Danail Stoyanov

Colorectal cancer is one of the most common cancers in the world. While colonoscopy is an effective screening technique, navigating an endoscope through the colon to detect polyps is challenging. A 3D map of the observed surfaces could enhance the identification of unscreened colon tissue and serve as a training platform. However, reconstructing the colon from video footage remains unsolved due to numerous factors such as self-occlusion, reflective surfaces, lack of texture, and tissue deformation that limit feature-based methods. Learning-based approaches hold promise as robust alternatives, but necessitate extensive datasets. By establishing a benchmark, the 2022 EndoVis sub-challenge SimCol3D aimed to facilitate data-driven depth and pose prediction during colonoscopy. The challenge was hosted as part of MICCAI 2022 in Singapore. Six teams from around the world and representatives from academia and industry participated in the three sub-challenges: synthetic depth prediction, synthetic pose prediction, and real pose prediction. This paper describes the challenge, the submitted methods, and their results. We show that depth prediction in virtual colonoscopy is robustly solvable, while pose estimation remains an open research question.

Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

Jul 11, 2023
Long Bai, Mobarakol Islam, Hongliang Ren

Medical students and junior surgeons often rely on senior surgeons and specialists to answer their questions when learning surgery. However, experts are often busy with clinical and academic work and have little time to give guidance. Meanwhile, existing deep learning (DL)-based surgical Visual Question Answering (VQA) systems can only provide simple answers without locating where the answers apply. In addition, vision-language (ViL) embedding remains under-explored for these tasks. Therefore, a surgical Visual Question Localized-Answering (VQLA) system would help medical students and junior surgeons learn from recorded surgical videos. We propose an end-to-end Transformer with Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios, which does not require feature extraction through detection models. The CAT-ViL embedding module is designed to fuse heterogeneous features from visual and textual sources. The fused embedding is fed into a standard Data-Efficient Image Transformer (DeiT) module before parallel classification and detection heads for joint prediction. We conduct experimental validation on public surgical videos from the MICCAI EndoVis 2017 and 2018 challenges. The results highlight the superior performance and robustness of our proposed model compared to state-of-the-art approaches. Ablation studies further confirm the contribution of each proposed component. The proposed method provides a promising solution for surgical scene understanding and takes a first step toward Artificial Intelligence (AI)-based VQLA systems for surgical training. Our code is publicly available.

* To appear in MICCAI 2023. Code availability: https://github.com/longbai1006/CAT-ViL 

Generalizing Surgical Instruments Segmentation to Unseen Domains with One-to-Many Synthesis

Jun 28, 2023
An Wang, Mobarakol Islam, Mengya Xu, Hongliang Ren

Despite their impressive performance in various surgical scene understanding tasks, deep learning-based methods are frequently hindered from deployment in real-world surgical applications for various reasons. In particular, data collection, annotation, and domain shift between sites and patients are the most common obstacles. In this work, we mitigate data-related issues by efficiently leveraging minimal source images to generate synthetic surgical instrument segmentation datasets and achieve outstanding generalization performance on unseen real domains. Specifically, our framework takes only one background tissue image and at most three images of each foreground instrument as seed images. These source images are extensively transformed and used to build up foreground and background image pools, from which randomly sampled tissue and instrument images are composed with multiple blending techniques to generate new surgical scene images. Besides, we introduce hybrid training-time augmentations to further diversify the training data. Extensive evaluation on three real-world datasets, i.e., Endo2017, Endo2018, and RoboTool, demonstrates that our one-to-many framework for synthetic surgical instrument dataset generation and segmentation achieves encouraging performance compared with training on real data. Notably, on the RoboTool dataset, where a more significant domain gap exists, our framework shows superior generalization by a considerable margin. We expect these results to draw research attention to improving model generalization through data synthesis.
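
The foreground/background composition step can be illustrated with simple alpha blending of an instrument crop onto a tissue background using its silhouette mask; this is only one of several possible blending techniques, and the function name and array shapes below are assumptions.

```python
import numpy as np

def blend_instrument(tissue, instrument, mask, top_left):
    """Paste an instrument crop onto a tissue background via alpha blending.

    tissue:     HxWx3 float background image in [0, 1]
    instrument: hxwx3 float foreground crop in [0, 1]
    mask:       hxw   float alpha mask in [0, 1] (instrument silhouette)
    top_left:   (y, x) paste location inside the background
    """
    out = tissue.copy()
    y, x = top_left
    h, w = mask.shape
    alpha = mask[..., None]
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = alpha * instrument + (1.0 - alpha) * region
    return out
```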

* First two authors contributed equally. Accepted by IROS2023 