Abstract:Medical visual question answering (VQA) bridges the gap between visual information and clinical decision-making, enabling doctors to extract understanding from clinical images and videos. In particular, surgical VQA can enhance the interpretation of surgical data, aiding in accurate diagnoses, effective education, and clinical interventions. However, the inability of VQA models to visually indicate the regions of interest corresponding to the given questions results in incomplete comprehension of the surgical scene. To tackle this, we propose the surgical visual question localized-answering (VQLA) for precise and context-aware responses to specific queries regarding surgical images. Furthermore, to address the strong demand for safety in surgical scenarios and potential corruptions in image acquisition and transmission, we propose a novel approach called Calibrated Co-Attention Gated Vision-Language (C$^2$G-ViL) embedding to integrate and align multimodal information effectively. Additionally, we leverage the adversarial sample-based contrastive learning strategy to boost our performance and robustness. We also extend our EndoVis-18-VQLA and EndoVis-17-VQLA datasets to broaden the scope and application of our data. Extensive experiments on the aforementioned datasets demonstrate the remarkable performance and robustness of our solution. Our solution can effectively combat real-world image corruption. Thus, our proposed approach can serve as an effective tool for assisting surgical education, patient care, and enhancing surgical outcomes.
Abstract:As a crucial and intricate task in robotic minimally invasive surgery, reconstructing surgical scenes using stereo or monocular endoscopic video holds immense potential for clinical applications. NeRF-based techniques have recently garnered attention for the ability to reconstruct scenes implicitly. On the other hand, Gaussian splatting-based 3D-GS represents scenes explicitly using 3D Gaussians and projects them onto a 2D plane as a replacement for the complex volume rendering in NeRF. However, these methods face challenges regarding surgical scene reconstruction, such as slow inference, dynamic scenes, and surgical tool occlusion. This work explores and reviews state-of-the-art (SOTA) approaches, discussing their innovations and implementation principles. Furthermore, we replicate the models and conduct testing and evaluation on two datasets. The test results demonstrate that with advancements in these techniques, achieving real-time, high-quality reconstructions becomes feasible.




Abstract:The recent Segment Anything Model (SAM) 2 has demonstrated remarkable foundational competence in semantic segmentation, with its memory mechanism and mask decoder further addressing challenges in video tracking and object occlusion, thereby achieving superior results in interactive segmentation for both images and videos. Building upon our previous empirical studies, we further explore the zero-shot segmentation performance of SAM 2 in robot-assisted surgery based on prompts, alongside its robustness against real-world corruption. For static images, we employ two forms of prompts: 1-point and bounding box, while for video sequences, the 1-point prompt is applied to the initial frame. Through extensive experimentation on the MICCAI EndoVis 2017 and EndoVis 2018 benchmarks, SAM 2, when utilizing bounding box prompts, outperforms state-of-the-art (SOTA) methods in comparative evaluations. The results with point prompts also exhibit a substantial enhancement over SAM's capabilities, nearing or even surpassing existing unprompted SOTA methodologies. Besides, SAM 2 demonstrates improved inference speed and less performance degradation against various image corruption. Although slightly unsatisfactory results remain in specific edges or regions, SAM 2's robust adaptability to 1-point prompts underscores its potential for downstream surgical tasks with limited prompt requirements.




Abstract:Continuum robots can be miniaturized to just a few millimeters in diameter. Among these, notched tubular continuum robots (NTCR) show great potential in many delicate applications. Existing works in robotic modeling focus on kinematics and dynamics but still face challenges in reproducing the robot's morphology -- a significant factor that can expand the research landscape of continuum robots, especially for those with asymmetric continuum structures. This paper proposes a dual stereo vision-based method for the three-dimensional morphological reconstruction of millimeter-scale NTCRs. The method employs two oppositely located stationary binocular cameras to capture the point cloud of the NTCR, then utilizes predefined geometry as a reference for the KD tree method to relocate the capture point clouds, resulting in a morphologically correct NTCR despite the low-quality raw point cloud collection. The method has been proved feasible for an NTCR with a 3.5 mm diameter, capturing 14 out of 16 notch features, with the measurements generally centered around the standard of 1.5 mm, demonstrating the capability of revealing morphological details. Our proposed method paves the way for 3D morphological reconstruction of millimeter-scale soft robots for further self-modeling study.




Abstract:The recent advance in neural rendering has enabled the ability to reconstruct high-quality 4D scenes using neural networks. Although 4D neural reconstruction is popular, registration for such representations remains a challenging task, especially for dynamic scene registration in surgical planning and simulation. In this paper, we propose a novel strategy for dynamic surgical neural scene registration. We first utilize 4D Gaussian Splatting to represent the surgical scene and capture both static and dynamic scenes effectively. Then, a spatial aware feature aggregation method, Spatially Weight Cluttering (SWC) is proposed to accurately align the feature between surgical scenes, enabling precise and realistic surgical simulations. Lastly, we present a novel strategy of deformable scene registration to register two dynamic scenes. By incorporating both spatial and temporal information for correspondence matching, our approach achieves superior performance compared to existing registration methods for implicit neural representation. The proposed method has the potential to improve surgical planning and training, ultimately leading to better patient outcomes.
Abstract:Surgical instrument segmentation is crucial in surgical scene understanding, thereby facilitating surgical safety. Existing algorithms directly detected all instruments of pre-defined categories in the input image, lacking the capability to segment specific instruments according to the surgeon's intention. During different stages of surgery, surgeons exhibit varying preferences and focus toward different surgical instruments. Therefore, an instrument segmentation algorithm that adheres to the surgeon's intention can minimize distractions from irrelevant instruments and assist surgeons to a great extent. The recent Segment Anything Model (SAM) reveals the capability to segment objects following prompts, but the manual annotations for prompts are impractical during the surgery. To address these limitations in operating rooms, we propose an audio-driven surgical instrument segmentation framework, named ASI-Seg, to accurately segment the required surgical instruments by parsing the audio commands of surgeons. Specifically, we propose an intention-oriented multimodal fusion to interpret the segmentation intention from audio commands and retrieve relevant instrument details to facilitate segmentation. Moreover, to guide our ASI-Seg segment of the required surgical instruments, we devise a contrastive learning prompt encoder to effectively distinguish the required instruments from the irrelevant ones. Therefore, our ASI-Seg promotes the workflow in the operating rooms, thereby providing targeted support and reducing the cognitive load on surgeons. Extensive experiments are performed to validate the ASI-Seg framework, which reveals remarkable advantages over classical state-of-the-art and medical SAMs in both semantic segmentation and intention-oriented segmentation. The source code is available at https://github.com/Zonmgin-Zhang/ASI-Seg.




Abstract:Sampling-based kinodynamic motion planners (SKMPs) are powerful in finding collision-free trajectories for high-dimensional systems under differential constraints. Time-informed set (TIS) can provide the heuristic search domain to accelerate their convergence to the time-optimal solution. However, existing TIS approximation methods suffer from the curse of dimensionality, computational burden, and limited system applicable scope, e.g., linear and polynomial nonlinear systems. To overcome these problems, we propose a method by leveraging deep learning technology, Koopman operator theory, and random set theory. Specifically, we propose a Deep Invertible Koopman operator with control U model named DIKU to predict states forward and backward over a long horizon by modifying the auxiliary network with an invertible neural network. A sampling-based approach, ASKU, performing reachability analysis for the DIKU is developed to approximate the TIS of nonlinear control systems online. Furthermore, we design an online time-informed SKMP using a direct sampling technique to draw uniform random samples in the TIS. Simulation experiment results demonstrate that our method outperforms other existing works, approximating TIS in near real-time and achieving superior planning performance in several time-optimal kinodynamic motion planning problems.




Abstract:Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels remains underexplored. To tackle this, we introduce EndoUIC, a WCE unified illumination correction solution using an end-to-end promptable diffusion transformer (DFT) model. In our work, the illumination prompt module shall navigate the model to adapt to different exposure levels and perform targeted image enhancement, in which the Adaptive Prompt Integration (API) and Global Prompt Scanner (GPS) modules shall further boost the concurrent representation learning between the prompt parameters and features. Besides, the U-shaped restoration DFT model shall capture the long-range dependencies and contextual information for unified illumination restoration. Moreover, we present a novel Capsule-endoscopy Exposure Correction (CEC) dataset, including ground-truth and corrupted image pairs annotated by expert photographers. Extensive experiments against a variety of state-of-the-art (SOTA) methods on four datasets showcase the effectiveness of our proposed method and components in WCE illumination restoration, and the additional downstream experiments further demonstrate its utility for clinical diagnosis and surgical assistance.




Abstract:As the significance of simulation in medical care and intervention continues to grow, it is anticipated that a simplified and low-cost platform can be set up to execute personalized diagnoses and treatments. 3D Slicer can not only perform medical image analysis and visualization but can also provide surgical navigation and surgical planning functions. In this paper, we have chosen 3D Slicer as our base platform and monocular cameras are used as sensors. Then, We used the neural radiance fields (NeRF) algorithm to complete the 3D model reconstruction of the human head. We compared the accuracy of the NeRF algorithm in generating 3D human head scenes and utilized the MarchingCube algorithm to generate corresponding 3D mesh models. The individual's head pose, obtained through single-camera vision, is transmitted in real-time to the scene created within 3D Slicer. The demonstrations presented in this paper include real-time synchronization of transformations between the human head model in the 3D Slicer scene and the detected head posture. Additionally, we tested a scene where a tool, marked with an ArUco Maker tracked by a single camera, synchronously points to the real-time transformation of the head posture. These demos indicate that our methodology can provide a feasible real-time simulation platform for nasopharyngeal swab collection or intubation.
Abstract:Foundation models have become prominent in computer vision, achieving notable success in various tasks. However, their effectiveness largely depends on pre-training with extensive datasets. Applying foundation models directly to small datasets of capsule endoscopy images from scratch is challenging. Pre-training on broad, general vision datasets is crucial for successfully fine-tuning our model for specific tasks. In this work, we introduce a simplified approach called Adapt foundation models with a low-rank adaptation (LoRA) technique for easier customization. Our method, inspired by the DINOv2 foundation model, applies low-rank adaptation learning to tailor foundation models for capsule endoscopy diagnosis effectively. Unlike traditional fine-tuning methods, our strategy includes LoRA layers designed to absorb specific surgical domain knowledge. During the training process, we keep the main model (the backbone encoder) fixed and focus on optimizing the LoRA layers and the disease classification component. We tested our method on two publicly available datasets for capsule endoscopy disease classification. The results were impressive, with our model achieving 97.75% accuracy on the Kvasir-Capsule dataset and 98.81% on the Kvasirv2 dataset. Our solution demonstrates that foundation models can be adeptly adapted for capsule endoscopy diagnosis, highlighting that mere reliance on straightforward fine-tuning or pre-trained models from general computer vision tasks is inadequate for such specific applications.