Yueming Jin

Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization

Jun 04, 2024

Yunpeng Zhao, Cheng Chen, Qing You Pang, Quanzheng Li, Carol Tang, Beng-Ti Ang, Yueming Jin

Abstract:Addressing missing modalities presents a critical challenge in multimodal learning. Current approaches focus on developing models that can handle modality-incomplete inputs during inference, assuming that the full set of modalities is available for all the data during training. This reliance on full-modality data for training limits the use of abundant modality-incomplete samples that are often encountered in practical settings. In this paper, we propose a robust universal model with modality reconstruction and model personalization, which can effectively tackle the missing modality at both training and testing stages. Our method leverages a multimodal masked autoencoder to reconstruct the missing modality and masked patches simultaneously, incorporating an innovative distribution approximation mechanism to fully utilize both modality-complete and modality-incomplete data. The reconstructed modalities then contribute to our designed data-model co-distillation scheme to guide the model learning in the presence of missing modalities. Moreover, we propose a CLIP-driven hyper-network to personalize partial model parameters, enabling the model to adapt to each distinct missing modality scenario. Our method has been extensively validated on two brain tumor segmentation benchmarks. Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches under the all-stage missing modality settings with different missing ratios. Code will be available.

MGI: Multimodal Contrastive pre-training of Genomic and Medical Imaging

Jun 02, 2024

Jiaying Zhou, Mingzhou Jiang, Junde Wu, Jiayuan Zhu, Ziyue Wang, Yueming Jin

Abstract:Medicine is inherently a multimodal discipline. Medical images can reflect the pathological changes of cancer and tumors, while the expression of specific genes can influence their morphological characteristics. However, most deep learning models employed for these medical tasks are unimodal, making predictions using either image data or genomic data exclusively. In this paper, we propose a multimodal pre-training framework that jointly incorporates genomics and medical images for downstream tasks. To address the issues of high computational complexity and difficulty in capturing long-range dependencies in gene sequence modeling with MLP or Transformer architectures, we utilize Mamba to model these long genomic sequences. We align medical images and genes using a self-supervised contrastive learning approach which combines Mamba as a genetic encoder and the Vision Transformer (ViT) as a medical image encoder. We pre-trained the model on the TCGA dataset using paired gene expression data and imaging data, and fine-tuned it for downstream tumor segmentation tasks. The results show that our model outperformed a wide range of related methods.
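As a rough illustration of the image-gene alignment described in this abstract, the sketch below computes a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. The abstract does not specify the exact loss, so the symmetric form, the function name `symmetric_contrastive_loss`, and the temperature of 0.07 are all assumptions, and plain Python lists stand in for the outputs of the ViT and Mamba encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(image_emb, gene_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row k of each input is assumed to be a matched pair.

    Hypothetical helper for illustration, not the paper's implementation.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    genes = [l2_normalize(v) for v in gene_emb]
    n = len(imgs)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, gene)) / temperature for gene in genes]
        for img in imgs
    ]
    # Image-to-gene direction: the correct "class" for row k is column k.
    loss_i2g = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Gene-to-image direction: same computation on the transposed matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_g2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2g + loss_g2i)


# Toy batch: matched pairs point in the same direction, mismatched pairs do not.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                        [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss for the aligned pairs is close to zero while the loss for the swapped pairs is large, which is the behavior a contrastive pre-training objective relies on.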

Deform3DGS: Flexible Deformation for Fast Surgical Scene Reconstruction with Gaussian Splatting

May 29, 2024

Shuojue Yang, Qian Li, Daiyun Shen, Bingchen Gong, Qi Dou, Yueming Jin

Abstract:Tissue deformation poses a key challenge for accurate surgical scene reconstruction. Despite yielding high reconstruction quality, existing methods suffer from slow rendering speeds and long training times, limiting their intraoperative applicability. Motivated by recent progress in 3D Gaussian Splatting, an emerging technology in real-time 3D rendering, this work presents a novel fast reconstruction framework, termed Deform3DGS, for deformable tissues during endoscopic surgery. Specifically, we introduce 3D GS into surgical scenes by integrating a point cloud initialization to improve reconstruction. Furthermore, we propose a novel flexible deformation modeling scheme (FDM) to learn tissue deformation dynamics at the level of individual Gaussians. Our FDM can model the surface deformation with efficient representations, allowing for real-time rendering performance. More importantly, FDM significantly accelerates surgical scene reconstruction, demonstrating considerable clinical value, particularly in intraoperative settings where time efficiency is crucial. Experiments on DaVinci robotic surgery videos indicate the efficacy of our approach, showcasing superior reconstruction fidelity (PSNR: 37.90) and rendering speed (338.8 FPS) while substantially reducing training time to only 1 minute/scene.

* 10 pages, 2 figures, conference paper
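
For reference, the PSNR figure quoted in this abstract is conventionally computed from the mean squared error between a rendered frame and its ground-truth frame. The sketch below shows that standard definition on flattened pixel lists; the function name `psnr` and the [0, 1] intensity range are assumptions, not details taken from the paper.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    Hypothetical helper: inputs are flattened images with intensities in
    [0, max_value].
    """
    if not reference or len(reference) != len(rendered):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
example_db = psnr([0.0, 0.0], [0.1, 0.1])
```

Higher values mean the rendering is closer to the reference. On the speed side, 338.8 FPS corresponds to roughly 1000 / 338.8 ≈ 2.95 ms per rendered frame.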

Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms

Apr 14, 2024

Diandian Guo, Manxi Lin, Jialun Pei, He Tang, Yueming Jin, Pheng-Ann Heng

Abstract:A comprehensive understanding of surgical scenes allows for monitoring of the surgical process, reducing the occurrence of accidents and enhancing efficiency for medical professionals. Semantic modeling within operating rooms, as a scene graph generation (SGG) task, is challenging since it involves consecutive recognition of subtle surgical actions over prolonged periods. To address this challenge, we propose a Tri-modal (i.e., images, point clouds, and language) confluence with Temporal dynamics framework, termed TriTemp-OR. Diverging from previous approaches that integrated temporal information via memory graphs, our method embraces two advantages: 1) we directly exploit bi-modal temporal information from the video streaming for hierarchical feature interaction, and 2) the prior knowledge from Large Language Models (LLMs) is embedded to alleviate the class-imbalance problem in the operating theatre. Specifically, our model performs temporal interactions across 2D frames and 3D point clouds, including a scale-adaptive multi-view temporal interaction (ViewTemp) and a geometric-temporal point aggregation (PointTemp). Furthermore, we transfer knowledge from the biomedical LLM, LLaVA-Med, to deepen the comprehension of intraoperative relations. The proposed TriTemp-OR enables the aggregation of tri-modal features through relation-aware unification to predict relations so as to generate scene graphs. Experimental results on the 4D-OR benchmark demonstrate the superior performance of our model for long-term OR streaming.

* 10 pages, 4 figures, 3 tables

Uncertainty-Aware Adapter: Adapting Segment Anything Model (SAM) for Ambiguous Medical Image Segmentation

Mar 19, 2024

Mingzhou Jiang, Jiaying Zhou, Junde Wu, Tianyang Wang, Yueming Jin, Min Xu

Abstract:The Segment Anything Model (SAM) has achieved significant success in natural image segmentation, and many methods have tried to fine-tune it for medical image segmentation. An efficient way to do so is by using Adapters, specialized modules that learn just a few parameters to tailor SAM specifically for medical images. However, unlike natural images, many tissues and lesions in medical images have blurry boundaries and may be ambiguous. Previous efforts to adapt SAM ignore this challenge and can only predict distinct segmentation. This may mislead clinicians or cause misdiagnosis, especially when encountering rare variants or situations with low model confidence. In this work, we propose a novel module called the Uncertainty-aware Adapter, which efficiently fine-tunes SAM for uncertainty-aware medical image segmentation. Utilizing a conditional variational autoencoder, we encoded stochastic samples to effectively represent the inherent uncertainty in medical imaging. We designed a new module on a standard adapter that utilizes a condition-based strategy to interact with samples to help SAM integrate uncertainty. We evaluated our method on two multi-annotated datasets with different modalities: LIDC-IDRI (lung abnormalities segmentation) and REFUGE2 (optic-cup segmentation). The experimental results show that the proposed model outperforms all the previous methods and achieves the new state-of-the-art (SOTA) on both benchmarks. We also demonstrated that our method can generate diverse segmentation hypotheses that are more realistic as well as heterogeneous.

Not just Birds and Cars: Generic, Scalable and Explainable Models for Professional Visual Recognition

Mar 08, 2024

Junde Wu, Jiayuan Zhu, Min Xu, Yueming Jin

Abstract:Some visual recognition tasks are more challenging than the general ones as they require professional categories of images. The previous efforts, like fine-grained vision classification, primarily introduced models tailored to specific tasks, like identifying bird species or car brands, with limited scalability and generalizability. This paper aims to design a scalable and explainable model to solve Professional Visual Recognition tasks from a generic standpoint. We introduce a biologically-inspired structure named Pro-NeXt and reveal that Pro-NeXt exhibits substantial generalizability across diverse professional fields such as fashion, medicine, and art, areas previously considered disparate. Our basic-sized Pro-NeXt-B surpasses all preceding task-specific models across 12 distinct datasets within 5 diverse domains. Furthermore, we find that it scales well: scaling up Pro-NeXt in depth and width with increasing GFlops consistently enhances its accuracy. Beyond scalability and adaptability, the intermediate features of Pro-NeXt achieve reliable object detection and segmentation performance without extra training, highlighting its solid explainability. We will release the code to foster further research in this area.

* 20 pages including references. arXiv admin note: text overlap with arXiv:2211.15672

LLM-Assisted Multi-Teacher Continual Learning for Visual Question Answering in Robotic Surgery

Feb 26, 2024

Kexin Chen, Yuyang Du, Tao You, Mobarakol Islam, Ziyu Guo, Yueming Jin, Guangyong Chen, Pheng-Ann Heng

Abstract:Visual question answering (VQA) can be fundamentally crucial for promoting robotic-assisted surgical education. In practice, the needs of trainees are constantly evolving, such as learning more surgical types, adapting to different robots, and learning new surgical instruments and techniques for one surgery. Therefore, continually updating the VQA system by a sequential data stream from multiple resources is demanded in robotic surgery to address new tasks. In surgical scenarios, the storage cost and patient data privacy often restrict the availability of old data when updating the model, necessitating an exemplar-free continual learning (CL) setup. However, prior studies overlooked two vital problems of the surgical domain: i) large domain shifts from diverse surgical operations collected from multiple departments or clinical centers, and ii) severe data imbalance arising from the uneven presence of surgical instruments or activities during surgical procedures. This paper proposes to address these two problems with a multimodal large language model (LLM) and an adaptive weight assignment methodology. We first develop a new multi-teacher CL framework that leverages a multimodal LLM as the additional teacher. The strong generalization ability of the LLM can bridge the knowledge gap when domain shifts and data imbalances occur. We then put forth a novel data processing method that transforms complex LLM embeddings into logits compatible with our CL framework. We further design an adaptive weight assignment approach that balances the generalization ability of the LLM and the domain expertise of the old CL model. We construct a new dataset for surgical VQA tasks, providing valuable data resources for future research. Extensive experimental results on three datasets demonstrate the superiority of our method to other advanced CL models.

* This paper has been accepted by the 2024 IEEE International Conference on Robotics and Automation (ICRA)

S^2Former-OR: Single-Stage Bimodal Transformer for Scene Graph Generation in OR

Feb 22, 2024

Jialun Pei, Diandian Guo, Jingyang Zhang, Manxi Lin, Yueming Jin, Pheng-Ann Heng

Abstract:Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistically cognitive intelligence in the operating room (OR). However, previous works have primarily relied on multi-stage learning that generates semantic scene graphs dependent on intermediate processes with pose estimation and object detection, which may compromise model efficiency and efficacy and also impose an extra annotation burden. In this study, we introduce a novel single-stage bimodal transformer framework for SGG in the OR, termed S^2Former-OR, which aims to leverage multi-view 2D scenes and 3D point clouds in a complementary way for SGG in an end-to-end manner. Concretely, our model embraces a View-Sync Transfusion scheme to encourage multi-view visual information interaction. Concurrently, a Geometry-Visual Cohesion operation is designed to integrate the synergic 2D semantic features into 3D point cloud features. Moreover, based on the augmented feature, we propose a novel relation-sensitive transformer decoder that embeds dynamic entity-pair queries and relational trait priors, which enables the direct prediction of entity-pair relations for graph generation without intermediate steps. Extensive experiments have validated the superior SGG performance and lower computational cost of S^2Former-OR on the 4D-OR benchmark, compared with current OR-SGG methods, e.g., 3% Precision increase and 24.2M reduction in model parameters. We further compared our method with generic single-stage SGG methods with broader metrics for a comprehensive evaluation, with consistently better performance achieved. The code will be made available.

An objective comparison of methods for augmented reality in laparoscopic liver resection by preoperative-to-intraoperative image fusion

Feb 07, 2024

Sharib Ali, Yamid Espinel, Yueming Jin, Peng Liu, Bianca Güttner, Xukun Zhang, Lihua Zhang, Tom Dowrick, Matthew J. Clarkson, Shiting Xiao(+10 more)

Abstract:Augmented reality for laparoscopic liver resection is a visualisation mode that allows a surgeon to localise tumours and vessels embedded within the liver by projecting them on top of a laparoscopic image. Preoperative 3D models extracted from CT or MRI data are registered to the intraoperative laparoscopic images during this process. In terms of 3D-2D fusion, most of the algorithms make use of anatomical landmarks to guide registration. These landmarks include the liver's inferior ridge, the falciform ligament, and the occluding contours. They are usually marked by hand in both the laparoscopic image and the 3D model, which is time-consuming and may contain errors if done by a non-experienced user. Therefore, there is a need to automate this process so that augmented reality can be used effectively in the operating room. We present the Preoperative-to-Intraoperative Laparoscopic Fusion Challenge (P2ILF), held during the Medical Imaging and Computer Assisted Interventions (MICCAI 2022) conference, which investigates the possibilities of detecting these landmarks automatically and using them in registration. The challenge was divided into two tasks: 1) A 2D and 3D landmark detection task and 2) a 3D-2D registration task. The teams were provided with training data consisting of 167 laparoscopic images and 9 preoperative 3D models from 9 patients, with the corresponding 2D and 3D landmark annotations. A total of 6 teams from 4 countries participated, whose proposed methods were evaluated on 16 images and two preoperative 3D models from two patients. All the teams proposed deep learning-based methods for the 2D and 3D landmark segmentation tasks and differentiable rendering-based methods for the registration task. Based on the experimental outcomes, we propose three key hypotheses that determine current limitations and future directions for research in this domain.

* 24 pages

Video-Instrument Synergistic Network for Referring Video Instrument Segmentation in Robotic Surgery

Aug 18, 2023

Hongqiu Wang, Lei Zhu, Guang Yang, Yike Guo, Shichen Zhang, Bo Xu, Yueming Jin

Abstract:Robot-assisted surgery has made significant progress, with instrument segmentation being a critical factor in surgical intervention quality. It serves as the building block to facilitate surgical robot navigation and surgical education for the next generation of operating intelligence. Although existing methods have achieved accurate instrument segmentation results, they simultaneously generate segmentation masks for all instruments, without the capability to specify a target object and allow an interactive experience. This work explores a new task of Referring Surgical Video Instrument Segmentation (RSVIS), which aims to automatically identify and segment the corresponding surgical instruments based on the given language expression. To achieve this, we devise a novel Video-Instrument Synergistic Network (VIS-Net) to learn both video-level and instrument-level knowledge to boost performance, while previous work only used video-level information. Meanwhile, we design a Graph-based Relation-aware Module (GRM) to model the correlation between multi-modal information (i.e., textual description and video frame) to facilitate the extraction of instrument-level information. We are also the first to produce two RSVIS datasets to promote related research. Our method is verified on these datasets, and experimental results exhibit that the VIS-Net can significantly outperform existing state-of-the-art referring segmentation methods. Our code and our datasets will be released upon the publication of this work.
