Abstract:Registration between preoperative CT and intraoperative laparoscopic video plays a crucial role in augmented reality (AR) guidance for minimally invasive surgery. Learning-based methods have recently achieved registration errors comparable to optimization-based approaches while offering faster inference. However, many supervised methods produce coarse alignments that rely on additional optimization-based refinement, thereby increasing inference time. We present a discrete-action reinforcement learning (RL) framework that formulates CT-to-video registration as a sequential decision-making process. A shared feature encoder, warm-started from a supervised pose estimation network to provide stable geometric features and faster convergence, extracts representations from CT renderings and laparoscopic frames, while an RL policy head learns to choose rigid transformations along six degrees of freedom and to decide when to stop the iteration. Experiments on a public laparoscopic dataset demonstrated that our method achieved an average target registration error (TRE) of 15.70 mm, comparable to supervised approaches with optimization, while achieving faster convergence. The proposed RL-based formulation enables automated, efficient iterative registration without manually tuned step sizes or stopping criteria. This discrete framework provides a practical foundation for future continuous-action and deformable registration models in surgical AR applications.




Abstract:Contrastive learning, a prominent approach within self-supervised learning, has demonstrated significant effectiveness in developing generalizable models for various applications involving natural images. However, recent research indicates that these successes do not necessarily extend to the medical imaging domain. In this paper, we investigate the reasons for this suboptimal performance and hypothesize that the dense distribution of medical images poses challenges to the pretext tasks in contrastive learning, particularly in constructing positive and negative pairs. We explore model performance under different augmentation strategies and compare the results to those achieved with strong augmentations. Our study includes six publicly available datasets covering multiple clinically relevant tasks. We further assess the model's generalizability through external evaluations. The model pre-trained with weak augmentation outperforms those with strong augmentation, improving AUROC from 0.838 to 0.848 and AUPR from 0.523 to 0.597 on MESSIDOR2, and showing similar enhancements across other datasets. Our findings suggest that optimizing the scale of augmentation is critical for enhancing the efficacy of contrastive learning in medical imaging.