Abstract: Accurate vision-based navigation in monocular endoscopy is difficult due to limited depth cues, weak tissue texture, non-rigid deformation, and substantial appearance variation across domains, all of which complicate pose estimation, depth prediction, and image-to-anatomy alignment. Although recent vision foundation models have shown promise, their learned representations often remain insufficiently geometry-consistent, hindering stable feature correspondence and limiting their reliability for downstream navigation tasks. We propose a unified framework for learning geometry-consistent and domain-robust image representations for monocular endoscopy. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to standard LoRA that inserts low-rank adapters selectively across the transformer hierarchy and couples them with layer-wise training objectives to encourage geometric correspondence in intermediate features and semantic consistency in deeper features. Experiments on public and proprietary datasets show improved geometric and semantic representation quality, leading to better performance on downstream navigation tasks including pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision. The framework also shows favorable scaling with model size and training data. These results support hierarchy-aware, geometry-guided adaptation as a practical approach for endoscopic representation learning.
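
The selective low-rank adaptation mentioned in this abstract can be illustrated with a minimal sketch of the standard LoRA update, y = xWᵀ + (alpha/rank)·xAᵀBᵀ, applied to only some layers of a stack. Everything specific here is an assumption for illustration: the class `LoRALinear`, the helper `adapt_selected_layers`, the rank, the scaling, and the adapted layer indices are hypothetical, and the abstract's layer-wise training objectives are not modeled.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (standard LoRA form)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = weight  # frozen, shape (out_dim, in_dim)
        # Trainable factors: A is small random, B starts at zero so the
        # adapted layer initially reproduces the frozen layer exactly.
        self.A = rng.normal(0.0, 0.01, (rank, weight.shape[1]))
        self.B = np.zeros((weight.shape[0], rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + (alpha / rank) * x A^T B^T
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


def adapt_selected_layers(weights, adapted_layers, rank=4):
    """Wrap only the layers whose index is in adapted_layers; others stay plain."""
    layers = []
    for i, w in enumerate(weights):
        if i in adapted_layers:
            layers.append(LoRALinear(w, rank=rank, seed=i))
        else:
            layers.append(lambda x, w=w: x @ w.T)
    return layers


# Hypothetical 6-layer stack; the adapted indices are illustrative only.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 16)) for _ in range(6)]
layers = adapt_selected_layers(weights, adapted_layers={2, 3, 5})
x = rng.normal(size=(1, 16))
for layer in layers:
    x = layer(x)
```

Because `B` is initialized to zero, each adapted layer starts out identical to its frozen counterpart, and only the low-rank factors would be updated during training.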




Abstract: Accurate intra-operative localization of the bronchoscope tip relative to patient anatomy remains challenging because respiratory motion, anatomical variability, and CT-to-body divergence cause deformation and misalignment between intra-operative views and pre-operative CT. Existing vision-based methods often fail to generalize across domains and patients, leading to residual alignment errors. This work establishes a generalizable foundation for bronchoscopy navigation through a robust vision-based framework and a new synthetic benchmark dataset that enables standardized and reproducible evaluation. We propose a vision-based pose optimization framework for frame-wise 2D-3D registration between intra-operative endoscopic views and pre-operative CT anatomy. A fine-tuned modality- and domain-invariant encoder enables direct similarity computation between real endoscopic RGB frames and CT-rendered depth maps, while a differentiable rendering module iteratively refines camera poses through depth consistency. To enhance reproducibility, we introduce the first public synthetic benchmark dataset for bronchoscopy navigation, addressing the lack of paired CT-endoscopy data. Trained exclusively on synthetic data distinct from the benchmark, our model achieves an average translational error of 2.65 mm and an average rotational error of 0.19 rad, demonstrating accurate and stable localization. Qualitative results on real patient data further indicate strong cross-domain generalization, with consistent frame-wise 2D-3D alignment obtained without domain-specific adaptation. Overall, the proposed framework provides robust, domain-invariant localization through iterative vision-based optimization, while the new benchmark provides a foundation for standardized progress in vision-based bronchoscopy navigation.
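
The translational and rotational errors reported in this abstract can be illustrated with a common pair of pose metrics. The abstract does not say how its errors are defined, so this sketch assumes the Euclidean distance between camera positions and the geodesic angle between rotation matrices; the function name `pose_errors` is hypothetical.

```python
import numpy as np


def pose_errors(R_est, t_est, R_gt, t_gt):
    """Return (translation error, rotation error in radians) between two poses.

    Assumed definitions: Euclidean distance between translations and the
    geodesic angle of the relative rotation R_est^T R_gt.
    """
    t_err = float(np.linalg.norm(np.asarray(t_est, float) - np.asarray(t_gt, float)))
    R_rel = np.asarray(R_est, float).T @ np.asarray(R_gt, float)
    # trace(R) = 1 + 2 cos(theta); clip guards against round-off outside [-1, 1].
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return t_err, float(np.arccos(cos_angle))


def rot_z(theta):
    """Rotation by theta radians about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Illustrative poses only: an estimate offset by 2.65 units and 0.19 rad.
t_err, r_err = pose_errors(rot_z(0.19), [2.65, 0.0, 0.0], np.eye(3), [0.0, 0.0, 0.0])
```

Averaging these two quantities over all frames of a sequence gives summary figures in the same units as those quoted, provided translations are expressed in millimetres.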




Abstract: Image-guided surgery collocates patient-specific data with the physical environment to facilitate surgical decision making in real time. Unfortunately, these guidance systems are commonly compromised by intraoperative soft-tissue deformations. Nonrigid image-to-physical registration methods have been proposed to compensate for these deformations, but intraoperative clinical utility requires that these techniques be compatible with the data sparsity and time constraints of the operating room. While linear elastic finite element models are effective in sparse data scenarios, the computation time for finite element simulation remains a limitation to widespread deployment. This paper proposes a registration algorithm that uses regularized Kelvinlets, which are analytical solutions to linear elasticity in an infinite domain, to overcome these barriers. This algorithm is demonstrated and compared to finite element-based registration on two datasets: a phantom dataset representing liver deformations and an in vivo dataset representing breast deformations. The regularized Kelvinlets algorithm resulted in a significant reduction in computation time compared to the finite element method. Accuracy, as evaluated by target registration error, was comparable between the two methods. Average target registration errors were 4.6 ± 1.0 and 3.2 ± 0.8 mm on the liver dataset and 5.4 ± 1.4 and 6.4 ± 1.5 mm on the breast dataset for the regularized Kelvinlets and finite element method models, respectively. This work demonstrates that a regularized Kelvinlets registration algorithm generalizes across multiple elastic soft-tissue organs. This method may improve and accelerate registration for image-guided surgery applications, and it shows the potential of using regularized Kelvinlets solutions on medical imaging data.
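
To make the "analytical solution" concrete, here is a minimal sketch of a single regularized Kelvinlet as it is commonly written in the Kelvinlets literature: the displacement at offset r from a smoothed point force f is u(r) = [ (a - b)/r_ε · I + b/r_ε³ · r rᵀ + (a/2)·ε²/r_ε³ · I ] f, with r_ε = sqrt(|r|² + ε²), a = 1/(4πμ), and b = a/(4(1 - ν)). This is an assumption about the underlying closed form, not the paper's registration algorithm; the function names and the default material parameters are hypothetical.

```python
import numpy as np


def regularized_kelvinlet(r, f, eps, mu=1.0, nu=0.45):
    """Displacement at offset r from a regularized point force f.

    mu is the shear modulus, nu the Poisson ratio, eps the regularization
    radius. Default values are illustrative, not taken from the paper.
    """
    r = np.asarray(r, dtype=float)
    f = np.asarray(f, dtype=float)
    a = 1.0 / (4.0 * np.pi * mu)
    b = a / (4.0 * (1.0 - nu))
    r_eps = np.sqrt(r @ r + eps**2)  # regularized distance, never zero for eps > 0
    isotropic = (a - b) / r_eps + 0.5 * a * eps**2 / r_eps**3
    K = isotropic * np.eye(3) + (b / r_eps**3) * np.outer(r, r)
    return K @ f


def kelvinlet_field(points, centers, forces, eps):
    """Superpose several Kelvinlets: linear elasticity lets displacements add."""
    points = np.asarray(points, dtype=float)
    out = np.zeros_like(points)
    for c, f in zip(centers, forces):
        for i, p in enumerate(points):
            out[i] += regularized_kelvinlet(p - np.asarray(c, dtype=float), f, eps)
    return out
```

Evaluating such a field is a direct formula per point, with no mesh or linear solve, which is consistent with the reduced computation time the abstract reports relative to finite element simulation.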