Institute of Medical Robotics, School of Biomedical Engineering, Shanghai Jiao Tong University
Abstract:Gastrointestinal diseases impose a growing global health burden, and endoscopy is a primary tool for early diagnosis. However, routine endoscopic image interpretation still suffers from missed lesions and limited efficiency. Although AI-assisted diagnosis has shown promise, existing models often lack generalizability, adaptability, robustness, and scalability because of limited medical data, domain shift, and heterogeneous annotations. To address these challenges, we develop RATNet, a foundation model for gastrointestinal endoscopy imaging based on analogical reasoning. RATNet acquires and transfers knowledge from heterogeneous expert annotations across five gastrointestinal endoscopy datasets through a cyclic pre-training strategy. Its architecture consists of an encoder, a relevance-knowledge acquisition and transfer (RAT) module, a projector, and a multi-task head, and supports fine-tuning, linear probing, and zero-shot transfer. Evaluations show that RATNet outperforms existing foundation models, including GastroNet and GastroVision, across six scenarios: diagnosis of common gastrointestinal diseases, few-shot learning for rare diseases, zero-shot transfer to new medical sites, robustness under long-tailed disease distributions, adaptation to novel diseases, and privacy-preserving deployment via federated learning. Its advantage comes from an analogical reasoning mechanism that matches image-derived posterior knowledge to a learned prior knowledge base and transfers relative knowledge to guide diagnosis, improving generalization and resistance to bias. RATNet is open and cost-effective, supports automatic integration of heterogeneous annotations without manual label unification, and reduces data acquisition costs, making it a practical foundation for intelligent gastrointestinal diagnosis, especially in resource-limited settings.




Abstract:Object tracking is divided into single-object tracking (SOT) and multi-object tracking (MOT). MOT aims to maintain the identities of multiple objects across a series of continuous video sequences. In recent years, MOT has made rapid progress. However, modeling the motion and appearance models of objects in complex scenes still faces various challenging issues. In this paper, we design a novel direction consistency method for smooth trajectory prediction (STP-DC) to increase the modeling of motion information and overcome the lack of robustness in previous methods in complex scenes. Existing methods use pedestrian re-identification (Re-ID) to model appearance, however, they extract more background information which lacks discriminability in occlusion and crowded scenes. We propose a hyper-grain feature embedding network (HG-FEN) to enhance the modeling of appearance models, thus generating robust appearance descriptors. We also proposed other robustness techniques, including CF-ECM for storing robust appearance information and SK-AS for improving association accuracy. To achieve state-of-the-art performance in MOT, we propose a robust tracker named Rt-track, incorporating various tricks and techniques. It achieves 79.5 MOTA, 76.0 IDF1 and 62.1 HOTA on the test set of MOT17.Rt-track also achieves 77.9 MOTA, 78.4 IDF1 and 63.3 HOTA on MOT20, surpassing all published methods.




Abstract:As an essential processing step before the fusing of infrared and visible images, the performance of image registration determines whether the two images can be fused at correct spatial position. In the actual scenario, the varied imaging devices may lead to a change in perspective or time gap between shots, making significant non-rigid spatial relationship in infrared and visible images. Even if a large number of feature points are matched, the registration accuracy may still be inadequate, affecting the result of image fusion and other vision tasks. To alleviate this problem, we propose a Semantic-Aware on-Demand registration network (SA-DNet), which mainly purpose is to confine the feature matching process to the semantic region of interest (sROI) by designing semantic-aware module (SAM) and HOL-Deep hybrid matching module (HDM). After utilizing TPS to transform infrared and visible images based on the corresponding feature points in sROI, the registered images are fused using image fusion module (IFM) to achieve a fully functional registration and fusion network. Moreover, we point out that for different demands, this type of approach allows us to select semantic objects for feature matching as needed and accomplishes task-specific registration based on specific requirements. To demonstrate the robustness of SA-DNet for non-rigid distortions, we conduct extensive experiments by comparing SA-DNet with five state-of-the-art infrared and visible image feature matching methods, and the experimental results show that our method adapts better to the presence of non-rigid distortions in the images and provides semantically well-registered images.