Abstract:With the advancement of edge computing, federated learning (FL) shows great promise as a privacy-preserving collaborative learning paradigm. However, one major challenge for FL is data heterogeneity, i.e., biased labeling preferences across multiple clients, which negatively impacts convergence and model performance. Most previous FL methods tackle data heterogeneity either locally or globally, neglecting the underlying class-wise structural information contained in each client. In this paper, we first study how data heterogeneity affects model divergence and decompose it into local, global, and sampling drift sub-problems. To exploit intra-client class-wise structural knowledge in handling these drifts, we propose Federated Learning with Structural Knowledge Collaboration (FedSKC). The key idea of FedSKC is to extract and transfer domain preferences from inter-client data distributions, offering diverse class-relevant knowledge and a fair convergence signal. FedSKC comprises three components: i) local contrastive learning, to prevent weight divergence resulting from local training; ii) global discrepancy aggregation, which addresses the parameter deviation between the server and clients; and iii) global period review, which corrects the sampling drift introduced by the server randomly selecting devices. We theoretically analyze FedSKC under non-convex objectives and empirically validate its superiority through extensive experiments.
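To make the local contrastive component concrete, below is a minimal sketch of a model-contrastive regularizer in the spirit of the description above. The loss form, the use of the previous-round local representation as the negative, and the temperature `tau` are illustrative assumptions, not FedSKC's actual formulation.

```python
import torch
import torch.nn.functional as F

def model_contrastive_loss(z_local, z_global, z_prev, tau=0.5):
    """Pull the current local representation toward the global model's
    representation (positive pair) and away from the previous-round local
    representation (negative pair). Illustrative sketch only."""
    pos = F.cosine_similarity(z_local, z_global, dim=-1) / tau   # (B,)
    neg = F.cosine_similarity(z_local, z_prev, dim=-1) / tau     # (B,)
    logits = torch.stack([pos, neg], dim=1)                      # (B, 2)
    labels = torch.zeros(z_local.size(0), dtype=torch.long,
                         device=z_local.device)                  # positive at index 0
    return F.cross_entropy(logits, labels)
```

In practice such a term would be added to the client's supervised loss during each local round, so that local updates stay anchored to the global model's feature space.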
Abstract:Background: Accurate assessment of metastatic burden in axillary lymph nodes is crucial for guiding breast cancer treatment decisions, yet conventional imaging modalities struggle to differentiate metastatic burden levels and capture comprehensive lymph node characteristics. This study leverages dual-energy computed tomography (DECT) to exploit spectral-spatial information for improved multi-class classification. Purpose: To develop a noninvasive DECT-based model classifying sentinel lymph nodes into three categories: no metastasis ($N_0$), low metastatic burden ($N_{+(1-2)}$), and heavy metastatic burden ($N_{+(\geq3)}$), thereby aiding therapeutic planning. Methods: We propose a novel space-squeeze method combining two innovations: (1) a channel-wise attention mechanism to compress and recalibrate spectral-spatial features across 11 energy levels, and (2) virtual class injection to sharpen inter-class boundaries and compact intra-class variations in the representation space. Results: Evaluated on 227 biopsy-confirmed cases, our method achieved an average test AUC of 0.86 (95% CI: 0.80-0.91) across three cross-validation folds, outperforming established CNNs (e.g., VGG and ResNet). The channel-wise attention and virtual class components individually improved AUC by 5.01% and 5.87%, respectively, demonstrating complementary benefits. Conclusions: The proposed framework enhances diagnostic AUC by effectively integrating DECT's spectral-spatial data and mitigating class ambiguity, offering a promising tool for noninvasive metastatic burden assessment in clinical practice.
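As a rough illustration of channel-wise recalibration over the 11 DECT energy levels, here is a minimal squeeze-and-excitation-style module; the layer sizes and reduction ratio are assumptions, and the paper's space-squeeze design may differ.

```python
import torch
import torch.nn as nn

class SpectralChannelAttention(nn.Module):
    """SE-style attention over DECT energy channels (illustrative sketch)."""
    def __init__(self, n_channels=11, reduction=4):
        super().__init__()
        hidden = max(n_channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(n_channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, n_channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, 11, H, W)
        w = x.mean(dim=(2, 3))                   # squeeze spatial dims -> (B, 11)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                             # recalibrate each energy level
```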
Abstract:Language-guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, including Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, requiring models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposal, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems.
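To illustrate the kind of binary integer program described above, here is a minimal sketch that maximizes semantic similarity subject to at most one label per mask and per-class count limits. The exact objective and whether counts enter as equalities or upper bounds are assumptions, not InstructSAM's published formulation.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def assign_labels(similarity, counts):
    """similarity: (n_masks, n_classes) semantic similarity scores.
    counts: (n_classes,) per-class object counts estimated by the VLM.
    Returns a binary (n_masks, n_classes) assignment matrix."""
    n, k = similarity.shape
    c = -similarity.ravel()                                # milp minimizes, so negate
    one_label_per_mask = LinearConstraint(
        np.kron(np.eye(n), np.ones((1, k))), 0, 1)         # each mask: at most one class
    respect_counts = LinearConstraint(
        np.kron(np.ones((1, n)), np.eye(k)), 0, counts)    # each class: at most its count
    res = milp(c, constraints=[one_label_per_mask, respect_counts],
               integrality=np.ones(n * k), bounds=Bounds(0, 1))
    return res.x.reshape(n, k).round().astype(int)
```

Because only similarities and counts enter the program, no per-mask confidence threshold is needed, which matches the threshold-free assignment claimed in the abstract.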
Abstract:Medical AI assistants support doctors in disease diagnosis, medical image analysis, and report generation. However, they still face significant challenges in clinical use, including limited accuracy with multimodal content and insufficient validation in real-world settings. We propose RCMed, a full-stack AI assistant that improves multimodal alignment in both input and output, enabling precise anatomical delineation, accurate localization, and reliable diagnosis through hierarchical vision-language grounding. A self-reinforcing correlation mechanism allows visual features to inform language context, while language semantics guide pixel-wise attention, forming a closed loop that refines both modalities. This correlation is enhanced by a color region description strategy, translating anatomical structures into semantically rich text to learn shape-location-text relationships across scales. Trained on 20 million image-mask-description triplets, RCMed achieves state-of-the-art precision in contextualizing irregular lesions and subtle anatomical boundaries, excelling in 165 clinical tasks across 9 modalities. It achieved a 23.5% relative improvement in cell segmentation from microscopy images over prior methods. RCMed's strong vision-language alignment enables exceptional generalization, with state-of-the-art performance in external validation across 20 clinically significant cancer types, including novel tasks. This work demonstrates how integrated multimodal models capture fine-grained patterns, enabling human-level interpretation in complex scenarios and advancing human-centric AI healthcare.
Abstract:Effective traffic signal control (TSC) is crucial for mitigating urban congestion and reducing emissions. Recently, reinforcement learning (RL) has been the research trend for TSC. However, existing RL algorithms face several real-world challenges that hinder their practical deployment in TSC: (1) Sensor accuracy deteriorates with increased sensor detection range, and data transmission is prone to noise, potentially resulting in unsafe TSC decisions. (2) During online RL training, interactions with the environment can be unstable, potentially leading to inappropriate traffic signal phase (TSP) selection and traffic congestion. (3) Most current TSC algorithms focus only on TSP decisions, overlooking the critical aspect of phase duration, which affects safety and efficiency. To overcome these challenges, we propose a robust two-stage fuzzy approach called FuzzyLight, which integrates compressed sensing and RL for TSC deployment. FuzzyLight offers several key contributions: (1) It employs fuzzy logic and compressed sensing to address sensor noise and enhance the efficiency of TSP decisions. (2) It maintains stable performance during training and combines fuzzy logic with RL to generate precise phases. (3) It works in real cities across 22 intersections and demonstrates superior performance in both real-world and simulated environments. Experimental results indicate that FuzzyLight enhances traffic efficiency by 48% compared to expert-designed timings in the real world. Furthermore, it achieves state-of-the-art (SOTA) performance in simulated environments using six real-world datasets with transmission noise. The code and a deployment video are available online.
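As a toy illustration of fuzzifying a noisy sensor reading before a phase decision, the sketch below maps a queue-length estimate to linguistic terms with triangular membership functions. The breakpoints and term names are made up for illustration and are not FuzzyLight's calibrated rules.

```python
def tri(x, a, b, c):
    """Triangular fuzzy membership: rises on [a, b], falls on [b, c]."""
    if a < x <= b:
        return (x - a) / (b - a)
    if b < x < c:
        return (c - x) / (c - b)
    return 0.0

def fuzzify_queue(queue_len):
    """Map a noisy queue-length estimate (vehicles) to linguistic terms."""
    return {
        "short":  tri(queue_len, -1, 0, 10),
        "medium": tri(queue_len, 5, 15, 25),
        "long":   tri(queue_len, 20, 35, 50),
    }

print(fuzzify_queue(12))  # e.g. partially "short" and "medium"
```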
Abstract:In China, receptionist nurses face overwhelming workloads in outpatient settings, limiting the time and attention they can devote to each patient and ultimately reducing service quality. In this paper, we present the Personalized Intelligent Outpatient Reception System (PIORS). The system integrates an LLM-based reception nurse and a collaboration between the LLM and the hospital information system (HIS) into a real outpatient reception setting, aiming to deliver personalized, high-quality, and efficient reception services. Additionally, to enhance the performance of LLMs in real-world healthcare scenarios, we propose a medical conversational data generation framework named Service Flow aware Medical Scenario Simulation (SFMSS), which adapts the LLM to real-world environments and the PIORS setting. We evaluate the effectiveness of PIORS and SFMSS through automatic and human assessments involving 15 users and 15 clinical experts. The results demonstrate that PIORS-Nurse outperforms all baselines, including the current state-of-the-art model GPT-4o, and aligns with human preferences and clinical needs. Further details and a demo can be found at https://github.com/FudanDISC/PIORS
Abstract:Several medical Multimodal Large Language Models (MLLMs) have been developed to address tasks involving visual images with textual instructions across various medical modalities, achieving impressive results. Most current medical generalist models are region-agnostic, treating the entire image as a holistic representation; however, they struggle to identify which specific regions they are focusing on when generating a sentence. To mimic the behavior of doctors, who typically begin by reviewing the entire image before concentrating on specific regions for a thorough evaluation, we aim to enhance the capability of medical MLLMs to understand anatomical regions within entire medical scans. To achieve this, we first formulate Region-Centric tasks and construct a large-scale dataset, MedRegInstruct, to incorporate regional information into training. Combining our collected dataset with other medical multimodal corpora for training, we propose a Region-Aware medical MLLM, MedRegA, the first bilingual generalist medical AI system to simultaneously handle image-level and region-level medical vision-language tasks across a broad range of modalities. MedRegA not only enables three region-centric tasks, but also achieves the best performance on visual question answering, report generation, and medical image classification over 8 modalities, showcasing significant versatility. Experiments demonstrate that our model not only delivers strong performance across various medical vision-language tasks in bilingual settings, but also recognizes and detects structures in multimodal medical scans, boosting the interpretability and user interactivity of medical MLLMs. Our project page is https://medrega.github.io.
Abstract:Radiology is a vital and complex component of modern clinical workflows and covers many tasks. Recently, vision-language (VL) foundation models in medicine have shown potential in processing multimodal information, offering a unified solution for various radiology tasks. However, existing studies either pre-trained VL models on natural data or did not fully integrate vision-language architecture and pretraining, often neglecting the unique multimodal complexity of radiology images and their textual contexts. Additionally, their practical applicability in real-world scenarios remains underexplored. Here, we present RadFound, a large and open-source vision-language foundation model tailored for radiology, trained on the most extensive dataset in this setting: over 8.1 million images and 250,000 image-text pairs covering 19 major organ systems and 10 imaging modalities. To establish expert-level multimodal perception and generation capabilities, RadFound introduces an enhanced vision encoder that captures intra-image local features and inter-image contextual information, together with a unified cross-modal learning design tailored to radiology. To fully assess the model's capabilities, we construct a benchmark, RadVLBench, including radiology interpretation tasks such as medical vision-language question-answering, as well as text generation tasks ranging from captioning to report generation. We also propose a human evaluation framework. When evaluated on the real-world benchmark involving three representative modalities: 2D images (chest X-rays), multi-view images (mammograms), and 3D images (thyroid CT scans), RadFound significantly outperforms other VL foundation models on both quantitative metrics and human evaluation. In summary, the development of RadFound represents an advancement in radiology generalists, demonstrating broad potential for integration into clinical workflows.
Abstract:Blind iris images, which result from unknown degradation during long-distance iris recognition, often lead to decreased recognition rates. Little existing literature offers a solution to this problem. In response, we propose a prior-embedding-driven architecture for long-distance blind iris recognition. We first propose a blind iris image restoration network called Iris-PPRGAN. To effectively restore the texture of the blind iris, Iris-PPRGAN includes a Generative Adversarial Network (GAN) used as a prior decoder and a DNN used as the encoder. To extract iris features more efficiently, we then propose a robust iris classifier, obtained by modifying the bottleneck module of InsightFace, which we call Insight-Iris. A low-quality blind iris image is first restored by Iris-PPRGAN; the restored iris image then undergoes recognition via Insight-Iris. Experimental results on the public CASIA-Iris-distance dataset demonstrate that our proposed method achieves significantly superior results to state-of-the-art blind iris restoration methods both quantitatively and qualitatively. Specifically, the recognition rate for long-distance blind iris images reaches 90% after processing with our method, an improvement of approximately ten percentage points over images without restoration.
Abstract:Cryo-Electron Tomography (cryo-ET) is a 3D imaging technology that facilitates the study of macromolecular structures at near-atomic resolution. Recent volumetric segmentation approaches for cryo-ET images have drawn widespread interest in the biological community. However, existing methods heavily rely on manually labeled data, which requires highly specialized skills, hindering the adoption of fully supervised approaches for cryo-ET images. Some unsupervised domain adaptation (UDA) approaches have been designed to enhance segmentation performance using unlabeled data, but applying them directly to cryo-ET image segmentation remains challenging due to two main issues: 1) the source data, usually obtained through simulation, contain a certain level of noise, while the target data, collected as raw data from real-world scenarios, have unpredictable noise levels; 2) the source data used for training typically consist of known macromolecules, while the target-domain data are often unknown, biasing the model's segmenter towards the known macromolecules and leading to a domain shift problem. To address these challenges, we introduce the first voxel-wise unsupervised domain adaptation approach, termed Vox-UDA, specifically for cryo-ET subtomogram segmentation. Vox-UDA incorporates a noise generation module to simulate target-like noise in the source dataset for cross-noise-level adaptation. Additionally, we propose a denoised pseudo-labeling strategy based on an improved bilateral filter to alleviate the domain shift problem. Experimental results on both simulated and real cryo-ET subtomogram datasets demonstrate the superiority of our proposed approach compared to state-of-the-art UDA methods.
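To give a sense of what denoised pseudo-labeling could look like in practice, here is a minimal sketch that applies slice-wise bilateral filtering to a subtomogram before thresholding the segmenter's probabilities into pseudo-labels. A plain OpenCV bilateral filter and the filter/threshold parameters stand in for the paper's improved variant and are assumptions.

```python
import numpy as np
import cv2

def denoise_subtomogram(volume, d=5, sigma_color=0.1, sigma_space=2.0):
    """Slice-wise bilateral filtering of a float32 cryo-ET subtomogram (D, H, W).
    Edge-preserving smoothing reduces target-domain noise before pseudo-labeling."""
    vol = volume.astype(np.float32)
    return np.stack([cv2.bilateralFilter(s, d, sigma_color, sigma_space)
                     for s in vol])

def pseudo_labels(prob_volume, threshold=0.5):
    """Threshold the segmenter's probabilities on the denoised volume
    into binary pseudo-labels used for adaptation."""
    return (prob_volume > threshold).astype(np.uint8)
```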