Abstract:The integration of large language models into healthcare necessitates a rigorous evaluation of their ethical reasoning, an area current benchmarks often overlook. We introduce PrinciplismQA, a comprehensive benchmark with 3,648 questions designed to systematically assess LLMs' alignment with core medical ethics. Grounded in Principlism, our benchmark features a high-quality dataset. This includes multiple-choice questions curated from authoritative textbooks and open-ended questions sourced from authoritative medical ethics case study literature, all validated by medical experts. Our experiments reveal a significant gap between models' ethical knowledge and their practical application, especially in dynamically applying ethical principles to real-world scenarios. Most LLMs struggle with dilemmas concerning Beneficence, often over-emphasizing other principles. Frontier closed-source models, driven by strong general capabilities, currently lead the benchmark. Notably, medical domain fine-tuning can enhance models' overall ethical competence, but further progress requires better alignment with medical ethical knowledge. PrinciplismQA offers a scalable framework to diagnose these specific ethical weaknesses, paving the way for more balanced and responsible medical AI.
Abstract:Suicide remains a pressing global health crisis, with over 720,000 deaths annually and millions more affected by suicide ideation (SI) and suicide attempts (SA). Early identification of suicidality-related factors (SrFs), including SI, SA, exposure to suicide (ES), and non-suicidal self-injury (NSSI), is critical for timely intervention. While prior studies have applied AI to detect SrFs in clinical notes, most treat suicidality as a binary classification task, overlooking the complexity of cooccurring risk factors. This study explores the use of generative large language models (LLMs), specifically GPT-3.5 and GPT-4.5, for multi-label classification (MLC) of SrFs from psychiatric electronic health records (EHRs). We present a novel end to end generative MLC pipeline and introduce advanced evaluation methods, including label set level metrics and a multilabel confusion matrix for error analysis. Finetuned GPT-3.5 achieved top performance with 0.94 partial match accuracy and 0.91 F1 score, while GPT-4.5 with guided prompting showed superior performance across label sets, including rare or minority label sets, indicating a more balanced and robust performance. Our findings reveal systematic error patterns, such as the conflation of SI and SA, and highlight the models tendency toward cautious over labeling. This work not only demonstrates the feasibility of using generative AI for complex clinical classification tasks but also provides a blueprint for structuring unstructured EHR data to support large scale clinical research and evidence based medicine.
Abstract:Large language models (LLMs) store vast amounts of information, making them powerful yet raising privacy and safety concerns when selective knowledge removal is required. Existing unlearning strategies, ranging from gradient-based fine-tuning and model editing to sparse autoencoder (SAE) steering, either lack interpretability or fail to provide a robust defense against adversarial prompts. We propose SAE-Guided Subspace Projection Unlearning (SSPU), a novel framework that leverages SAE features to drive targeted updates in the model's parameter space, enabling precise, interpretable, and robust unlearning. SSPU's three-stage pipeline performs data-driven layer and feature selection, subspace construction via QR decomposition, and constrained optimization that controls activations into an "irrelevant" subspace while preserving retained knowledge. Overall, we use SAE features to construct a subspace that supervises unlearning, refining the loss and adding a regularization term to guide interpretable parameter updates. In experiments on the WMDP-Cyber forget set and three utility benchmarks (MMLU, TruthfulQA, GSM8K), SSPU reduces harmful knowledge accuracy by 3.22% compared to the strongest baseline. It also improves adversarial robustness, lowering malicious accuracy under jailbreak prompts compared to baselines. Our findings expose the limitations of prior unlearning methods and demonstrate how interpretable subspace-guided optimization can achieve robust, controllable model behavior.
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in various domains, including radiology report generation. Previous approaches have attempted to utilize multimodal LLMs for this task, enhancing their performance through the integration of domain-specific knowledge retrieval. However, these approaches often overlook the knowledge already embedded within the LLMs, leading to redundant information integration and inefficient utilization of learned representations. To address this limitation, we propose RADAR, a framework for enhancing radiology report generation with supplementary knowledge injection. RADAR improves report generation by systematically leveraging both the internal knowledge of an LLM and externally retrieved information. Specifically, it first extracts the model's acquired knowledge that aligns with expert image-based classification outputs. It then retrieves relevant supplementary knowledge to further enrich this information. Finally, by aggregating both sources, RADAR generates more accurate and informative radiology reports. Extensive experiments on MIMIC-CXR, CheXpert-Plus, and IU X-ray demonstrate that our model outperforms state-of-the-art LLMs in both language quality and clinical accuracy
Abstract:This paper explores optimal data selection strategies for Reinforcement Learning with Verified Rewards (RLVR) training in the medical domain. While RLVR has shown exceptional potential for enhancing reasoning capabilities in large language models, most prior implementations have focused on mathematics and logical puzzles, with limited exploration of domain-specific applications like medicine. We investigate four distinct data sampling strategies from MedQA-USMLE: random sampling (baseline), and filtering using Phi-4, Gemma-3-27b-it, and Gemma-3-12b-it models. Using Gemma-3-12b-it as our base model and implementing Group Relative Policy Optimization (GRPO), we evaluate performance across multiple benchmarks including MMLU, GSM8K, MMLU-Pro, and CMMLU. Our findings demonstrate that models trained on filtered data generally outperform those trained on randomly selected samples. Notably, training on self-filtered samples (using Gemma-3-12b-it for filtering) achieved superior performance in medical domains but showed reduced robustness across different benchmarks, while filtering with larger models from the same series yielded better overall robustness. These results provide valuable insights into effective data organization strategies for RLVR in specialized domains and highlight the importance of thoughtful data selection in achieving optimal performance. You can access our repository (https://github.com/Qsingle/open-medical-r1) to get the codes.
Abstract:Parkinson's disease (PD) is a prevalent neurodegenerative disorder globally. The eye's retina is an extension of the brain and has great potential in PD screening. Recent studies have suggested that texture features extracted from retinal layers can be adopted as biomarkers for PD diagnosis under optical coherence tomography (OCT) images. Frequency domain learning techniques can enhance the feature representations of deep neural networks (DNNs) by decomposing frequency components involving rich texture features. Additionally, previous works have not exploited texture features for automated PD screening in OCT. Motivated by the above analysis, we propose a novel Adaptive Wavelet Filter (AWF) that serves as the Practical Texture Feature Amplifier to fully leverage the merits of texture features to boost the PD screening performance of DNNs with the aid of frequency domain learning. Specifically, AWF first enhances texture feature representation diversities via channel mixer, then emphasizes informative texture feature representations with the well-designed adaptive wavelet filtering token mixer. By combining the AWFs with the DNN stem, AWFNet is constructed for automated PD screening. Additionally, we introduce a novel Balanced Confidence (BC) Loss by mining the potential of sample-wise predicted probabilities of all classes and class frequency prior, to further boost the PD screening performance and trustworthiness of AWFNet. The extensive experiments manifest the superiority of our AWFNet and BC over state-of-the-art methods in terms of PD screening performance and trustworthiness.
Abstract:Large language models (LLMs) are increasingly applied to outpatient referral tasks across healthcare systems. However, there is a lack of standardized evaluation criteria to assess their effectiveness, particularly in dynamic, interactive scenarios. In this study, we systematically examine the capabilities and limitations of LLMs in managing tasks within Intelligent Outpatient Referral (IOR) systems and propose a comprehensive evaluation framework specifically designed for such systems. This framework comprises two core tasks: static evaluation, which focuses on evaluating the ability of predefined outpatient referrals, and dynamic evaluation, which evaluates capabilities of refining outpatient referral recommendations through iterative dialogues. Our findings suggest that LLMs offer limited advantages over BERT-like models, but show promise in asking effective questions during interactive dialogues.
Abstract:Objective. This paper presents an overview of generalizable and explainable artificial intelligence (XAI) in deep learning (DL) for medical imaging, aimed at addressing the urgent need for transparency and explainability in clinical applications. Methodology. We propose to use four CNNs in three medical datasets (brain tumor, skin cancer, and chest x-ray) for medical image classification tasks. In addition, we perform paired t-tests to show the significance of the differences observed between different methods. Furthermore, we propose to combine ResNet50 with five common XAI techniques to obtain explainable results for model prediction, aiming at improving model transparency. We also involve a quantitative metric (confidence increase) to evaluate the usefulness of XAI techniques. Key findings. The experimental results indicate that ResNet50 can achieve feasible accuracy and F1 score in all datasets (e.g., 86.31\% accuracy in skin cancer). Furthermore, the findings show that while certain XAI methods, such as XgradCAM, effectively highlight relevant abnormal regions in medical images, others, like EigenGradCAM, may perform less effectively in specific scenarios. In addition, XgradCAM indicates higher confidence increase (e.g., 0.12 in glioma tumor) compared to GradCAM++ (0.09) and LayerCAM (0.08). Implications. Based on the experimental results and recent advancements, we outline future research directions to enhance the robustness and generalizability of DL models in the field of biomedical imaging.
Abstract:A comprehensive and explicit understanding of surgical scenes plays a vital role in developing context-aware computer-assisted systems in the operating theatre. However, few works provide systematical analysis to enable hierarchical surgical scene understanding. In this work, we propose to represent the tasks set [phase recognition --> step recognition --> action and instrument detection] as multi-level semantic scene understanding (MSSU). For this target, we propose a novel hierarchical context transformer (HCT) network and thoroughly explore the relations across the different level tasks. Specifically, a hierarchical relation aggregation module (HRAM) is designed to concurrently relate entries inside multi-level interaction information and then augment task-specific features. To further boost the representation learning of the different tasks, inter-task contrastive learning (ICL) is presented to guide the model to learn task-wise features via absorbing complementary information from other tasks. Furthermore, considering the computational costs of the transformer, we propose HCT+ to integrate the spatial and temporal adapter to access competitive performance on substantially fewer tunable parameters. Extensive experiments on our cataract dataset and a publicly available endoscopic PSI-AVA dataset demonstrate the outstanding performance of our method, consistently exceeding the state-of-the-art methods by a large margin. The code is available at https://github.com/Aurora-hao/HCT.
Abstract:Fine-tuning significantly improves the performance of Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. This paper aims to provide an in-depth interpretation of the fine-tuning process through circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike previous studies \cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity} that focus on tasks where pre-trained models already perform well, we develop a set of mathematical tasks where fine-tuning yields substantial performance gains, which are closer to the practical setting. In our experiments, we identify circuits at various checkpoints during fine-tuning and examine the interplay between circuit analysis, fine-tuning methods, and task complexities. First, we find that while circuits maintain high node similarity before and after fine-tuning, their edges undergo significant changes, which is in contrast to the previous work \cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity} that show circuits only add some additional components after fine-tuning. Based on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA) method, which assigns ranks to layers based on edge changes in the circuits. Experimental results demonstrate that our circuit-based LoRA algorithm achieves an average performance improvement of 2.46\% over standard LoRA with similar parameter sizes. Furthermore, we explore how combining circuits from subtasks can enhance fine-tuning in compositional tasks, providing new insights into the design of such tasks and deepening the understanding of circuit dynamics and fine-tuning mechanisms.