Background: Synthetic computed tomography (sCT) has been proposed and increasingly clinically adopted to enable magnetic resonance imaging (MRI)-based radiotherapy. Deep learning (DL) has recently demonstrated the ability to generate accurate sCT from fixed MRI acquisitions. However, MRI protocols may change over time or differ between centres resulting in low-quality sCT due to poor model generalisation. Purpose: investigating domain randomisation (DR) to increase the generalisation of a DL model for brain sCT generation. Methods: CT and corresponding T1-weighted MRI with/without contrast, T2-weighted, and FLAIR MRI from 95 patients undergoing RT were collected, considering FLAIR the unseen sequence where to investigate generalisation. A ``Baseline'' generative adversarial network was trained with/without the FLAIR sequence to test how a model performs without DR. Image similarity and accuracy of sCT-based dose plans were assessed against CT to select the best-performing DR approach against the Baseline. Results: The Baseline model had the poorest performance on FLAIR, with mean absolute error (MAE)=106$\pm$20.7 HU (mean$\pm\sigma$). Performance on FLAIR significantly improved for the DR model with MAE=99.0$\pm$14.9 HU, but still inferior to the performance of the Baseline+FLAIR model (MAE=72.6$\pm$10.1 HU). Similarly, an improvement in $\gamma$-pass rate was obtained for DR vs Baseline. Conclusions: DR improved image similarity and dose accuracy on the unseen sequence compared to training only on acquired MRI. DR makes the model more robust, reducing the need for re-training when applying a model on sequences unseen and unavailable for retraining.
This study evaluated novel AI and deep learning generative methods to address AI bias for retinal diagnostic applications when specifically applied to diabetic retinopathy (DR). Bias often results from data imbalance. We specifically considered here a strong form of data imbalance corresponding to domain shift, where AI classifiers are faced at inference time with data and concepts they were not trained on initially (here the concept of diseased black individuals). A baseline DR diagnostics DLS designed to solve a two-class problem of referable vs not referable DR was used. We modified the public domain Kaggle-EyePACS dataset (88,692 fundi and 44,346 individuals), which was originally designed to be diverse with regard to ethnicity, as follows: 1) we expanded it to include clinician-annotated labels for race since those were not publicly available; 2) we excluded training exemplars for diseased black individuals in training, but not testing, to construct a new scenario of data imbalance with domain shift. For this domain shifted scenario, the accuracy (95% confidence intervals [CI]) of the baseline DR diagnostics DLS for whites was 73.0% (66.9%,79.2%) vs. blacks of 60.5% (53.5%,67.3%], demonstrating disparity of AI performance as measured by accuracy across races. By contrast, an AI approach leveraging generative models was used to train a new diagnostic DLS with additional synthetically generated data for the missing subpopulation (diseased blacks), which achieved accuracy for whites of 77.5% (71.7%,83.3%) and for blacks of 70.0% (63.7%,76.4%), demonstrating closer parity in accuracy across races. The new debiased DLS also showed improvement in sensitivity of over 21% for blacks, with the same level of specificity, when compared with the baseline DLS. These findings demonstrate the potential benefits of using novel generative methods for debiasing AI.
Purpose: To demonstrate that retinal microvasculature per se is a reliable biomarker for Diabetic Retinopathy (DR) and, by extension, cardiovascular diseases. Methods: Deep Learning Convolutional Neural Networks (CNN) applied to color fundus images for semantic segmentation of the blood vessels and severity classification on both vascular and full images. Vessel reconstruction through harmonic descriptors is also used as a smoothing and de-noising tool. The mathematical background of the theory is also outlined. Results: For diabetic patients, at least 93.8% of DR No-Refer vs. Refer classification can be related to vasculature defects. As for the Non-Sight Threatening vs. Sight Threatening case, the ratio is as high as 96.7%. Conclusion: In the case of DR, most of the disease biomarkers are related topologically to the vasculature. Translational Relevance: Experiments conducted on eye blood vasculature reconstruction as a biomarker shows a strong correlation between vasculature shape and later stages of DR.
Self supervised contrastive learning based pretraining allows development of robust and generalized deep learning models with small, labeled datasets, reducing the burden of label generation. This paper aims to evaluate the effect of CL based pretraining on the performance of referrable vs non referrable diabetic retinopathy (DR) classification. We have developed a CL based framework with neural style transfer (NST) augmentation to produce models with better representations and initializations for the detection of DR in color fundus images. We compare our CL pretrained model performance with two state of the art baseline models pretrained with Imagenet weights. We further investigate the model performance with reduced labeled training data (down to 10 percent) to test the robustness of the model when trained with small, labeled datasets. The model is trained and validated on the EyePACS dataset and tested independently on clinical data from the University of Illinois, Chicago (UIC). Compared to baseline models, our CL pretrained FundusNet model had higher AUC (CI) values (0.91 (0.898 to 0.930) vs 0.80 (0.783 to 0.820) and 0.83 (0.801 to 0.853) on UIC data). At 10 percent labeled training data, the FundusNet AUC was 0.81 (0.78 to 0.84) vs 0.58 (0.56 to 0.64) and 0.63 (0.60 to 0.66) in baseline models, when tested on the UIC dataset. CL based pretraining with NST significantly improves DL classification performance, helps the model generalize well (transferable from EyePACS to UIC data), and allows training with small, annotated datasets, therefore reducing ground truth annotation burden of the clinicians.
Deep learning algorithms have been used to detect diabetic retinopathy (DR) with specialist-level accuracy. This study aims to validate one such algorithm on a large-scale clinical population, and compare the algorithm performance with that of human graders. 25,326 gradable retinal images of patients with diabetes from the community-based, nation-wide screening program of DR in Thailand were analyzed for DR severity and referable diabetic macular edema (DME). Grades adjudicated by a panel of international retinal specialists served as the reference standard. Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate (by 23%) at the cost of slightly higher false positive rates (2%). Deep learning algorithms may serve as a valuable tool for DR screening.
Generic text rewriting is a prevalent large language model (LLM) application that covers diverse real-world tasks, such as style transfer, fact correction, and email editing. These tasks vary in rewriting objectives (e.g., factual consistency vs. semantic preservation), making it challenging to develop a unified model that excels across all dimensions. Existing methods often specialize in either a single task or a specific objective, limiting their generalizability. In this work, we introduce a generic model proficient in factuality, stylistic, and conversational rewriting tasks. To simulate real-world user rewrite requests, we construct a conversational rewrite dataset, ChatRewrite, that presents ``natural''-sounding instructions, from raw emails using LLMs. Combined with other popular rewrite datasets, including LongFact for the factuality rewrite task and RewriteLM for the stylistic rewrite task, this forms a broad benchmark for training and evaluating generic rewrite models. To align with task-specific objectives, we propose Dr Genre, a Decoupled-reward learning framework for Generic rewriting, that utilizes objective-oriented reward models with a task-specific weighting. Evaluation shows that \approach delivers higher-quality rewrites across all targeted tasks, improving objectives including instruction following (agreement), internal consistency (coherence), and minimal unnecessary edits (conciseness).
Diabetic retinopathy (DR) is a diabetes complication that affects eyes. DR is a primary cause of blindness in working-age people and it is estimated that 3 to 4 million people with diabetes are blinded by DR every year worldwide. Early diagnosis have been considered an effective way to mitigate such problem. The ultimate goal of our research is to develop novel machine learning techniques to analyze the DR images generated by the fundus camera for automatically DR diagnosis. In this paper, we focus on identifying small lesions on DR fundus images. The results from our analysis, which include the lesion category and their exact locations in the image, can be used to facilitate the determination of DR severity (indicated by DR stages). Different from traditional object detection for natural images, lesion detection for fundus images have unique challenges. Specifically, the size of a lesion instance is usually very small, compared with the original resolution of the fundus images, making them diffcult to be detected. We analyze the lesion-vs-image scale carefully and propose a large-size feature pyramid network (LFPN) to preserve more image details for mini lesion instance detection. Our method includes an effective region proposal strategy to increase the sensitivity. The experimental results show that our proposed method is superior to the original feature pyramid network (FPN) method and Faster RCNN.
We introduce Humans-Junior, a 3.8B model that matches GPT-4o on the FACTS Grounding public subset within a $\pm 5$ pp equivalence margin. Results. On Q1--Q500 under identical judges, GPT-4o scores 73.5% (95% CI 69.5--77.2) and Humans-Junior 72.7% (95% CI 68.7--76.5); the paired difference is 0.8 pp (bootstrap 95% CI $-3.1$ to $+4.7$; permutation $p = 0.72$; Cohen's $d = 0.023$). TOST establishes equivalence at $\pm 5$ pp (not at $\pm 3$ pp). When purchased as managed APIs, Humans-Junior's base model (Phi-3.5-mini-instruct) is $\approx 19\times$ less expensive than GPT-4o on Microsoft AI Foundry pricing; self-hosted or edge deployments can drive incremental inference cost toward zero. Measured vs estimated pricing sources are tabulated in Appendix E. Method. Our approach combines minimal directed "Exoskeleton Reasoning" scaffolds with behavioral fine-tuning that teaches protocol compliance (epistemic discipline) rather than domain answers. Fine-tuning alone adds little; combined, they synergize (+17.7 pp, $p < 0.001$) and reduce variance ($\approx 25\%$). In prompt-only settings on frontier models (Q1--Q100; non-comparable), directed reasoning improved GPT-4o by +11.8 pp to 85.3% and Gemini-2.5-Pro by +5.0 pp to 93.3% (baseline 88.3%, $n = 100$); see Section~5. TL;DR. A 3.8B model achieves GPT-4o-level FACTS accuracy (equivalent within $\pm 5$ pp on Q1--Q500). Cloud pricing shows $\approx 19\times$ lower cost versus GPT-4o, and self-hosted/edge deployments can approach zero marginal cost. Pricing sources are listed in Appendix E. Frontier prompt-only gains (Q1--Q100; non-comparable) and optimized-prompt exploratory results under earlier judges are summarized in Appendix F. Keywords: Small Language Models, Factual Grounding, Directed Reasoning, Fine-Tuning, Model Alignment, Cost-Efficient AI
Generative pre-trained language models (GPLMs) like ChatGPT encode in the model's parameters knowledge the models observe during the pre-training phase. This knowledge is then used at inference to address the task specified by the user in their prompt. For example, for the question-answering task, the GPLMs leverage the knowledge and linguistic patterns learned at training to produce an answer to a user question. Aside from the knowledge encoded in the model itself, answers produced by GPLMs can also leverage knowledge provided in the prompts. For example, a GPLM can be integrated into a retrieve-then-generate paradigm where a search engine is used to retrieve documents relevant to the question; the content of the documents is then transferred to the GPLM via the prompt. In this paper we study the differences in answer correctness generated by ChatGPT when leveraging the model's knowledge alone vs. in combination with the prompt knowledge. We study this in the context of consumers seeking health advice from the model. Aside from measuring the effectiveness of ChatGPT in this context, we show that the knowledge passed in the prompt can overturn the knowledge encoded in the model and this is, in our experiments, to the detriment of answer correctness. This work has important implications for the development of more robust and transparent question-answering systems based on generative pre-trained language models.
We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce \LSLORA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network ($\ell\in\{23\text{-}31\}$) while RoPE-influential layers dominate the early network ($\ell\in\{0\text{-}9\}$), yielding Spearman $r_s = -0.735$ ($p = 1.66\times10^{-6}$). Despite this anti-localization, a 4-way cross-layer ablation shows that applying both interventions to the sensitivity-identified layers outperforms all alternative configurations by 4-16 percentage points across six diverse benchmarks (MMLU, GPQA, HumanEval+, MATH, MGSM, ARC), approaching Claude 3.5 Haiku on HumanEval+ (67.1% vs. 68.3%) at \$100 total compute cost.