Abstract: This study quantitatively evaluates the impact of MRI scanner magnetic field strength on the performance and generalizability of deep learning-based segmentation algorithms. Three publicly available MRI datasets (breast tumor, pancreas, and cervical spine) were stratified by scanner field strength (1.5T vs. 3.0T). For each segmentation task, three nnU-Net-based models were developed: a model trained on 1.5T data only (m-1.5T), a model trained on 3.0T data only (m-3.0T), and a model trained on pooled 1.5T and 3.0T data (m-combined). Each model was evaluated on both the 1.5T and the 3.0T validation set. Field-strength-dependent performance differences were investigated via Uniform Manifold Approximation and Projection (UMAP)-based clustering and radiomic analysis, covering 23 first-order and texture features. For breast tumor segmentation, m-3.0T (DSC: 0.494 [1.5T] and 0.433 [3.0T]) significantly outperformed m-1.5T (DSC: 0.411 [1.5T] and 0.289 [3.0T]) and m-combined (DSC: 0.373 [1.5T] and 0.268 [3.0T]) on both validation sets (p<0.0001). Pancreas segmentation showed similar trends: m-3.0T achieved the highest DSC (0.774 [1.5T], 0.840 [3.0T]), while m-1.5T underperformed significantly (p<0.0001). For the cervical spine, models performed best on same-field validation sets, with minimal cross-field performance degradation (DSC>0.92 for all comparisons). Radiomic analysis revealed moderate field-strength-dependent clustering in soft tissues (silhouette scores 0.23-0.29) but minimal separation in osseous structures (0.12). These results indicate that the magnetic field strength of the training data substantially influences the performance of deep learning-based segmentation models, particularly for soft-tissue structures (e.g., small lesions). This warrants consideration of magnetic field strength as a confounding factor in studies evaluating AI performance on MRI.
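As a minimal sketch of the evaluation pipeline described above, the snippet below computes a Dice similarity coefficient between binary masks and a UMAP-plus-silhouette separation score for field-strength labels on radiomic feature vectors. The function names (`dice_score`, `field_strength_separation`), the UMAP parameters, and the toy data are illustrative assumptions, not the study's actual code or configuration.

```python
import numpy as np
from sklearn.metrics import silhouette_score
import umap  # from the umap-learn package

def dice_score(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient (DSC) between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    return 2.0 * np.logical_and(pred, truth).sum() / denom if denom else 1.0

def field_strength_separation(features: np.ndarray, field_labels: np.ndarray) -> float:
    """Embed radiomic feature vectors with UMAP, then score how cleanly
    1.5T vs. 3.0T cases cluster apart via the silhouette coefficient."""
    embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(features)
    return silhouette_score(embedding, field_labels)

# Toy example: 200 cases x 23 radiomic features with random field labels
# (real data would show the soft-tissue vs. osseous contrast reported above).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 23))
labels = rng.integers(0, 2, size=200)  # 0 = 1.5T, 1 = 3.0T
print(field_strength_separation(X, labels))
```

With random features the silhouette score sits near zero; values around 0.23-0.29, as reported for soft tissues, would indicate moderate field-strength-dependent clustering.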




Abstract: Large Language Models (LLMs) have benefited enormously from scaling, yet these gains are bounded by five fundamental limitations: (1) hallucination, (2) context compression, (3) reasoning degradation, (4) retrieval fragility, and (5) multimodal misalignment. While existing surveys describe these phenomena empirically, they lack a rigorous theoretical synthesis connecting them to the foundational limits of computation, information, and learning. This work closes that gap by presenting a unified, proof-informed framework that formalizes the innate theoretical ceilings of LLM scaling. First, computability and uncomputability imply an irreducible residue of error: for any computably enumerable model family, diagonalization guarantees inputs on which some model must fail, and undecidable queries (e.g., halting-style tasks) induce infinite failure sets for all computable predictors. Second, information-theoretic and statistical constraints bound attainable accuracy even on decidable tasks: finite description length enforces compression error, and long-tail factual knowledge requires prohibitive sample complexity. Third, geometric and computational effects compress long contexts far below their nominal size due to positional under-training, encoding attenuation, and softmax crowding. We further show how likelihood-based training favors pattern completion over inference, how retrieval under token limits suffers from semantic drift and coupling noise, and how multimodal scaling inherits shallow cross-modal alignment. Across sections, we pair theorems with empirical evidence to outline where scaling helps, where it saturates, and where it cannot progress, providing both theoretical foundations and practical mitigation paths such as bounded-oracle retrieval, positional curricula, and sparse or hierarchical attention.
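To make the softmax-crowding effect mentioned above concrete, here is a minimal numpy sketch, not taken from the paper: when attention logits over n keys differ only by bounded noise, the largest softmax weight any single token can claim decays roughly like 1/n, so individual tokens' contributions are diluted as the context grows. The function name and the noise model are illustrative assumptions.

```python
import numpy as np

def max_attention_weight(n: int, logit_spread: float = 1.0, seed: int = 0) -> float:
    """Largest softmax weight over n keys whose logits differ only by
    i.i.d. Gaussian noise of scale `logit_spread` (near-uniform attention)."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(scale=logit_spread, size=n)
    weights = np.exp(logits - logits.max())  # stable softmax
    weights /= weights.sum()
    return float(weights.max())

# As the context grows, the softmax normalizer sums over more keys,
# crowding out any single token's share of the attention mass.
for n in (128, 1024, 8192, 65536):
    print(f"n={n:6d}  max weight={max_attention_weight(n):.6f}")
```

Under this toy model the maximum weight shrinks steadily with n, which is one way to see why long contexts are effectively compressed far below their nominal size.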