Abstract:We present NoReGeo, a novel benchmark designed to evaluate the intrinsic geometric understanding of large language models (LLMs) without relying on reasoning or algebraic computation. Unlike existing benchmarks, which primarily assess models' proficiency in reasoning-based geometry (where solutions are derived using algebraic methods), NoReGeo evaluates whether LLMs can inherently encode spatial relationships and recognize geometric properties directly. Our benchmark comprises 2,500 trivial geometric problems spanning 25 categories, each carefully crafted to be solvable purely through native geometric understanding, assuming known object locations. We assess a range of state-of-the-art models on NoReGeo, including frontier models like GPT-4, and observe that even the most advanced systems reach at most 65% overall accuracy in binary classification tasks. Further, our ablation experiments demonstrate that such geometric understanding does not emerge through fine-tuning alone, indicating that effective training for geometric comprehension requires a specialized approach from the outset. Our findings highlight a significant gap in current LLMs' ability to natively grasp geometric concepts, providing a foundation for future research toward models with true geometric cognition.




Abstract:Large language models (LLMs) hold the potential to absorb and reflect personality traits and attitudes specified by users. In our study, we investigated this potential using robust psychometric measures. We adapted the most studied test in the psychological literature, the Minnesota Multiphasic Personality Inventory (MMPI), and examined LLMs' behavior to identify traits. To assess the sensitivity of LLMs to prompts and psychological biases, we created personality-oriented prompts, crafting a detailed set of personas that vary in trait intensity. This enables us to measure how well LLMs follow these roles. Our study introduces MindShift, a benchmark for evaluating LLMs' psychological adaptability. The results highlight a consistent improvement in LLMs' role perception, attributed to advancements in training datasets and alignment techniques. Additionally, we observe significant differences in responses to psychometric assessments across different model types and families, suggesting variability in their ability to emulate human-like personality traits. MindShift prompts and code for LLM evaluation will be publicly available.
Abstract:Detecting out-of-distribution (OOD) samples for trusted medical image segmentation remains a significant challenge. The critical issue here is the lack of a strict definition of abnormal data, which often results in artificial problem settings without measurable clinical impact. In this paper, we redesign the OOD detection problem according to the specifics of volumetric medical imaging and related downstream tasks (e.g., segmentation). We propose using the downstream model's performance as a pseudometric between images to define abnormal samples. This approach enables us to weigh different samples based on their performance impact without an explicit ID/OOD distinction. We incorporate this weighting in a new metric called Expected Performance Drop (EPD). EPD is our core contribution to the new problem design, allowing us to rank methods based on their clinical impact. We demonstrate the effectiveness of EPD-based evaluation in 11 CT and MRI OOD detection challenges.




Abstract:Deep Learning models perform unreliably when the data comes from a distribution different from the training one. In critical applications such as medical imaging, out-of-distribution (OOD) detection methods help to identify such data samples, preventing erroneous predictions. In this paper, we further investigate the effectiveness of OOD detection when applied to 3D medical image segmentation. We design several OOD challenges representing clinically occurring cases and show that none of the existing methods achieves acceptable performance. Methods not dedicated to segmentation fail severely in the designed setups; their best mean false positive rate at 95% true positive rate (FPR) is 0.59. Segmentation-dedicated methods still achieve suboptimal performance, with a best mean FPR of 0.31 (lower is better). To demonstrate this suboptimality, we develop a simple method called Intensity Histogram Features (IHF), which performs comparably or better in the same challenges, with a mean FPR of 0.25. Our findings highlight the limitations of the existing OOD detection methods on 3D medical images and present a promising avenue for improving them. To facilitate research in this area, we release the designed challenges as a publicly available benchmark and formulate practical criteria to test OOD detection generalization beyond the suggested benchmark. We also propose IHF as a solid baseline against which to compare emerging methods.
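
The metric reported above, false positive rate at 95% true positive rate, can be illustrated with a minimal sketch. It assumes that OOD is the positive class and that a higher score means "more anomalous"; the function name `fpr_at_tpr` and the tie-handling rule are illustrative choices, not taken from the paper.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of in-distribution (ID) samples flagged as OOD at the
    threshold that flags at least `tpr` of the true OOD samples.

    Assumption: higher score = more anomalous, OOD is the positive class.
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(ood_scores, reverse=True)
    # Number of OOD samples that must be flagged to reach the target TPR
    # (small epsilon guards against floating-point overshoot in the product).
    k = max(1, math.ceil(tpr * len(ranked) - 1e-9))
    threshold = ranked[k - 1]
    # An ID sample scoring at or above the threshold is a false positive.
    false_positives = sum(1 for s in id_scores if s >= threshold)
    return false_positives / len(id_scores)


if __name__ == "__main__":
    id_scores = [0.1, 0.2, 0.65, 0.3]
    ood_scores = [0.9, 0.8, 0.7, 0.6]
    print(fpr_at_tpr(id_scores, ood_scores))
```

With these toy scores, all four OOD samples must be flagged to reach 95% TPR, so the threshold falls at 0.6 and one of the four ID samples is a false positive.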




Abstract:Deep Learning (DL) models tend to perform poorly when the data comes from a distribution different from the training one. In critical applications such as medical imaging, out-of-distribution (OOD) detection helps to identify such data samples, increasing the model's reliability. Recent works have developed DL-based OOD detection that achieves promising results on 2D medical images. However, scaling most of these approaches to 3D images is computationally intractable. Furthermore, the current 3D solutions struggle to achieve acceptable results in detecting even synthetic OOD samples. Such limited performance might indicate that DL often embeds large volumetric images inefficiently. We argue that the intensity histogram of the original CT or MRI scan, used as an embedding, is descriptive enough to run OOD detection. Therefore, we propose a histogram-based method that requires no DL and achieves almost perfect results in this domain. Two results support our proposal. First, we evaluate performance on publicly available datasets, where our method scores 1.0 AUROC in most setups. Second, we rank second in the Medical Out-of-Distribution challenge without fine-tuning or exploiting task-specific knowledge. After carefully discussing the limitations, we conclude that our method solves sample-level OOD detection on 3D medical images in the current setting.
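
The idea of using an intensity histogram as an embedding can be sketched as follows. This is only an illustration under stated assumptions, not the authors' method: the fixed intensity range, the bin count, and the scoring rule (L1 distance to the nearest training histogram) are all hypothetical choices, as are the function names.

```python
def intensity_histogram(voxels, lo, hi, n_bins=8):
    """Normalized histogram of voxel intensities over the range [lo, hi].

    Values outside the range are clipped into the first or last bin.
    """
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    total = 0
    for v in voxels:
        clipped = min(max(v, lo), hi)
        idx = min(int((clipped - lo) / width), n_bins - 1)
        counts[idx] += 1
        total += 1
    if total == 0:
        raise ValueError("volume has no voxels")
    return [c / total for c in counts]


def ood_score(hist, train_hists):
    """Hypothetical scoring rule: L1 distance to the closest training
    histogram. A larger value means the scan looks less like the training data.
    """
    return min(
        sum(abs(a - b) for a, b in zip(hist, t))
        for t in train_hists
    )


if __name__ == "__main__":
    # Toy "volumes" given as flat lists of intensities.
    train = [intensity_histogram([10, 20, 30, 40], lo=0, hi=400)]
    similar = intensity_histogram([5, 15, 25, 45], lo=0, hi=400)
    shifted = intensity_histogram([350, 360, 370, 390], lo=0, hi=400)
    print(ood_score(similar, train), ood_score(shifted, train))
```

A real volume would be flattened before being passed in. The appeal of this kind of embedding is that it needs no trained network and its cost grows only linearly with the number of voxels.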




Abstract:When applying a Deep Learning model to medical images, it is crucial to estimate the model uncertainty. Voxel-wise uncertainty is a useful visual marker for human experts and could be used to improve the model's voxel-wise output, such as segmentation. Moreover, uncertainty provides a solid foundation for out-of-distribution (OOD) detection, improving the model performance at the image level. However, one of the frequent tasks in medical imaging is the segmentation of distinct, local structures such as tumors or lesions. Here, structure-wise uncertainty allows operations that are more precise than image-wise ones and more semantically aware than voxel-wise ones. How to produce uncertainty for individual structures remains poorly explored. We propose a framework to measure structure-wise uncertainty and evaluate the impact of OOD data on the model performance. With it, we identify the best uncertainty estimation (UE) method for improving segmentation quality. The proposed framework is tested on three datasets with the tumor segmentation task: LIDC-IDRI, LiTS, and a private one with multiple brain metastases cases.