Ben Schaper

Measuring and Aligning Abstraction in Vision-Language Models with Medical Taxonomies

Jan 21, 2026

Ben Schaper, Maxime Di Folco, Bernhard Kainz, Julia A. Schnabel, Cosmin I. Bercea

Abstract: Vision-Language Models show strong zero-shot performance for chest X-ray classification, but standard flat metrics fail to distinguish between clinically minor and severe errors. This work investigates how to quantify and mitigate abstraction errors by leveraging medical taxonomies. We benchmark several state-of-the-art VLMs using hierarchical metrics and introduce Catastrophic Abstraction Errors to capture cross-branch mistakes. Our results reveal substantial misalignment of VLMs with clinical taxonomies despite high flat performance. To address this, we propose risk-constrained thresholding and taxonomy-aware fine-tuning with radial embeddings, which reduce severe abstraction errors to below 2 per cent while maintaining competitive performance. These findings highlight the importance of hierarchical evaluation and representation-level alignment for safer and more clinically meaningful deployment of VLMs.
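To make the idea of a cross-branch mistake concrete, here is a minimal sketch, not the authors' implementation. The abstract does not give the formal definition of Catastrophic Abstraction Errors, so this sketch assumes an error is "cross-branch" when the predicted and true labels fall under different top-level nodes of the taxonomy. The taxonomy, the label names, and the function names are all hypothetical.

```python
# Hypothetical toy taxonomy (child -> parent); labels are illustrative only
# and are not taken from the paper.
PARENT = {
    "lung finding": "root",
    "cardiac finding": "root",
    "pneumonia": "lung finding",
    "atelectasis": "lung finding",
    "cardiomegaly": "cardiac finding",
}


def ancestors(label):
    """Return the path from `label` up to, but excluding, the root."""
    path = [label]
    # Labels missing from PARENT are treated as direct children of the root.
    while PARENT.get(path[-1], "root") != "root":
        path.append(PARENT[path[-1]])
    return path


def top_branch(label):
    """Return the top-level node (direct child of the root) above `label`."""
    return ancestors(label)[-1]


def is_cross_branch_error(predicted, true):
    """Assumed definition: a wrong prediction that lands in another branch."""
    return predicted != true and top_branch(predicted) != top_branch(true)


def cross_branch_error_rate(pairs):
    """Fraction of (predicted, true) pairs that are cross-branch errors."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_cross_branch_error(p, t) for p, t in pairs) / len(pairs)
```

Under this assumed definition, confusing atelectasis with pneumonia stays within the same branch, while predicting cardiomegaly for a pneumonia case counts as a cross-branch error. A flat accuracy metric would score both mistakes identically.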

Towards Interpretable Summary Evaluation via Allocation of Contextual Embeddings to Reference Text Topics

Oct 25, 2022

Ben Schaper, Christopher Lohse, Marcell Streile, Andrea Giovannini, Richard Osuala

Abstract: Despite extensive recent advances in summary generation models, evaluation of auto-generated summaries still widely relies on single-score systems insufficient for transparent assessment and in-depth qualitative analysis. Towards bridging this gap, we propose the multifaceted interpretable summary evaluation method (MISEM), which is based on allocation of a summary's contextual token embeddings to semantic topics identified in the reference text. We further contribute an interpretability toolbox for automated summary evaluation and interactive visual analysis of summary scoring, topic identification, and token-topic allocation. MISEM achieves a promising 0.404 Pearson correlation with human judgment on the TAC'08 dataset.
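The token-topic allocation step can be illustrated with a small sketch. This is not the MISEM implementation: the abstract does not specify how topics are represented or how allocations are scored, so the sketch assumes each topic is a single centroid vector and each summary token is allocated to the topic with the highest cosine similarity. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def allocate_tokens(token_embeddings, topic_centroids):
    """Allocate each (token, embedding) pair to its most similar topic.

    Assumption: topics are given as a dict of name -> centroid vector.
    Returns a list of (token, topic, similarity) tuples.
    """
    allocation = []
    for token, embedding in token_embeddings:
        scores = {
            name: cosine(embedding, centroid)
            for name, centroid in topic_centroids.items()
        }
        best = max(scores, key=scores.get)
        allocation.append((token, best, scores[best]))
    return allocation


def topic_coverage(allocation):
    """Count how many summary tokens were allocated to each topic."""
    return Counter(topic for _, topic, _ in allocation)


# Toy example with hypothetical 2-D embeddings.
topics = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
tokens = [("x", [0.9, 0.1]), ("y", [0.2, 0.8]), ("z", [0.7, 0.3])]
result = allocate_tokens(tokens, topics)
coverage = topic_coverage(result)
```

Per-topic counts like these are one way a single summary score could be broken down by reference topic, which is the kind of inspection the abstract's interpretability toolbox is described as supporting.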

* 5 pages, 3 figures
