Benjamin M. Mervak

CLEAR: A Clinically-Grounded Tabular Framework for Radiology Report Evaluation

May 22, 2025

Yuyang Jiang, Chacha Chen, Shengyuan Wang, Feng Li, Zecong Tang, Benjamin M. Mervak, Lydia Chelala, Christopher M Straus, Reve Chahine, Samuel G. Armato III(+1 more)

Abstract: Existing metrics often lack the granularity and interpretability to capture nuanced clinical differences between candidate and ground-truth radiology reports, resulting in suboptimal evaluation. We introduce a Clinically-grounded tabular framework with Expert-curated labels and Attribute-level comparison for Radiology report evaluation (CLEAR). CLEAR not only examines whether a report accurately identifies the presence or absence of medical conditions, but also assesses whether it precisely describes each positively identified condition across five key attributes: first occurrence, change, severity, descriptive location, and recommendation. Compared with prior work, CLEAR's multi-dimensional, attribute-level outputs enable a more comprehensive and clinically interpretable evaluation of report quality. Additionally, to measure the clinical alignment of CLEAR, we collaborate with five board-certified radiologists to develop CLEAR-Bench, a dataset of 100 chest X-ray reports from MIMIC-CXR, annotated across 6 curated attributes and 13 CheXpert conditions. Our experiments show that CLEAR achieves high accuracy in extracting clinical attributes and provides automated metrics that are strongly aligned with clinical judgment.

* 18 pages, 4 figures
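
The two-level comparison described in the abstract (condition presence first, then attributes of positively identified conditions) can be illustrated with a minimal sketch. This is not the CLEAR implementation: the `ConditionEntry` type, the `compare_reports` function, the exact-match scoring, and the choice to score attributes only for conditions marked present in both reports are all assumptions made for illustration. Only the five attribute names come from the abstract.

```python
from dataclasses import dataclass, field

# The five attributes named in the abstract; everything else here is hypothetical.
ATTRIBUTES = ("first_occurrence", "change", "severity",
              "descriptive_location", "recommendation")


@dataclass
class ConditionEntry:
    """One row of a hypothetical report table: a condition and its attributes."""
    present: bool
    attributes: dict = field(default_factory=dict)


def compare_reports(candidate, reference):
    """Return (presence accuracy, per-attribute exact-match rate).

    Attributes are scored only for conditions marked present in BOTH reports
    (an assumption of this sketch). A missing attribute is treated as None,
    so two missing values count as agreement.
    """
    conditions = sorted(set(candidate) | set(reference))
    absent = ConditionEntry(present=False)
    presence_hits = 0
    attr_hits = {a: 0 for a in ATTRIBUTES}
    attr_total = 0
    for cond in conditions:
        cand = candidate.get(cond, absent)
        ref = reference.get(cond, absent)
        if cand.present == ref.present:
            presence_hits += 1
        if cand.present and ref.present:
            attr_total += 1
            for a in ATTRIBUTES:
                if cand.attributes.get(a) == ref.attributes.get(a):
                    attr_hits[a] += 1
    presence_acc = presence_hits / len(conditions) if conditions else 1.0
    attr_acc = {a: (attr_hits[a] / attr_total if attr_total else None)
                for a in ATTRIBUTES}
    return presence_acc, attr_acc


# Toy example with made-up values.
reference = {
    "Edema": ConditionEntry(True, {"severity": "mild", "change": "worsening"}),
    "Pneumothorax": ConditionEntry(False),
}
candidate = {
    "Edema": ConditionEntry(True, {"severity": "mild", "change": "stable"}),
    "Pneumothorax": ConditionEntry(True),
}
presence_acc, attr_acc = compare_reports(candidate, reference)
```

In this toy case the candidate agrees on edema but wrongly reports a pneumothorax, and it matches the reference on severity but not on change, so the attribute-level output pinpoints where the report diverges rather than collapsing everything into one score.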

GPT-4V Cannot Generate Radiology Reports Yet

Jul 16, 2024

Yuyang Jiang, Chacha Chen, Dang Nguyen, Benjamin M. Mervak, Chenhao Tan

Abstract: GPT-4V's purported strong multimodal abilities have raised interest in using it to automate radiology report writing, but thorough evaluations are lacking. In this work, we perform a systematic evaluation of GPT-4V in generating radiology reports on two chest X-ray report datasets: MIMIC-CXR and IU X-Ray. We attempt to directly generate reports using GPT-4V through different prompting strategies and find that it fails terribly on both lexical metrics and clinical efficacy metrics. To understand the low performance, we decompose the task into two steps: 1) the medical image reasoning step of predicting medical condition labels from images; and 2) the report synthesis step of generating reports from (ground-truth) conditions. We show that GPT-4V's performance in image reasoning is consistently low across different prompts. In fact, the distributions of model-predicted labels remain constant regardless of which ground-truth conditions are present in the image, suggesting that the model is not interpreting chest X-rays meaningfully. Even when given ground-truth conditions in report synthesis, its generated reports are less correct and less natural-sounding than those of a finetuned LLaMA-2. Altogether, our findings cast doubt on the viability of using GPT-4V in a radiology workflow.

* 24 pages, 3 figures, code: https://github.com/YuyangJ0/GPT-4V-evaluation-radiology-report
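
The observation that predicted labels do not vary with the ground truth can be checked with a simple conditional rate. The sketch below is not taken from the linked repository: the function name `positive_rate_by_ground_truth`, the set-of-labels data layout, and the toy data are assumptions for illustration only.

```python
from collections import Counter


def positive_rate_by_ground_truth(predictions, ground_truth, condition):
    """How often `condition` is predicted, split by whether it is truly present.

    `predictions` and `ground_truth` are parallel lists of label sets, one per
    image (a hypothetical layout). Returns a dict keyed by True/False
    (condition present / absent in the ground truth).
    """
    predicted = Counter()
    totals = Counter()
    for pred, truth in zip(predictions, ground_truth):
        is_present = condition in truth
        totals[is_present] += 1
        predicted[is_present] += int(condition in pred)
    return {key: predicted[key] / totals[key] for key in totals}


# Toy data: the model predicts "Edema" equally often whether or not it is there.
ground_truth = [{"Edema"}, {"Edema"}, set(), set()]
predictions = [{"Edema"}, set(), {"Edema"}, set()]
rates = positive_rate_by_ground_truth(predictions, ground_truth, "Edema")
```

If the rate under `True` is close to the rate under `False`, the predictions carry little information about the image, which is the pattern the abstract reports.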
