Publication: Large Language Models for Automated Evaluation of Radiology Reports with Fine-Grained Scoring
Abstract
The current gold standard for evaluating generated chest X-ray (CXR) reports is radiologist annotation. However, this process can be extremely time-consuming, especially when there are many reports to evaluate. In this work, we present FineRadScore, a Large Language Model (LLM)-based automated evaluation metric for generated CXR reports. Given a candidate report and a ground-truth report, FineRadScore gives the minimum number of line-by-line corrections required to go from the candidate to the ground-truth report. Additionally, FineRadScore assigns a severity rating to each correction and generates a comment explaining why the correction was needed. We demonstrate that FineRadScore generates corrections in a way that aligns with radiologists and reflects how clinically meaningful each error is. We also demonstrate that, when used to judge the quality of a report as a whole, it aligns with radiologists at a level similar to current state-of-the-art automated CXR evaluation metrics. Finally, we analyze FineRadScore's shortcomings to pave the way for future work.
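
To make the shape of the output described in the abstract concrete, the following is a minimal sketch of how a set of line-by-line corrections with severity ratings and comments could be represented and rolled up into a report-level number. The `Correction` class, the integer severity scale, and the summed `report_score` are illustrative assumptions, not the paper's actual schema or aggregation method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Correction:
    """One line-level correction (hypothetical schema, not the paper's)."""
    line_index: int      # which line of the candidate report is corrected
    corrected_line: str  # replacement text matching the ground-truth report
    severity: int        # assumed scale: 1 (minor) up to 4 (most severe)
    comment: str         # why the correction was needed


def report_score(corrections: List[Correction]) -> int:
    """Assumed aggregation: sum of severities; 0 means no corrections needed."""
    return sum(c.severity for c in corrections)


# Illustrative, made-up corrections for a single candidate report.
corrections = [
    Correction(0, "Small left pleural effusion.", 3,
               "Candidate omitted the effusion."),
    Correction(2, "No pneumothorax.", 1,
               "Wording differs with no clinical impact."),
]

total = report_score(corrections)
```

Under this assumed scheme a lower total indicates a candidate closer to the ground truth. Other aggregations, such as taking the maximum severity, would be equally plausible readings of the abstract.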