The objective of this study is to evaluate the capability of a large language model (ChatGPT 4o) to generate a clinically reliable radiology report based on standardized orbital MRI images. The focus is on free-text report generation to simulate a real-world reporting activity. Subsequently, each LLM-generated report is compared with the standard-of-care report composed by a board-certified radiologist with expertise in orbital pathology. Validated computational metrics are employed to evaluate the textual correspondence between LLM-generated and radiologist-generated reports, as well as the validity of the extracted features. In this study, the authors aim to define both the capabilities and the critical limitations of transformer-based models in clinical practice, outlining a working framework applicable not only to orbital pathology but generalizable to other anatomical regions.
The aim of this pilot study is to evaluate the fidelity and clinical relevance of radiology reports generated by an LLM. The study design consists of a structured comparison between human-authored reports and reports generated by ChatGPT 4o, using metrics validated for LLMs that assess textual correspondence, semantics, and content. While the LLM accounts only for the final step of generating the written report, the accuracy of the report indirectly evaluates a stepwise computer-vision flow that starts with visual pattern extraction using convolutional neural networks (CNN), specifically the 2D U-Net architecture, which is particularly suitable for biomedical image segmentation owing to its symmetric encoder-decoder structure with skip connections that preserve fine spatial details during downsampling and upsampling. Following segmentation, vision transformers (ViT) are used to encode the features of each lesion, such as size, location, contrast enhancement, and anatomical boundaries, into structured tokens. The LLM was tuned on radiology reports to generate a clinically coherent narrative incorporating domain-specific terminology. The sum of these steps determines the accuracy of the AI-generated report, which therefore measures not only the LLM performance but also the preceding computer-vision processing steps.
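To make the encoder-decoder idea referred to above concrete, the following is a minimal 2D U-Net sketch in PyTorch; depth and channel counts are illustrative only and do not correspond to the configuration of any specific tool used in this study.

```python
# Minimal 2D U-Net sketch (PyTorch), illustrating the symmetric encoder-decoder
# with skip connections described above. Channel sizes and depth are illustrative.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, the basic U-Net building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class UNet2D(nn.Module):
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.bottleneck = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)           # 64 (skip) + 64 (upsampled)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)             # 32 (skip) + 32 (upsampled)
        self.head = nn.Conv2d(32, n_classes, 1)    # per-pixel class logits

    def forward(self, x):
        e1 = self.enc1(x)                           # fine-detail features
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)

# Example: a single-channel 256 x 256 MRI slice -> (1, 2, 256, 256) logits.
logits = UNet2D()(torch.zeros(1, 1, 256, 256))
```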
This evaluation was performed by screening the clinical databases of two hub departments of maxillofacial surgery in Northern Italy, namely the Academic Hospital of the University of Udine and the San Paolo Hospital of the University of Milan. Data eligible for this study included MR scans of orbital lesions with appropriate acquisition quality and absence of artifacts, acquired as T1- or T2-weighted sequences, either contrast-enhanced or unenhanced, and accompanied by a radiology report written according to standardized guidelines.
All procedures were performed in accordance with the relevant institutional guidelines and regulations, and the study adhered to the principles outlined in the Declaration of Helsinki. Ethical approval was granted by the Institutional Review Board of the University of Udine (approval number: IRB_45_2020). Written informed consent was obtained from all participants or their legal guardians. All radiological data were anonymized prior to analysis.
Although MR data were originally acquired and processed in volumetric NIfTI format to preserve spatial integrity and quantitative voxel information, the subsequent integration into the LLM was performed using selected two-dimensional (2D) key screen captures. The decision not to feed volumetric radiological data to the LLM was driven by both technical limitations inherent to current LLM architectures and pragmatic considerations regarding standardization of interpretation. Using screenshots was deemed a more standardizable approach across cases, reducing the variability introduced by different slice thicknesses, acquisition parameters, or specific artifacts that may affect volumetric data. Moreover, the computer vision models integrated in GPT-4o or Gemini, while multimodal and capable of processing visual inputs, are currently designed for 2D image understanding and cannot directly handle volumetric data in NIfTI or DICOM formats. A three-dimensional processing capability would require voxel-wise processing, sequential transformations, and complex spatial reasoning across multiple planes, which at present would demand a custom pipeline and integration with external tools such as MONAI Label (Linux Foundation AI), 3D Slicer (Brigham and Women's Hospital, USA), or nnU-Net. Figure 1 schematically illustrates the proposed analytical workflow.
Orbit MR datasets were exported from the institutional PACS archive as anonymized NIfTI files. For each patient, the region of interest (ROI) was centered on the lesion, and three orthogonal 2D slices (axial, coronal, sagittal) were extracted. In the case of isotropic 3D acquisitions (voxel size ≤ 1 mm³, typically volumetric VIBE sequences), multiplanar reconstructions were generated from the same dataset; otherwise, dedicated planar coronal and sagittal sequences were used. Using 3D Slicer (Brigham and Women's Hospital, Boston, Massachusetts, USA), all images were resampled to a uniform voxel size of 1 × 1 × 1 mm using trilinear interpolation, then cropped around the ROI and padded with black borders to a standard size of 256 × 256 pixels. To reduce differences between MR scanners and sequences, intensity values were normalized (mean 0, standard deviation 1), and the final outputs were stored in PNG format for downstream analysis with transformer-based LLMs.
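For illustration, the sketch below reproduces these preprocessing steps (isotropic resampling with trilinear interpolation, z-score normalization, cropping/padding to 256 × 256, PNG export) with nibabel and NumPy; the study itself performed them in 3D Slicer, and the file name and ROI coordinates shown are hypothetical placeholders.

```python
# Sketch of the preprocessing described above; not the study's 3D Slicer workflow.
# "orbit.nii.gz" and the ROI coordinates are hypothetical placeholders.
import numpy as np
import nibabel as nib
from nibabel.processing import resample_to_output
from PIL import Image

img = nib.load("orbit.nii.gz")
iso = resample_to_output(img, voxel_sizes=(1.0, 1.0, 1.0), order=1)  # trilinear
vol = iso.get_fdata().astype(np.float32)

# Normalise intensities to mean 0 and standard deviation 1.
vol = (vol - vol.mean()) / (vol.std() + 1e-8)

def crop_pad_2d(sl, size=256):
    # Centre-crop or pad a 2D slice with dark borders to the target matrix size.
    out = np.full((size, size), sl.min(), dtype=sl.dtype)
    h, w = min(sl.shape[0], size), min(sl.shape[1], size)
    y0, x0 = (sl.shape[0] - h) // 2, (sl.shape[1] - w) // 2
    out[(size - h) // 2:(size - h) // 2 + h,
        (size - w) // 2:(size - w) // 2 + w] = sl[y0:y0 + h, x0:x0 + w]
    return out

cx, cy, cz = 128, 110, 90  # hypothetical lesion-centred ROI coordinates
views = {"sagittal": vol[cx, :, :], "coronal": vol[:, cy, :], "axial": vol[:, :, cz]}
for name, sl in views.items():
    sl = crop_pad_2d(sl)
    png = ((sl - sl.min()) / (np.ptp(sl) + 1e-8) * 255).astype(np.uint8)
    Image.fromarray(png).save(f"{name}.png")
```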
Only one MR sequence, either T1-weighted or T2-weighted, was selected for each case, depending on the individual features of the lesion. In general, masses with a static fluid content were studied on T2-weighted sequences. Vascular sequences, including time-of-flight (TOF) and 3D phase-contrast (PC) imaging, were excluded due to their selective enhancement of vascular structures, which limits the visualization of the surrounding anatomical landmarks. Similarly, diffusion-weighted imaging (DWI) was considered unsuitable given its inherently low spatial resolution and poor anatomical delineation.
Inclusion criteria were the following: presence of a soft tissue orbital lesion; a matrix of at least 256 × 256 voxels; visualization of at least two of the following anatomical structures: eye globe, optic nerve, inferior rectus, superior rectus, medial rectus, lateral rectus, superior oblique; and the possibility of identifying the lesion boundaries. Exclusion criteria were: low-quality images, images with artifacts, and lesions whose localization prevented the visualization of at least two of the aforementioned structures. The study did not include MR examinations of patients for whom a report from a board-certified radiologist was not available. Radiology reports were translated into English using DeepL (DeepL SE, Köln, Germany) and checked by a native English speaker to decrease potential biases.
Textual prompts for ChatGPT-4o followed a standardized input structure that can be summarized in the following points: (1) instructions directing the LLM to act as a board-certified radiologist and to avoid statements unsupported by image analysis; (2) an explicit reminder to apply the radiological convention of the flipped side so as to correctly locate the side of disease; (3) case-specific data, including the definition of the standardized 2D MRI views (axial, coronal, sagittal), sequence type, presence of contrast enhancement, and a brief clinical context; and (4) output constraints, requiring a structured report with sections on anatomical relationships, lesion characteristics in terms of radiological features (such as contrast enhancement and regional hypo- or hyperintensity), and a conclusive diagnostic impression. The same structure was used across all prompts for each case.
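Purely as an illustration of this four-part structure, a prompt skeleton is sketched below in Python; the wording and placeholder values are hypothetical and do not reproduce the study's verbatim prompts.

```python
# Illustrative prompt skeleton following the four-part structure described above.
# Wording and placeholder values are hypothetical, not the study's exact prompts.
PROMPT_TEMPLATE = """\
(1) Role and constraints: Act as a board-certified radiologist. Do not make
    statements that are not supported by the provided images.
(2) Orientation: Apply the radiological convention (left-right flipped display)
    when reporting the side of the disease.
(3) Case data: Views provided: axial, coronal, sagittal (2D, lesion-centred).
    Sequence: {sequence}. Contrast enhancement: {contrast}. Clinical context: {context}.
(4) Output format: Produce a structured report with sections on anatomical
    relationships, lesion characteristics (signal intensity, contrast enhancement),
    and a conclusive diagnostic impression.
"""

prompt = PROMPT_TEMPLATE.format(
    sequence="T2-weighted",          # hypothetical example values
    contrast="none (unenhanced)",
    context="adult patient with slowly progressive unilateral proptosis",
)
```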
For the purposes of this study, segmentation and feature extraction tasks were processed entirely end-to-end by GPT-4o's native multimodal architecture rather than through custom-built pipelines based on external libraries. For each case, the model received three lesion-centered, standardized MRI screenshots (axial, coronal, sagittal; 256 × 256 pixels) from T1- or T2-weighted sequences. Using its integrated visual encoder, GPT-4o performed direct parsing of the orbital anatomy, automatically segmenting key structures (globe, optic nerve, extraocular muscles, periorbital fat) and the lesion itself. According to the textual prompt, the model estimated lesion size, boundaries, signal intensity patterns, and anatomical relationships with nearby structures. These visual features were automatically tokenized by GPT-4o and then handled by the transformer-based language module to compose a structured textual radiology report, including sections on anatomy, lesion description, and diagnostic impression. This fully end-to-end workflow, relying exclusively on OpenAI's built-in computer vision and language capabilities without external pipelines, removed the need for separate CNNs, vision transformers, or external feature encoders, ensuring a uniform and reproducible process across all cases.
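As an illustration of this end-to-end submission, the sketch below assumes the OpenAI Python SDK chat-completions endpoint with base64-encoded PNG inputs; the model identifier, file names, and parameters are assumptions rather than the study's exact configuration.

```python
# Sketch of submitting the three standardized views plus the textual prompt to
# GPT-4o through the OpenAI Python SDK (chat completions with image inputs).
# Model name, file names, and parameters are assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_png(path):
    # Images are passed as base64-encoded data URLs.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

prompt = "Act as a board-certified radiologist..."  # e.g., the four-part template above
content = [{"type": "text", "text": prompt}]
for view in ("axial.png", "coronal.png", "sagittal.png"):
    content.append({"type": "image_url", "image_url": {"url": encode_png(view)}})

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
    temperature=0,  # deterministic output helps reproducibility across cases
)
report_text = response.choices[0].message.content
```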
For the LLM analysis, the performance of the ChatGPT 4o-generated reports was compared with that of reports written by a board-certified radiologist expert in orbital imaging. The evaluation was carried out across two levels: linguistic similarity between LLM- and human-generated reports, and recognition of medical concepts.
Linguistic similarity between LLM-generated and expert reports was first quantified using three validated natural language processing (NLP) metrics, which are briefly summarized as follows:
The radiological report structure and the recognition of medical concepts were assessed using different evaluation metrics, which included:
In addition to the automated evaluations, three further parameters reviewed by expert clinicians were developed: clinical completeness, diagnostic accuracy, and clinical utility.
To simplify the evaluation, these parameters were semantically categorized into the following classes: very low (0-0.3), low (0.3-0.5), moderate (0.5-0.7), and high (0.7-1). Each case was independently assessed by five maxillofacial surgeons for clinical completeness, diagnostic accuracy, and clinical utility. Inter-observer agreement was quantified using Cohen's kappa, which demonstrated substantial concordance (κ = 0.83, 0.87, and 0.84, respectively). Thus, no formal adjudication was performed, as the high level of agreement obviated the need for a third-observer consensus review.
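For reference, inter-rater agreement of this kind can be computed pairwise with scikit-learn's cohen_kappa_score, as in the sketch below; the rating vectors shown are hypothetical placeholders, not study data.

```python
# Sketch of the agreement analysis: pairwise Cohen's kappa on the four ordinal
# classes (very low / low / moderate / high). Ratings below are hypothetical.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

raters = {
    "rater_1": ["high", "moderate", "high", "low", "high"],
    "rater_2": ["high", "moderate", "high", "moderate", "high"],
    "rater_3": ["high", "high", "high", "low", "high"],
}

# Compute kappa for every pair of raters on the same set of cases.
for (a, ra), (b, rb) in combinations(raters.items(), 2):
    print(f"{a} vs {b}: kappa = {cohen_kappa_score(ra, rb):.2f}")
```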