Comparison between multimodal foundation models and radiologists for the diagnosis of challenging neuroradiology cases with text and images
Cyril Bruge, Najib Chalhoub, Victor Chaton, Edouard De Sousa, Yann Gaillandre, Riyad Hanafi, Matthieu Masy, Quentin Vannod-Michel, Aghiles Hamroun, Grégory Kuchcinski, on behalf of the ARIANES investigators
Highlights
• Multimodal models (GPT-4o and Gemini 1.5 Pro) outperform neuroradiologists in suggesting diagnoses from clinical context alone (34.0 % and 44.7 % vs. 16.4 %, respectively; P < 0.01).
• Neuroradiologists outperform multimodal models (GPT-4o and Gemini 1.5 Pro) using images alone (42.0 % vs. 3.8 % and 7.5 %; P < 0.01) and images and text combined (48.0 % vs. 34.0 % and 38.7 %; P < 0.001).
• The multimodal models have limitations in identifying abnormal findings, with frequent hallucinations, and fail to effectively integrate multimodal inputs.
• Neuroradiologists improve their accuracy with the assistance of Gemini 1.5 Pro, from 47.2 % to 56.0 % (P < 0.01).
Abstract
Purpose
The purpose of this study was to compare the ability of two multimodal models (GPT-4o and Gemini 1.5 Pro) with that of radiologists to generate differential diagnoses from textual context alone, key images alone, or a combination of both using complex neuroradiology cases.
Materials and methods
This retrospective study included neuroradiology cases from the "Diagnosis Please" series published in the journal Radiology between January 2008 and September 2024. The two multimodal models were asked to provide three differential diagnoses from the textual context alone, the key images alone, or the complete case. Six board-certified neuroradiologists solved the cases under the same conditions and were randomly assigned to one of two groups (clinical context first or images first). Three radiologists solved the cases without, and then with, the assistance of Gemini 1.5 Pro. An independent radiologist evaluated the quality of the image descriptions provided by GPT-4o and Gemini 1.5 Pro for each case. Differences in correct answers between multimodal models and radiologists were analyzed using the McNemar test.
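The McNemar test compares two paired proportions by considering only the discordant cases, i.e., those solved by one reader but not the other. As an illustration, the following is a minimal sketch in Python of such a paired comparison, assuming hypothetical per-case correctness indicators; all variable names and generated values are illustrative and are not data from the study.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical binary outcomes over the 53 cases:
# 1 = correct diagnosis listed among the three proposals, 0 = not.
# (Randomly generated here purely for illustration.)
rng = np.random.default_rng(0)
radiologist_correct = rng.integers(0, 2, size=53)
model_correct = rng.integers(0, 2, size=53)

# Paired 2x2 contingency table:
# rows = radiologist correct/incorrect, columns = model correct/incorrect.
table = [
    [np.sum((radiologist_correct == 1) & (model_correct == 1)),
     np.sum((radiologist_correct == 1) & (model_correct == 0))],
    [np.sum((radiologist_correct == 0) & (model_correct == 1)),
     np.sum((radiologist_correct == 0) & (model_correct == 0))],
]

# The exact McNemar test uses only the discordant pairs (off-diagonal cells).
result = mcnemar(table, exact=True)
print(f"statistic = {result.statistic}, p-value = {result.pvalue:.4f}")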
Results
GPT-4o and Gemini 1.5 Pro outperformed radiologists using clinical context alone (mean accuracy, 34.0 % [18/53] and 44.7 % [23.7/53] vs. 16.4 % [8.7/53]; both P < 0.01). Radiologists outperformed GPT-4o and Gemini 1.5 Pro using images alone (mean accuracy, 42.0 % [22.3/53] vs. 3.8 % [2/53] and 7.5 % [4/53]; both P < 0.01) and the complete cases (48.0 % [25.6/53] vs. 34.0 % [18/53] and 38.7 % [20.3/53]; both P < 0.001). While radiologists improved their accuracy when combining multimodal information (from 42.1 % [22.3/53] for images alone to 50.3 % [26.7/53] for complete cases; P < 0.01), GPT-4o and Gemini 1.5 Pro did not benefit from the multimodal context (GPT-4o: from 34.0 % [18/53] for text alone to 35.2 % [18.7/53] for complete cases, P = 0.48; Gemini 1.5 Pro: from 44.7 % [23.7/53] to 42.8 % [22.7/53], P = 0.54). Radiologists benefited significantly from the suggestions of Gemini 1.5 Pro, increasing their accuracy from 47.2 % [25/53] to 56.0 % [27/53] (P < 0.01). Both GPT-4o and Gemini 1.5 Pro correctly identified the imaging modality in 53/53 (100 %) and 51/53 (96.2 %) cases, respectively, but frequently failed to identify the key imaging findings (incorrect identification in 43/53 cases [81.1 %] for GPT-4o and 50/53 cases [94.3 %] for Gemini 1.5 Pro).
Conclusion
Radiologists show a specific ability to benefit from the integration of textual and visual information, whereas multimodal models mostly rely on the clinical context to suggest diagnoses.
Keywords: Artificial intelligence, ChatGPT, Gemini, Large language models, Multimodal models
Abbreviations: CT: computed tomography; GPT: generative pre-trained transformer; LLM: large language model; MRI: magnetic resonance imaging; SD: standard deviation
Vol 106 - N° 10, P. 345-352 - October 2025
