Bilingual comparison of the performance of GPT-4o and GPT-4 on ophthalmology residency examination questions - 04/10/25

Comparaison bilingue des performances des examens de résidence en ophtalmologie du GPT-4o et du GPT-4

DOI: 10.1016/j.jfo.2025.104650

E. Shvartz ^a,^1, L. Attal ^a,^1, O. Zur ^a, Z. Nujeidat ^b, G. Plopsky ^c,^d, D. Bahir ^a,^b,^⁎

^1 Elad Shvartz and Leah Attal contributed equally as first co-authors to this research.

^a Azrieli Faculty of Medicine, Bar Ilan University, Safed, Israel

^b Ophthalmology Department, Tzafon Medical Center, Poriya, Israel

^c Department of Ophthalmology, Samson Assuta Ashdod Hospital, Ashdod, Israel

^d Faculty of Health Sciences, Ben-Gurion University of the Negev, Israel

^⁎ Corresponding author. The Tzafon Medical Center, General government hospital 768, The Baruch Padeh Medical Center, Poriya M.P., The lower Galilee, 15208 Poriya, Israel.

Summary

Objective

To evaluate and compare the performance of GPT-4 and the newer GPT-4o in both English and French on ophthalmology board examination questions, assessing accuracy across various subspecialties and question formats, with a focus on image analysis.

Methods

A dataset of 600 multiple-choice questions from certification-level board examinations, covering 12 subspecialties and diverse content, was translated and tested in both English and French using GPT-4 and GPT-4o. Results were analyzed by examination year, question type, and type of image input. Performance of human residents from 2021–2023 was used for comparison. Statistical analyses, including χ² tests and odds ratio calculations, compared accuracy across models.

Results

GPT-4o in English achieved the highest accuracy (74.5%), approaching human resident performance, while its French counterpart scored 67.4%. GPT-4 scored 62.3% and 64.4% in English and French, respectively, both significantly lower than GPT-4o (P < 0.001). Text-based questions showed consistently higher accuracy across all models, with English GPT-4o leading at 82.5%. Image-based questions revealed similar performance for English and French GPT-4o, both outperforming the GPT-4 models.
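As a rough illustration of the odds ratio comparison mentioned in the Methods, the sketch below computes an unpaired odds ratio directly from the overall accuracies reported above. It is an assumption that this matches the authors' calculation: the study may have used a paired or stratified analysis, and the function names `odds` and `odds_ratio` are hypothetical. The χ² test is omitted because it requires the underlying counts of correct and incorrect answers, which the abstract does not report.

```python
def odds(p: float) -> float:
    """Convert a proportion in (0, 1) to odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    return p / (1.0 - p)


def odds_ratio(p_a: float, p_b: float) -> float:
    """Unpaired odds ratio of a correct answer, model A relative to model B."""
    return odds(p_a) / odds(p_b)


# Overall accuracies as reported in the Results (proportion answered correctly).
reported = {
    ("GPT-4o", "English"): 0.745,
    ("GPT-4o", "French"): 0.674,
    ("GPT-4", "English"): 0.623,
    ("GPT-4", "French"): 0.644,
}

# Compare GPT-4o against GPT-4 within each language.
or_english = odds_ratio(reported[("GPT-4o", "English")], reported[("GPT-4", "English")])
or_french = odds_ratio(reported[("GPT-4o", "French")], reported[("GPT-4", "French")])

print(f"English, GPT-4o vs GPT-4: odds ratio = {or_english:.2f}")
print(f"French,  GPT-4o vs GPT-4: odds ratio = {or_french:.2f}")
```

An odds ratio above 1 means GPT-4o had higher odds of a correct answer than GPT-4 in that language. These values are derived only from the rounded percentages in the abstract, so they need not agree exactly with any odds ratios given in the full text.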

Conclusions

GPT-4o outperforms GPT-4 in both English and French, underscoring its potential for ophthalmology use in both languages. While limitations remain, particularly in image-based diagnostics and language-specific nuances, these models are paving the way for a future where artificial intelligence supports and enhances human expertise in both education and patient care.

Résumé

Objectif

Évaluer et comparer les performances de GPT-4 et du nouveau GPT-4o en anglais et en français sur des questions tirées d’examens de certification en ophtalmologie pour l’internat, en analysant leur précision à travers différentes sous-spécialités et formats de questions, avec une attention particulière à l’analyse d’images.

Méthodes

Un ensemble de 600 questions à choix multiples issues d’examens de certification, couvrant 12 sous-spécialités et divers contenus, a été soigneusement traduit et testé en anglais et en français à l’aide de GPT-4 et GPT-4o, avec une analyse par année d’examen, type de question et différents types d’images évaluées, assurant une évaluation complète. Les performances des internes humains (2021–2023) ont servi de référence. Des analyses statistiques, incluant des tests du χ² et des calculs de rapport de cotes, ont permis de comparer la précision des modèles.

Résultats

GPT-4o en anglais a démontré la meilleure précision (74,5 %), se rapprochant des performances des internes humains, tandis que sa version française a atteint 67,4 %. GPT-4 a obtenu 62,3 % en anglais et 64,4 % en français, des résultats significativement inférieurs à GPT-4o (p < 0,001). Les questions textuelles ont montré une précision constamment plus élevée pour tous les modèles, GPT-4o en anglais étant en tête avec 82,5 %. Les questions accompagnées d’images ont révélé des performances similaires pour GPT-4o en anglais et en français, tous deux surpassant le modèle GPT-4.

Conclusions

GPT-4o surpasse GPT-4 en anglais et en français, soulignant son potentiel d’utilisation en ophtalmologie dans les deux langues. En dépit des limites persistantes, notamment dans les diagnostics basés sur les images et les nuances propres aux langues, ces modèles ouvrent la voie à un futur où l’intelligence artificielle soutient et améliore l’expertise humaine, tant dans l’éducation que dans le soin au patient.

Keywords: Large language models, GPT-4o, GPT-4, Ophthalmology, AI in medicine, Bilingual education

Mots clés : Grands modèles de langage, GPT-4o, GPT-4, Ophtalmologie, IA en médecine, Éducation bilingue

Vol 48 - No. 9

Article 104650 - November 2025
