Performance of Generative Large Language Models on Ophthalmology Board–Style Questions - 22/09/23

DOI: 10.1016/j.ajo.2023.05.024

Louis Z. Cai ^a,^1,^⁎, Abdulla Shaheen ^a,^1, Andrew Jin ^b, Riya Fukui ^c, Jonathan S. Yi ^a, Nicolas Yannuzzi ^a, Chrisfouad Alabiad ^a
^a From the Bascom Palmer Eye Institute, Miami, Florida, USA (L.Z.C., A.S., J.S.Y., N.Y., C.A.)
^b Yale Eye Center, New Haven, Connecticut, USA (A.J.)
^c Houston Rehabilitation Group, Houston, Texas, USA (R.F.)
^1 Louis Z. Cai and Abdulla Shaheen contributed equally as co-first authors.

^⁎ Inquiries to Louis Z. Cai, Bascom Palmer Eye Institute, Retina and Vitreous Diseases, 900 NW 17th St, Miami, FL.

Abstract

PURPOSE

To investigate the ability of generative artificial intelligence models to answer ophthalmology board–style questions.

DESIGN

Experimental study.

METHODS

This study evaluated 3 large language models (LLMs) with chat interfaces, Bing Chat (Microsoft) and ChatGPT 3.5 and 4.0 (OpenAI), using 250 questions from the Basic Science and Clinical Science Self-Assessment Program. Whereas ChatGPT is trained on information last updated in 2021, Bing Chat incorporates a more recently indexed internet search to generate its answers. Model performance was compared with that of human respondents. Questions were categorized by complexity and patient care phase, and instances of information fabrication or nonlogical reasoning were documented.

MAIN OUTCOME MEASURES

Primary outcome was response accuracy. Secondary outcomes were performance in question subcategories and hallucination frequency.

RESULTS

Human respondents had an average accuracy of 72.2%. ChatGPT-3.5 scored the lowest (58.8%), whereas ChatGPT-4.0 (71.6%) and Bing Chat (71.2%) performed comparably. ChatGPT-4.0 excelled in workup-type questions (odds ratio [OR], 3.89, 95% CI, 1.19-14.73, P = .03) compared with diagnostic questions, but struggled with image interpretation (OR, 0.14, 95% CI, 0.05-0.33, P < .01) when compared with single-step reasoning questions. Against single-step questions, Bing Chat also faced difficulties with image interpretation (OR, 0.18, 95% CI, 0.08-0.44, P < .01) and multi-step reasoning (OR, 0.30, 95% CI, 0.11-0.84, P = .02). ChatGPT-3.5 had the highest rate of hallucinations and nonlogical reasoning (42.4%), followed by Bing Chat (25.6%) and ChatGPT-4.0 (18.0%).
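As a rough sanity check on the percentages above, the sketch below converts each reported rate into an implied number of questions. It assumes that every model was scored on all 250 questions, which the abstract does not state explicitly; the helper names (`implied_count`, `is_whole`) are illustrative and not taken from the study.

```python
# Assumption: each model answered all 250 questions, so every reported
# percentage should correspond to a whole number of questions.
TOTAL_QUESTIONS = 250

# Percentages as reported in the RESULTS paragraph.
accuracy_pct = {"ChatGPT-3.5": 58.8, "ChatGPT-4.0": 71.6, "Bing Chat": 71.2}
hallucination_pct = {"ChatGPT-3.5": 42.4, "ChatGPT-4.0": 18.0, "Bing Chat": 25.6}
human_accuracy_pct = 72.2  # an average across respondents, not one test-taker


def implied_count(percent, total=TOTAL_QUESTIONS):
    """Number of questions implied by `percent` of `total`."""
    return percent * total / 100


def is_whole(value, tol=1e-6):
    """True if `value` is a whole number up to floating-point noise."""
    return abs(value - round(value)) < tol


# Implied counts per model, rounded to absorb floating-point noise.
correct = {m: round(implied_count(p)) for m, p in accuracy_pct.items()}
flawed = {m: round(implied_count(p)) for m, p in hallucination_pct.items()}

for model in accuracy_pct:
    print(f"{model}: {correct[model]}/{TOTAL_QUESTIONS} correct, "
          f"{flawed[model]}/{TOTAL_QUESTIONS} with hallucination or nonlogical reasoning")

# The human figure need not map to a whole count because it is an average.
print("Human average maps to a whole count:",
      is_whole(implied_count(human_accuracy_pct)))
```

Under that assumption, every model percentage corresponds to a whole number of questions, which is consistent with a shared 250-question denominator. The human figure does not, as expected for an average over many respondents.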

CONCLUSIONS

LLMs (particularly ChatGPT-4.0 and Bing Chat) can perform similarly to human respondents when answering questions from the Basic Science and Clinical Science Self-Assessment Program. The frequency of hallucinations and nonlogical reasoning suggests room for improvement in the performance of conversational agents in the medical domain.

Outline

METHODS

RESULTS

DISCUSSION

Supplemental Material available at AJO.com.

Vol 254

P. 141-149 - October 2023
