Can artificial intelligence accurately detect and summarize anatomy education literature? A comparative analysis of ChatGPT and ScholarGPT - 27/11/25
, S. Kanakaris c, M. Piagkou c, I. Chryssanthou c, A.V. Vasiliadis d, K. Natsis e
Highlights
• ChatGPT and ScholarGPT correctly detected relevant anatomy education studies only at the first and simplest level of question complexity.
• At the next two levels, ScholarGPT performed slightly better, but still did not surpass 50% accuracy.
• For most of the relevant studies, the summaries lacked important information, and there appeared to be a bias in favor of the educational intervention.
• ChatGPT and ScholarGPT are not currently at an adequate level to meaningfully aid researchers in detecting and summarizing studies from the anatomy education literature.
Summary
Purpose |
Artificial intelligence platforms have been suggested as tools that can facilitate anatomy teachers’ work and students’ learning process. We aimed to investigate the ability of ChatGPT to detect and summarize studies from the anatomy education literature, compared to ScholarGPT, a version of ChatGPT specialized in academic research. Secondly, we aimed to explore whether the ability of each platform is influenced by the level of query complexity.
Methods |
We asked the two platforms to list five studies on each of the following three topics: (1) use of virtual reality in anatomy education, (2) use of stereoscopic virtual reality in anatomy education, (3) use of stereoscopic virtual reality in anatomy education involving the user's interaction with the virtual environment. We assessed whether the retrieved studies fulfilled the search criteria, and whether their summaries were accurate (i.e., whether they contained true information about all the educational results reported in the article's abstract).
Results |
ChatGPT's percentages of successful detection were 100%, 60% and 0%, respectively, for the three queries. The percentages of accurate summaries were 60%, 20% and 0%, respectively. ScholarGPT performed better, with successful detection percentages of 100%, 60% and 40%, respectively. The percentages of accurate summaries were 80%, 60% and 40%, respectively. Both platforms showed a bias in favor of the educational intervention.
Conclusions |
ChatGPT and ScholarGPT are not currently at an adequate level to meaningfully aid researchers in detecting and summarizing studies from the anatomy education literature. Ongoing research may increase the ability of these platforms to provide more reliable information.
Keywords: ChatGPT, ScholarGPT, Artificial intelligence, Virtual reality, Anatomy, Anatomy education
Vol 109 - N° 367
Article 101061 - December 2025
