Sepsis as Seen through the Eyes of AI: A Comparative Evaluation of ChatGPT and Gemini - 24/12/25
Rahmet Güner

Highlights
• Cross-sectional benchmarking of ChatGPT-4o vs Gemini 2.5 Flash on 82 sepsis questions (FAQ + SSC) was carried out.
• Two infectious-disease raters independently applied the Global Quality Scale; reproducibility was tested by repeat queries.
• Gemini delivered markedly higher quality (GQS 5 in 94% vs 35.4% for ChatGPT) and greater reproducibility (97.5% vs 76.5% overall).
• Both models underperformed in the “prevention” domain, indicating a shared weakness.
• The findings support rigorous benchmarking and domain-specific optimization before public deployment.
Abstract
Introduction |
Large language models (LLMs) are increasingly used to seek health information online. Although these tools have great potential to improve digital health literacy, little is known about their accuracy and consistency, particularly for life-threatening conditions such as sepsis. The aim of this study was to evaluate and compare the effectiveness of two widely used LLMs, ChatGPT-4o and Gemini 2.5 Flash, in providing accurate and consistent answers to questions about sepsis.
Material and Methods
A cross-sectional benchmarking study was conducted using a standardized set of sepsis-related questions comprising two main categories: frequently asked questions (FAQs) and items drawn from the Surviving Sepsis Campaign (SSC) guidelines. The responses generated by the two models were independently assessed by two raters using the Global Quality Scale (GQS), and reproducibility was evaluated by submitting each question twice.
Results
Gemini significantly outperformed ChatGPT in overall quality and reproducibility. Specifically, 94% of Gemini’s responses received the highest rating (GQS 5), compared with only 35.4% of ChatGPT’s answers. Gemini also demonstrated higher reproducibility (97.5% vs. 76.5%). Both models underperformed in the “prevention” domain. Overall, Gemini showed greater potential than ChatGPT for delivering accurate and consistent sepsis-related health information, which is crucial for patients and caregivers alike.
Conclusion
These findings underscore the importance of rigorous benchmarking before integrating LLMs into digital health platforms, and illustrate a need for refinement of LLMs to enhance their reliability in public-facing health communication.
Keywords: Artificial intelligence, Sepsis, ChatGPT, Gemini, Large language model
Vol 56 - No 1
Article 105228 - January 2026