Creating a standardized tool for the evaluation and comparison of artificial intelligence–based computer-aided detection programs in colonoscopy: a modified Delphi approach - 14/06/25
, Yuichi Mori, MD, PhD 2, 3, 4, Masashi Misawa, MD, PhD 2, James E. East, MD 5, 6, Cesare Hassan, MD, PhD 7, 8, Alessandro Repici, MD 7, 8, Michael F. Byrne, MD 9, 10, Daniel von Renteln, MD 11, David G. Hewett, MBBS, PhD, MSc 12, Pu Wang, MD 13, Yutaka Saito, MD, PhD 14, Carolina Ogawa Matsubayashi, MD 15, 16, Omer F. Ahmad, MBBS 17, Prateek Sharma, MBBS 18, Seth A. Gross, MD 19, Neil Sengupta, MD 20, Nabil Mansour, MD 21, Andrea Cherubini, PhD 22, Nhan Ngo Dinh 22, Xiao Xiao, PhD 23, Peter Mountney, PhD 24, 25, Juana González-Bueno Puyal, PhD 24, 25, Greg Little, MBA 25, Shawn LaRocco, MBA 25, Sailesh Conjeti, PhD 25, Hannes Seibt, MS 26, Dror Zur, PhD 27, Hitoshi Shimada, BEE 28, Tyler M. Berzin, MD ∗, 29, Jeremy R. Glissen Brown, MD, MSc ∗, 30

Abstract

Background and Aims
Multiple computer-aided detection (CADe) software programs have now achieved regulatory approval in the United States, Europe, and Asia and are being used in routine clinical practice to support colorectal cancer screening. There is uncertainty about how different CADe algorithms perform, and no objective methodology exists for comparing them. We aimed to identify priority scoring metrics for CADe evaluation and comparison.
Methods
A modified Delphi approach was used. Twenty-five global leaders in CADe in colonoscopy, including endoscopists, researchers, and industry representatives, participated in an online survey over the course of 8 months. Participants generated 121 scoring criteria, 54 of which were deemed within the study scope and distributed for review and asynchronous e-mail–based open comment. Participants then scored criteria in order of priority on a 5-point Likert scale during ranking round 1. The top 11 highest-priority criteria were redistributed, with another opportunity for open comment, followed by a final round of priority scoring to identify the final 6 criteria.
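The ranking mechanics described above can be sketched in a few lines. This is an illustrative toy only: the criterion names and ratings below are hypothetical, and the abstract does not describe the actual survey tooling.

```python
# Hypothetical sketch of the Delphi ranking step: each participant rates
# each criterion on a 5-point Likert scale (1 = lowest priority,
# 5 = highest), and criteria are ranked by mean priority score.

def mean_priority(ratings):
    """Average of 5-point Likert ratings for one criterion."""
    return sum(ratings) / len(ratings)

# Illustrative ratings, not the study's data.
survey = {
    "sensitivity": [5, 4, 4, 5, 3],
    "latency": [4, 3, 4, 4, 3],
}

# Criteria ordered from highest to lowest mean priority score.
ranked = sorted(survey, key=lambda c: mean_priority(survey[c]), reverse=True)
```

In the study itself this ordering was used twice: once over all 54 in-scope criteria, and again over the top 11 to select the final 6.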
Results
Mean priority scores for the 54 criteria ranged from 2.25 to 4.38 after the first ranking round. The top 11 criteria after round 1 of ranking yielded mean priority scores ranging from 3.04 to 4.16. The final 6 highest priority criteria, including a tie for first-place ranking, were (1, tied) sensitivity (average, 4.16) and (1, tied) separate and independent validation of the CADe algorithm (average, 4.16); (3) adenoma detection rate (average, 4.08); (4) false-positive rate (average, 4.00); (5) latency (average, 3.84); and (6) adenoma miss rate (average, 3.68).
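For readers unfamiliar with the top-ranked metrics, the conventional definitions can be expressed as simple ratios. These are standard textbook formulations, not the operational definitions fixed by the consensus process (e.g., per-polyp vs. per-frame counting), which the abstract does not specify.

```python
# Illustrative standard definitions of four of the six consensus metrics.
# Latency (detection delay) and independent validation are procedural
# criteria and are not expressible as simple ratios.

def sensitivity(true_pos, false_neg):
    """Fraction of true polyps that the CADe system detects."""
    return true_pos / (true_pos + false_neg)

def false_positive_rate(false_pos, true_neg):
    """Fraction of non-polyp events that the system flags as polyps."""
    return false_pos / (false_pos + true_neg)

def adenoma_detection_rate(procedures_with_adenoma, total_procedures):
    """ADR: share of colonoscopies in which >= 1 adenoma is detected."""
    return procedures_with_adenoma / total_procedures

def adenoma_miss_rate(missed_adenomas, total_adenomas):
    """AMR: share of adenomas present but not detected."""
    return missed_adenomas / total_adenomas
```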
Conclusions
This is the first reported international consensus statement of priority scoring metrics for CADe in colonoscopy. These scoring criteria should inform CADe software development and refinement. Future research should validate these metrics on a benchmark video dataset to develop a validated scoring instrument.
Abbreviations: ADR, adenoma detection rate; AI, artificial intelligence; AMR, adenoma miss rate; CADe, computer-aided detection; CRC, colorectal cancer; dBox; GTBox; IoU; LIS; NLIS; SSL
DIVERSITY, EQUITY, AND INCLUSION: One or more of the authors of this paper self-identifies as an under-represented gender minority in science. One or more of the authors of this paper self-identifies as an under-represented ethnic minority in science. The author list of this paper includes contributors from the location where the research was conducted who participated in the data collection, design, analysis, and/or interpretation of the work.
Vol 102 - N° 1, P. 109 - July 2025
