External validation of a commercially available deep learning algorithm for fracture detection in children - 20/11/21

Doi : 10.1016/j.diii.2021.10.007

Michel Dupuis ^a, Léo Delbos ^b, Raphael Veil ^b,^1, Catherine Adamsbaum ^a,^c,^1,^⁎
^1 R. V. and C.A. contributed equally to this work and share last co-authorship.
^a AP-HP, Bicêtre Hospital, Pediatric Imaging Department, 94270 Le Kremlin Bicêtre, France
^b AP-HP, Bicêtre Hospital, Epidemiology and Public Health Department, 94270 Le Kremlin Bicêtre, France
^c Paris Saclay University, Faculty of Medicine, 94270 Le Kremlin Bicêtre, France

^⁎Corresponding author.

In press. Proofs corrected by the author. Available online since Saturday 20 November 2021
This article has been published in an issue of the journal.

Highlights

•	Deep learning algorithms lack real-world external validation prior to clinical use.
•	The tested deep learning algorithm shows strong diagnostic performance in children.
•	Sensitivity of the tested algorithm is lower in children under 4 years.

Abstract

Purpose

The purpose of this study was to conduct an external validation of a fracture assessment deep learning algorithm (Rayvolve®) using digital radiographs from a real-life cohort of children presenting routinely to the emergency room.

Materials and methods

This retrospective study was conducted on 2634 radiography sets (5865 images) from 2549 children (1459 boys, 1090 girls; mean age, 8.5 ± 4.5 [SD] years; age range: 0–17 years) referred by the pediatric emergency room for trauma. For each set, the findings of the senior radiologists and of the algorithm were recorded: whether one or more fractures were found, the number of fractures, and their location. Using the senior radiologist diagnosis as the standard of reference, the diagnostic performance of the deep learning algorithm (Rayvolve®) was calculated via three different approaches: a detection approach (presence/absence of a fracture as a binary variable), an enumeration approach (exact number of fractures detected) and a localization approach (focusing on whether the detected fractures were correctly localized). Subgroup analyses were performed according to the presence or absence of a cast, age category (0–4 vs. 5–18 years) and anatomical region.
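As an illustration of the detection approach described above, the following minimal Python sketch computes sensitivity, specificity, accuracy and negative predictive value from per-set binary labels, taking the radiologist reading as the reference. It is not the authors' analysis code. The function names, the toy labels and the use of a Wilson score interval are assumptions made for illustration; the abstract does not state which confidence interval method was used.

```python
from math import sqrt


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    Assumption: the abstract does not name its CI method; Wilson is one
    common choice and is used here only as an example.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


def detection_metrics(reference, predicted):
    """Binary detection metrics, one boolean per radiography set.

    reference: fracture present according to the senior radiologist.
    predicted: fracture present according to the algorithm.
    Returns {metric_name: (estimate, (ci_low, ci_high))}.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    tp = fp = tn = fn = 0
    for ref, pred in zip(reference, predicted):
        if ref and pred:
            tp += 1
        elif ref and not pred:
            fn += 1
        elif not ref and pred:
            fp += 1
        else:
            tn += 1

    def proportion(successes, total):
        return successes / total, wilson_interval(successes, total)

    return {
        "sensitivity": proportion(tp, tp + fn),
        "specificity": proportion(tn, tn + fp),
        "accuracy": proportion(tp + tn, tp + tn + fp + fn),
        "npv": proportion(tn, tn + fn),
    }


# Toy labels (hypothetical, not study data): 10 sets with a fracture,
# 10 without; the algorithm misses 1 fracture and raises 2 false alarms.
toy_reference = [True] * 10 + [False] * 10
toy_predicted = [True] * 9 + [False] * 1 + [False] * 8 + [True] * 2
toy_metrics = detection_metrics(toy_reference, toy_predicted)
```

The enumeration and localization approaches would need a stricter definition of a correct set (matching fracture count or matching location), which the abstract does not detail, so only the binary case is sketched here.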

Results

For the detection approach, the deep learning algorithm yielded 95.7% sensitivity (95% CI: 94.0–96.9), 91.2% specificity (95% CI: 89.8–92.5) and 92.6% accuracy (95% CI: 91.5–93.6). For both the enumeration and localization approaches, it yielded 94.1% sensitivity (95% CI: 92.1–95.6), 88.8% specificity (95% CI: 87.3–90.2) and 90.4% accuracy (95% CI: 89.2–91.5). In the age-related subgroup analyses, the deep learning algorithm yielded greater sensitivity and negative predictive value in the 5–18-years age group than in the 0–4-years age group for the detection approach (P < 0.001 and P = 0.002) and for the enumeration and localization approaches (P = 0.012 and P = 0.028). The high negative predictive value was robust, persisting in all of the subgroup analyses, except for patients with casts (P = 0.001 for the detection approach and P < 0.001 for the enumeration and localization approaches).