Reference standard for the evaluation of automatic segmentation algorithms: Quantification of inter observer variability of manual delineation of prostate contour on MRI - 31/01/24
, Dimitri Hamzaoui d, 1, Benjamin Granger e, Sarah Montagne f, g, h, Alexandre Allera g, Malek Ezziane g, Anna Luzurier g, Raphaelle Quint g, Mehdi Kalai g, Nicholas Ayache a, Hervé Delingette a, Raphaële Renard-Penna f, g, h
Highlights
• The number of readers affects the consistency and conformity of prostate segmentation on MRI.
• Inter-rater consistency shows a tipping point with three readers, and this number also marks a tipping point in the evolution of consensus segmentation volume according to the number of readers.
• Prostate segmentations exhibit maximum conformity to a reference with three readers.
• Three readers may be an optimal number of raters to consider for references for artificial intelligence applications for prostate segmentation.
Abstract
Purpose
The purpose of this study was to investigate the relationship between the number of readers and inter-reader variability in manual prostate contour segmentation on magnetic resonance imaging (MRI) examinations, and to determine the optimal number of readers required to establish a reliable reference standard.
Materials and methods
Seven radiologists with varying levels of experience independently performed manual segmentation of the prostate contour (whole-gland [WG] and transition zone [TZ]) on 40 prostate MRI examinations obtained in 40 patients. Inter-reader variability in prostate contour delineations was estimated using standard metrics (Dice similarity coefficient [DSC], Hausdorff distance and volume-based metrics). The impact of the number of readers (from two to seven) on segmentation variability was assessed using pairwise metrics (consistency) and metrics with respect to a reference segmentation (conformity), the reference being obtained with either majority voting or the simultaneous truth and performance level estimation (STAPLE) algorithm.
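The distinction between consistency (pairwise agreement) and conformity (agreement with a consensus reference) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each segmentation is represented as a set of voxel indices, uses a simple strict-majority vote as the reference (STAPLE is not shown), and all function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def dice(a, b):
    """Dice similarity coefficient between two masks given as sets of voxel indices."""
    if not a and not b:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


def majority_vote(masks):
    """Consensus mask: voxels selected by strictly more than half of the readers."""
    counts = Counter(voxel for mask in masks for voxel in mask)
    return {voxel for voxel, c in counts.items() if 2 * c > len(masks)}


def consistency(masks):
    """Mean DSC over all pairs of readers (pairwise metric)."""
    return mean(dice(a, b) for a, b in combinations(masks, 2))


def conformity(masks):
    """Mean DSC of each reader against the majority-vote reference."""
    reference = majority_vote(masks)
    return mean(dice(mask, reference) for mask in masks)


# Toy one-dimensional "voxels" for three hypothetical readers
readers = [{1, 2, 3, 4}, {2, 3, 4, 5}, {2, 3, 4}]
print(majority_vote(readers))
print(round(consistency(readers), 3), round(conformity(readers), 3))
```

In this toy case conformity is higher than consistency, because each reader is compared with a consensus that averages out individual disagreements rather than with another single reader.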
Results
The average segmentation DSC for two readers in pairwise comparison was 0.919 for WG and 0.876 for TZ. Variability decreased with the number of readers: the interquartile ranges of the DSC were 0.076 (WG) / 0.021 (TZ) for configurations with two readers, 0.005 (WG) / 0.012 (TZ) for configurations with three readers, and 0.002 (WG) / 0.0037 (TZ) for configurations with six readers. The interquartile range decreased slightly faster between two and three readers than between three and six readers. When using consensus methods, variability often reached its minimum with three readers (with STAPLE, DSC = 0.96 [range: 0.945–0.971] for WG and DSC = 0.94 [range: 0.912–0.957] for TZ), and the interquartile range was minimal for configurations with three readers.
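One way to read "interquartile range for configurations with k readers" is sketched below: enumerate every subset of k readers, score each subset by its mean pairwise DSC, and take the interquartile range of those scores. This aggregation is an assumption made for illustration, since the abstract does not spell out the exact procedure; the function names and toy masks are hypothetical.

```python
from itertools import combinations
from math import comb
from statistics import mean, quantiles


def dice(a, b):
    """Dice similarity coefficient between two masks given as sets of voxel indices."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty masks
    return 2 * len(a & b) / (len(a) + len(b))


def iqr(values):
    """Interquartile range (Q3 - Q1); returns 0.0 when fewer than two values."""
    if len(values) < 2:
        return 0.0
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def iqr_by_reader_count(masks):
    """Map k to the IQR of the mean pairwise DSC across all k-reader subsets."""
    result = {}
    for k in range(2, len(masks) + 1):
        scores = [
            mean(dice(a, b) for a, b in combinations(subset, 2))
            for subset in combinations(masks, k)
        ]
        result[k] = iqr(scores)
    return result


# Number of distinct configurations available with seven readers
configs = {k: comb(7, k) for k in range(2, 8)}

# Toy masks for four hypothetical readers
masks = [{1, 2, 3, 4}, {2, 3, 4, 5}, {2, 3, 4}, {1, 2, 3}]
print(configs)
print(iqr_by_reader_count(masks))
```

With seven readers there are 21 two-reader, 35 three-reader and 7 six-reader configurations, but only one seven-reader configuration, for which an interquartile range across configurations is not informative.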
Conclusion
The number of readers affects inter-reader variability, in terms of both inter-reader consistency and conformity to a reference. Three readers correspond either to minimal variability or to a tipping point in how variability evolves, for both pairwise metrics and metrics computed with respect to a reference. Accordingly, three readers may represent an optimal number to determine references for artificial intelligence applications.
The full text of this article is available as a PDF.
Keywords: Artificial intelligence, Inter-reader variability, Magnetic resonance imaging, Prostate, Segmentation
Abbreviations: 3D, AI, ASSD, DSC, HD, HD95, IQR, MRI, PCa, PSA, STAPLE, TZ, WG
Vol 105 - No. 2
P. 65-73 - February 2024
