Subscribe to RSS
DOI: 10.1055/a-1290-8070
Comparison of Prostate MRI Lesion Segmentation Agreement Between Multiple Radiologists and a Fully Automatic Deep Learning System
Vergleich der Kongruenz von Prostata-MRT-Läsionssegmentationen durch mehrere Radiologen und ein vollautomatisches Deep-Learning-SystemAbstract
Purpose A recently developed deep learning model (U-Net) approximated the clinical performance of radiologists in the prediction of clinically significant prostate cancer (sPC) from prostate MRI. Here, we compare the agreement between lesion segmentations by U-Net with manual lesion segmentations performed by different radiologists.
Materials and Methods 165 patients with suspicion for sPC underwent targeted and systematic fusion biopsy following 3 Tesla multiparametric MRI (mpMRI). Five sets of segmentations were generated retrospectively: segmentations of clinical lesions, independent segmentations by three radiologists, and fully automated bi-parametric U-Net segmentations. Per-lesion agreement was calculated for each rater by averaging Dice coefficients with all overlapping lesions from other raters. Agreement was compared using descriptive statistics and linear mixed models.
Results The mean Dice coefficient for manual segmentations showed only moderate agreement at 0.48–0.52, reflecting the difficult visual task of determining the outline of otherwise jointly detected lesions. U-net segmentations were significantly smaller than manual segmentations (p < 0.0001) and exhibited a lower mean Dice coefficient of 0.22, which was significantly lower compared to manual segmentations (all p < 0.0001). These differences remained after correction for lesion size and were unaffected between sPC and non-sPC lesions and between peripheral and transition zone lesions.
Conclusion Knowledge of the order of agreement of manual segmentations of different radiologists is important to set the expectation value for artificial intelligence (AI) systems in the task of prostate MRI lesion segmentation. Perfect agreement (Dice coefficient of one) should not be expected for AI. Lower Dice coefficients of U-Net compared to manual segmentations are only partially explained by smaller segmentation sizes and may result from a focus on the lesion core and a small relative lesion center shift. Although it is primarily important that AI detects sPC correctly, the Dice coefficient for overlapping lesions from multiple raters can be used as a secondary measure for segmentation quality in future studies.
Key Points:
-
Intermediate human Dice coefficients reflect the difficulty of outlining jointly detected lesions.
-
Lower Dice coefficients of deep learning motivate further research to approximate human perception.
-
Comparable predictive performance of deep learning appears independent of Dice agreement.
-
Dice agreement independent of significant cancer presence indicates indistinguishability of some benign imaging findings.
-
Improving DWI to T2 registration may improve the observed U-Net Dice coefficients.
Citation Format
-
Schelb P, Tavakoli AA, Tubtawee T et al. Comparison of Prostate MRI Lesion Segmentation Agreement Between Multiple Radiologists and a Fully Automatic Deep Learning System. Fortschr Röntgenstr 2021; 193: 559 – 573
#
Zusammenfassung
Ziel Ein kürzlich eigens entwickeltes künstliches neuronales Netzwerk (U-Net) zeigte eine gute und mit klinischer radiologischer Befundung vergleichbare Erkennungsrate klinisch signifikanter Prostatakarzinome (sPC). In dieser Arbeit wird nun die Kongruenz der durch U-Net und mehrere Radiologen erstellten Läsionsvolumina (der Segmentationen) verglichen.
Materialien und Methoden 165 Patienten mit Verdacht auf sPC erhielten eine multiparametrische MRT (mpMRT) bei 3 Tesla, gefolgt von gezielter und systematischer MR/TRUS-Fusionsbiopsie. Fünf Segmentationen pro Untersuchung wurden erstellt: Segmentationen klinischer Läsionen, unabhängige und geblindete retrospektive PI-RADS-Befundung durch 3 Radiologen und U-Net. Die läsionsbasierte Übereinstimmung für jeden Befunder wurde durch den Dice-Koeffizienten mit überlappenden Läsionen anderer Befunder bestimmt. Die Übereinstimmung wurde durch deskriptive Statistik und lineare gemischte Modelle verglichen.
Ergebnisse Der mittlere Dice-Koeffizient war für Radiologen mit 0,48–0,52 nur moderat kongruent als Ausdruck der schwierigen visuellen Aufgabe, die Begrenzung sonst übereinstimmend detektierter Läsionen zu bestimmen. U-Net-Segmentationen waren signifikant kleiner als manuelle Segmentationen (p < 0,0001) und zeigten einen geringeren mittleren Dice-Koeffizienten von 0,22, signifikant kleiner als manuelle Segmentationen (alle p < 0,0001). Diese Unterschiede blieben nach Adjustierung für die Segmentationsgröße bestehen und wurden nicht durch das Vorliegen eines sPC oder eine zonale Lokalisation in der peripheren oder Transitionszone beeinflusst.
Schlussfolgerung Die Kenntnis der Größenordnung der Übereinstimmung manueller Segmentationen verschiedener Radiologen ist wichtig, um den Erwartungswert für Künstliche-Intelligenz (KI) -Ansätze festzulegen. Eine perfekte Übereinstimmung (Dice-Koeffizient von 1) sollte für KI nicht erwartet werden. Die geringeren Dice-Koeffizienten des U-Nets werden nur teilweise durch die geringere Segmentationsgröße des U-Nets erklärt, was durch eine Fokussierung des U-Nets auf den Läsionskern und eine geringe Verschiebung des Läsionszentrums erklärt werden könnte. Obwohl primär die korrekte Detektion von sPC durch KI wichtig ist, kann der Dice-Koeffizient mit multiplen Befundern als sekundäres Qualitätsmaß in zukünftigen Studien herangezogen werden.
Kernaussagen:
-
Intermediäre Dice-Koeffizienten der Radiologen reflektieren die Schwierigkeit der übereinstimmenden Festlegung der Berandung gemeinsam detektierter Läsionen.
-
Die beobachteten geringeren Dice-Koeffizienten motivieren die Weiterentwicklung von Deep Learning Systemen mit dem Ziel der besseren Approximation menschlicher Perzeption.
-
Eine vergleichbare Prädiktion des klinisch signifikanten Prostatakarzinoms erscheint unabhängig von der Übereinstimmung der Dice-Koeffizienten.
-
Die Unabhängigkeit des Dice-Koeffizienten vom Vorliegen eines signifikanten Prostatakarzinoms spricht für die fehlende Unterscheidbarkeit mancher benigner von malignen Bildcharakteristika.
-
Technische Verbesserungen in der Bildregistrierung zwischen DWI und T2 können in Zukunft möglicherweise die U-Net Dice-Koeffizienten erhöhen.
#
Key words
MRI - prostate - prostate cancer - deep learning - artificial intelligence - convolutional neural networkIntroduction
Prostate MRI in combination with stereotactic MR-guided biopsies is becoming an important cornerstone of the diagnostic pathway in patients with suspicion of clinically significant prostate cancer (sPC) [1] [2]. As a result of several major clinical trials, pre-biopsy prostate MRI is increasingly utilized to better guide biopsy needles to suspicious targets and to potentially avoid biopsies in patients with unsuspicious prostate MRI [3] [4]. However, biopsy avoidance is limited by the known underestimation of multifocal lesions on prostate MRI [5] [6]. With its increased use, the demand for consistent expert-level interpretation of prostate MRI is rising. The Prostate Imaging – Reporting and Data System (PI-RADS) is the current standard in clinical prostate MRI interpretation and aims at reducing inter-reader variability and at standardizing interpretation and clinical MR protocols [7] [8]. Novel artificial intelligence approaches such as convolutional neural networks (CNN) promise to capture diagnostically decisive information directly from medical images [9] [10]. In the prostate, systems providing fully automated prostate assessment and lesion segmentation [11] or based on slice classification [12] have been developed. Other applications have utilized CNNs to evaluate predefined regions on prostate MR images [13]. It is important to ascertain that such systems have an acceptable true-negative rate in clinical screening scenarios. This requires evaluation of clinical performance in consecutive patients screened for sPC also including patients that are ultimately not diagnosed with sPC [11], rather than evaluating pre-selected cohorts such as only patients with visible MR lesions [12]. Systems that utilize bi-parametric (T2-weighted and diffusion-weighted) MRI [11] are potentially able to extract more information than systems focusing only on a single sequence [12], as the multiparametric approach to prostate MRI has long been known to benefit from the combination of high-resolution anatomical and information-rich functional imaging [14]. It is generally expected that increasing amounts of data for the training of neural networks will improve the quality and generalizability of these models. Data annotation requires significant time and resources. High-quality 3 D lesion annotations on bi-parametric sequences [11] will not be possible for extremely large data sets. As such, in medical deep learning in general, patient-level [15] or slice-level [16] annotations are being explored as especially patient-level information is readily available in hospital information systems and any AI method trained on such data could be directly applied to the largest possible retrospectively available data sets. However, before accepting end-to-end approaches that skip lesion segmentation and proceed directly to patient-level assessment [16], it is of utmost importance to ascertain that these systems in fact base their decisions on the correct portion of the data. It is possible that prostate classification systems base their patient assessment on extra-prostatic tissues that correlate to other risk factors such as age instead of on prostate tissues. Also, it can be argued whether a segmentation CNN such as the U-Net [10] is required to exactly reproduce the segmentations of a single radiologist. From a diagnostic standpoint, it would be sufficient for the system to indicate the location and presence of a lesion without agreeing fully with the extent provided by a radiologist. Also, at present, the Dice coefficient [17], a commonly used spatial overlap index, is expected to be high for clearly defined structures such as the prostate itself, but understandably lower for prostate lesions given the known inter-observer variability in lesion assessment in the difficult task of detecting suspicious lesions on a background of hyperplastic, inflammatory, and degenerative changes [18]. We hypothesized that the inter-rater variability between different radiologists will be substantially lower than perfect (Dice coefficient of one) and comparable to that of a previously developed deep learning algorithm for automatic prostate MRI assessment.
The purpose of this study was to directly compare three-dimensional lesion segmentations based on blinded retrospective PI-RADS interpretations of three radiologists with retrospective segmentations of lesions indicated in clinical reports and with automated lesion segmentations provided by a previously established U-Net.
#
Materials and Methods
Patient cohort
The examined patient cohort is part of the previously published cohort which was used to train and evaluate the U-Net architecture used in this study [11]. Inclusion criteria for the study sample were MRI performed from May 2015 to February 2016; consecutive men with a clinical indication for biopsy based on prostate-specific antigen (PSA) elevation and clinical examination or participation in our active surveillance program; imaging performed with our main institutional 3.0 Tesla MRI system; MRI-transrectal US fusion biopsy performed at our institution; detection of a focal lesion by at least one of the raters and overlap of each detected lesion with at least one lesion by another rater. The exclusion criteria were history of treatment for prostate cancer (prostatectomy, radiation therapy, antihormonal therapy, focal therapy); biopsy within the past 6 months prior to MRI; and incomplete sequences or severe MRI artifacts. The institutional and governmental ethics committee approved the study and waived informed consent.
#
MR imaging
MR images were acquired at 3 Tesla (Magnetom Prisma, Siemens Healthcare, Erlangen, Germany) using the standard multi-channel body coil and integrated spine phased-array coil, according to the European Society of Urogenital Radiology (ESUR) guidelines [19]. As per the institutional protocol, axial, coronal and sagittal T2-weighted (T2) images, echo-planar imaging (EPI) diffusion-weighted images (DWI) and dynamic-contrast enhanced (DCE) images were acquired. Clinical interpretation by board certified radiologists included PI-RADS assessment for each detected lesion and a pictogram indicating lesion location [8] [20].
#
Systematic and targeted MRI/TRUS-fusion biopsies
All men underwent transperineal grid-directed biopsy performed under general anesthesia with rigid software registration using the BiopSee system (MEDCOM, Darmstadt, Germany). Targeted fusion-biopsy (FTB) of MRI-suspicious lesions (inter-quartile range (IQR) 4–8 cores, median 6 per lesion) was followed by systematic saturation biopsy (20–29 cores, median 24 cores), as previously described [1] [21]. This extended targeted and systematic biopsy approach has been validated against and confirmed to be concordant with RP specimens [21]. A median of 30 biopsies (IQR 24–37) were taken per patient with the number of biopsies adjusted to prostate volume [22]. Systematic biopsies were collected from sextants according to the Ginsburg Study group protocol [22].
#
Manual and U-Net MR lesion segmentations
Five sets of segmentations were recorded: prospectively called clinical lesions (CL) were retrospectively segmented by a prostate MRI researcher (blinded) with 6 months of experience using series and slice number and descriptions from clinical reports and their embedded pictograms, under supervision of a board-certified senior radiologist with 10 years of experience in prostate MRI (blinded). Here, we utilized the same segmentations also used for previous training and validation of the U-Net. Three radiologists performed previously unreported new independent retrospective interpretations according to PI-RADS version 2.1 blinded to pathology results and clinical reports but with access to PSA values. Radiologists included an expert with 12 years of prostate MRI experience (blinded, R1), a board-certified radiologist with 3 years of experience (blinded, R2) and a radiology resident with 2 years of experience and research focus in prostate MRI (blinded, R3). Segmentations were performed separately on diffusion-weighted (DWI) images (using both apparent diffusion coefficient (ADC) and b-value = 1500 s/mm2 (B1500) for lesion assessment) and T2-weighted (T2w) images using the medical imaging toolkit (MITK, www.mitk.org) [23] [24]. Three-dimensional (3 D) volumes of interest (VOI) were manually drawn using the MITK polygon tool by the investigators. Investigators were instructed to review the entire examination including T2w, DWI and DCE images first and then perform the segmentations integrating the visual appearance and the mental image of the appearance on all sequences for delineation. In addition, probability maps were provided by U-Net. Briefly, U-Net was previously trained and validated for prostate and lesion segmentation using a cohort of 250 examinations for training and cross-validation and another cohort of 62 examinations for independent testing [11] and has demonstrated comparable performance to clinical PI-RADS interpretation. For 134 of the 165 included examinations that were part of the original training set, the four U-Nets of the full ensemble of 16 U-Nets that had not been trained with each respective case were used to calculate average probability maps for each patient. For the remaining 31 examinations that were part of the original test set, all 16 U-Net members of the ensemble contributed to the probability maps. Probability maps were thresholded at 0.22, corresponding to the threshold mimicking a PI-RADS 3 assessment established during definition and validation of the U-Net [11]. In the resulting binary images, non-contiguous regions were separated, and each isolated region considered as a separate lesion segmentation. In addition, prostate contours were segmented manually on T2w images. The prostate masks were then automatically partitioned into sextants according to the Ginsburg protocol using the mid-sagittal plane and four additional angulated planes [5] [25].
#
Histopathology and fused ground truth
Histopathological analyses were performed under supervision of one dedicated uropathologist (blinded, 17 years of experience) according to International Society of Urological Pathology standards. sPC was defined as International Society of Urological Pathology (ISUP) grade group ≥ 2 [26]. Separate assessment of the ISUP grade group was provided for each MR targeted lesion and for each sextant of the Ginsburg biopsy scheme. As it is known that targeted biopsies can underestimate sPC in MR index lesions [21], we use a fused ground truth for optimal assessment of histopathology attributable to any MR segmentation. For this purpose, each sextant segmentation is assigned its corresponding systematic core histopathology. Then, the voxel-wise overlap between each sextant segmentation and each clinical targeted segmentation is determined and all sextants that overlap are assigned to the maximum ISUP grade group of either systematic sextant or targeted lesion histopathology. This way a high-quality ground truth is established. The utilized extended systematic and targeted biopsy approach has been previously shown to have excellent sensitivity (97 %) for the presence of sPC compared to radical prostatectomy specimens [1].
#
Statistical analysis
Overlap analysis was based on T2w segmentations. We used only lesions for the analysis for which at least one of the other raters provided an overlapping segmentation. Some raters provided segmentations for probably benign lesions such as BPH nodules, which were excluded from the analysis. In this way, we exclude the detection task from the assessment and focus only on the a posteriori task of outlining the lesion boundary once at least two raters have determined that a lesion should be segmented at a specific location. Descriptive statistics were used to summarize the demographic and clinical characteristics of the patient cohort and the distribution of sPC into prostate zones and ISUP grade group. A logistic mixed model with random patient effects was used to test for difference in sPC probability between MR lesions in the PZ and TZ. The mutual voxel overlap of each lesion segmentation with each other lesion segmentation per examination was determined. Dice coefficient [17] was calculated as:
With ǀAǀ and ǀBǀ indicating the cardinality (the number of voxels) in segmentations A and B, respectively, and A ∩ B indicating the set of voxels that overlap between segmentations A and B. We determined the average Dice coefficient across lesions for each pair of readers as a measure of pairwise inter-rater agreement. To compare each reader to all other readers, each lesion of that rater was regarded as a reference lesion and the Dice coefficients of all other readers’ lesions overlapping with the reference lesion were averaged. The result is a metric that represents the congruence of lesion outlines between each reference rater and all other raters that decided to segment a lesion at the same location. To study the influence of lesion size mismatch, we calculated the lesion size ratio as the ratio of the size of the larger to the smaller lesion. Linear mixed models with random patient and lesion effects were constructed for pairwise comparison of average reader Dice coefficients. All pairwise comparisons were performed using Tukey’s methods for multiplicity adjustment. In addition, models corrected for log-transformed voxel ratio were calculated. Analyses were performed separately for all lesions, for lesions containing no sPC, and for those containing sPC to examine whether readers agree more in sPC-positive lesions. The results were plotted as jittered dot plots superimposed on box plots. Lesion size distribution was plotted as histograms. A correlation between voxel ratio and Dice coefficient was plotted and smoothed using local regression. Per-lesion deviation of lesion size from the mean of all raters was depicted as box plots. Statistical analyses were performed using R version 3.6 (R Foundation for Statistical Computing, Vienna, Austria, using packages emmeans and lme4) [27]. Reporting followed Standards of Reporting of Diagnostic Accuracy [28].
#
#
Results
General characteristics and descriptive statistics
From 425 men presenting at our institutions in the inclusion period, 165 men (median age: 64 years; interquartile range: 58–71 years) met the inclusion and exclusion criteria ([Fig. 1]). Of the 165 patients included in the study, 82 (50 %) were biopsy naïve, 50 (30 %) had received prior negative biopsies and 33 (20 %) were enrolled in active surveillance at the time of the MR exam. Baseline epidemiological and clinical characteristics including pathological findings, and clinical assessment are given in [Table 1]. [Table 2] gives a detailed overview of all overlapping lesions, MR assessments per lesion, number of MR lesions per patient, and number of overlapping lesions with other raters for each rater individually. For all raters including CNN, complete agreement that a lesion was present at a specific location occurred most frequently (41–44 % of lesions) compared to only 3 additional raters agreeing which was the second most common occurrence for human raters (25–30 % of lesions), while for CNN it was agreement with one additional rater (30 % of lesions).
cohort |
n = 165 |
age (years) |
|
|
64 (58–71) |
sPC found in MRI lesion (n (%)) |
436/868 (50 %) |
no sPC found in MRI lesion |
432/868 (50 %) |
lesions in the PZ |
632/868 (75 %) |
lesions in the TZ |
220/868 (25 %) |
multi-zonal lesions |
16/868 (2 %) |
sPC in PZ lesions |
317/639 (50 %)[‡] |
sPC in TZ lesions |
109/217 (50 %)[‡] |
sPC in multi-zonal lesions |
10/12 (83 %) |
per-patient maximum Gleason Score/ISUP grade group (n (%)) |
|
|
59 (36 %) |
|
28 (17 %) |
|
52 (32 %) |
|
11 (7 %) |
|
4 (2 %) |
|
7 (4 %) |
|
4 (2 %) |
|
0 (0 %) |
PSA (ng/ml) median (IQR) |
7.2 (5.2–10.4) |
biopsy distribution per patient (n (%)) |
|
|
82 (50 %) |
|
50 (30 %) |
|
33 (20 %) |
IQR = interquartile range; PSA = prostate specific antigen; MRI = magnetic resonance imaging; sPC = clinically significant prostate cancer; PZ = peripheral zone; TZ = transition zone; GG = ISUP grade group.
‡ difference in logistic mixed model not significant (p = 0.16) in sPC and non-sPC containing lesions between PZ and TZ.
PI-RADS = Prostate Imaging Reporting and Data System; PZ = peripheral zone; TZ = transition zone.
#
CNN has lower lesion boundary agreement compared to human raters
Box plots of average per-lesion Dice coefficients between each reference reader and all other readers are given in [Fig. 2]. Averages of lesion-based Dice coefficients for all pairwise rater comparisons are given in [Table 3], demonstrating that lesion boundary agreement according to Dice coefficients was lower for CNN compared to human raters. Linear mixed model analyses with random patient and lesion effects demonstrated significant differences in Dice coefficients between CNN and all human raters (p < 0.0001 for all comparisons) ([Table 4]). Lesion-based mean and standard deviation of each reference rater to all other raters within the set of all lesions is given in [Table 4]. The mean Dice coefficient with all other raters was 0.22 for CNN compared to 0.48–0.52 for the other readers.
#
U-Net segments smaller lesions compared to human raters
[Fig. 3] depicts histograms of lesion sizes in voxels for each reader, with PI-RADS categories and zonal distribution being indicated separately. CNN segments the largest number of small lesions up to 1000 voxels (N = 107 compared to 32, 28, 57 and 46 for prospective, and readers 1–3, respectively), while it segments fewer lesions of larger size compared to the human raters. [Fig. 4A] demonstrates the percent deviation of segmentation size from the per-lesion rater mean. CNN segmentations were on average 21 % smaller than the mean of all raters and significantly smaller in all pairwise comparisons (all p < = 0.0001). In addition, some of the large multi-zonal lesions are broken into smaller parts and thus appear both in the group of the smallest and largest lesions for CNN, while – as expected – these lesions are all segmented into the largest lesion group by all human readers.
#
Lesion boundary agreement depends on segmentation size
[Fig. 4B] demonstrates the dependence of Dice coefficients on the lesion size ratio. Lesion size mismatch is an important factor contributing to low Dice coefficients, showing e. g. for CNN that at comparable lesion sizes Dice coefficients approximate 0.5, while they fall to about one quarter of that as the size mismatch reaches 10:1.
#
CNN agreement remains reduced after correction for segmentation size
Adjusting linear mixed model analyses with random patient and lesion effects for segmentation size did not remove significant differences in Dice coefficients between CNN and all human raters (p < 0.0001) ([Table 4]). Thus, the difference in segmentation size only partially explains the overall lower Dice coefficient of CNN. This is supported by segmentations of Reader 1 being an average of 18 % larger than segmentations of all other raters and significantly larger than Reader 2 (p < 0.0001) and Reader 3 (p < 0.001), while there is an absence of a significant overlap difference for Reader 1 ([Fig. 4A], [Table 4]). Most of the size difference for Reader 1 is explained by segmentation of a smaller number of small PI-RADS 3 lesions (< 1000 voxels) and a larger number of PI-RADS 4 lesions between 1000 and 2000 voxels in size ([Fig. 3]).
#
Agreement in sPC lesions
Additional analysis of the subgroups of sPC and non-sPC lesions is depicted in [Fig. 2] for all lesions and given in [Table 3], [4]. Dice coefficients with other readers were significantly higher (p < 0.001, [Table 4]) for CNN in lesions with sPC compared to non-sPC lesions, but not for human readers. However, this significance was removed (p = 0.3) after adjusting for voxel ratio.
[Fig. 5] shows representative examples of lesion segmentations with a higher Dice coefficient of 0.71–0.77 (A) and low Dice coefficients of 0.08–0.16 (B) between CNN and other segmentations.
#
Influence of zonal lesion location
There was no statistical difference in the rate of sPC in reader-detected lesions between the PZ and the TZ (p = 0.17). Box plots of average per-lesion Dice coefficients between each reference reader and all other readers are depicted in [Fig. 6] stratified by the peripheral zone (PZ) and the transition zone (TZ). Additional analysis separated by PZ and TZ is given in [Table 3], [4]. When analyzing average agreement between readers stratified by zonal location, there was a slight but non-significant lower agreement in TZ lesions overall (average Dice coefficient for TZ lesions lower by 0.04, p = 0.09). Accordingly, for the readers individually – including CNN – there was no significant difference between the mean Dice coefficient of PZ and TZ lesions (all p > 0.067, [Table 4]). The reported differences in average agreement between CNN and other readers were independent of the zonal location, i. e. there was no heterogeneity in difference between readers with respect to zonal locations (interaction p = 0.9). Accordingly, pairwise comparisons of the average Dice coefficient between CNN and other readers remained significant in separate analyses of PZ and TZ lesions (all p < 0.0001), indicating significantly lower average agreement of CNN for both zonal locations at a similar magnitude compared to zone-independent analysis. Lesions were larger in the TZ (mean 6524, IQR 5999 voxels) compared to PZ (mean 3687, IQR 2695 voxels; p < 0.0001) for all raters except for CNN (TZ: mean 2272, IQR 2019 voxels; PZ: mean 2499, IQR 2464 voxels; p = 0.34). The bottom part of [Fig. 3] shows that the larger number of small CNN lesions does not favor the PZ or TZ specifically but affects both zones similarly.
#
#
Discussion
In this study we compare the agreement in lesion segmentation on prostate MRI between 3 radiologists, retrospectively performed segmentations of clinically reported lesions and U-Net segmentations. We find that the mean Dice coefficients between manual segmentations are 0.48–0.52, establishing that radiologists do not completely agree about the boundaries of lesions that at least two readers consider reportable. This is in accordance with the known difficult task of visual prostate MRI interpretation which is also a source of inter-rater variability [18]. We find that a previously established automatic deep learning system (U-Net) which has achieved similar performance to clinical PI-RADS assessment [11] produces segmentations that are systematically smaller than segmentations of human raters. This smaller size bias of the CNN is a major contributor to lower segmentation agreement with Dice coefficients of 0.25–0.28 compared to human raters. U-Net thus appears to have a tendency to focus on the lesion core while radiologists use additional clues to also capture the lesion periphery. It is important to note that U-Net was trained using two-dimensional cross-entropy loss and not a Dice loss function. As such, training attempted to reproduce as much of the lesion area, but had to balance false-positive discoveries outside of the known tumor foci. As such, the training resulted in a delineation of the highest probability voxels within a lesion. In addition, the calibration of the U-Net which was performed on the validation set during model development to approximate the clinical performance at PI-RADS≥ 3 and ≥ 4 as closely as possible led to the selection of probability thresholds 0.22 for emulation of PI-RADS≥ 3 decisions, and 0.33 for emulation of PI-RADS≥ 4 decisions. Most of the lesion area not currently included in CNN segmentations would become segmented if the threshold would be lowered below the 0.22 setting. However, this would increase the discovery of false-positive lesions elsewhere in the prostate and affect performance unless a local neighborhood region growing criterion were used to expand the lesion territory. Another likely important factor is the co-registration of DWI images to T2 used in U-Net training and predictions. U-Net reports those voxels that are suspicious on bi-parametric (T2w and DWI) co-registered data. If co-registration is not perfect, voxels of imperfect overlap will be excluded from the segmentation as these incorrectly co-register to a normal-appearing T2 signal with suspicious ADC findings or vice versa. Co-registration in prostate MRI is a known difficult challenge [29], as DWI and T2w images are often affected by significant misregistration due to DWI distortion by susceptibility interfaces to rectal air and bowel motion/varying rectal distension during the examination. The published U-Net used rigid followed by affine registration, while newer developments in our group are using optimized registration algorithms, e. g. b-spline registration and non-linear registration. It is possible that remaining misregistration accounts for some of the observed Dice coefficient reduction in this study. While lesion size is identified as an important factor resulting in reduced Dice coefficients for U-Net, it does not explain all of the differences, e. g. lesion size segmentation differences existed also within human readers. Reader 1 segmented on average larger lesions. However, there was no significantly reduced agreement between the human readers. One possible explanation is that CNN lesions are shifted relative to the common lesion core of human raters while human raters more consistently agree on the lesion center. Therefore, on average, the lesions of Reader 1 encompass the smaller lesions from the other human raters more concentrically than the CNN lesions are encompassed by all other lesions.
Dai et al. have previously evaluated a region proposal CNN (Mask R-CNN) to determine the Dice coefficient between radiologists and automated segmentation. They used a total of 120 patients (78 public, 42 private) with training of the lesion detection using 45 public patients and 21 private patients. These numbers are smaller compared to the 250 patients used for training of the currently evaluated CNN. Also, the U-Net approach used here represents a different paradigm than the region proposal approach in that it directly determines the segmentation from the data. It is not possible to determine which of the two (Mask R-CNN and U-Net) is better by comparison to the Dai et al. study. This would require comparison of both algorithms in the same cohort. Dice coefficients reported by Dai et al. in the private testing set were 0.38–0.46 compared to 0.28–0.31 in sPC lesions in this study. In our review, the public PROSTATEx-2 dataset has larger and clearer lesions than the consecutive clinical cohorts presenting in the clinical routine to our institution, thus explaining the higher Dice coefficients compared to our study. Dai et al. did not specify whether consecutive clinical patients were included, while it is important to reflect the clinical routine as closely as possible to arrive at clinically meaningful comparisons. In the present study it is an advantage that the CNN operated on the same consecutive patients that were seen by radiologists in the clinical routine. Dice coefficients have limited direct comparability between studies for several reasons. The distribution of lesions may differ markedly, e. g. patients may present with more advanced tumors to one institution and with more subtle imaging findings to another institution. Dai et al. did not report average PSA levels of the included patients. In fact, they provided almost no demographic or clinical information, which would partly have allowed better estimation of cohort differences. Also, the paradigm of lesion delineation may affect the Dice coefficient. In Dai et al. three radiologists delineated lesions using the histopathological results and clinical reports. As far as can be told, the three radiologists shared the work of delineation as no comparison of radiologist performance is given. The depth of comparison of lesion delineation in the current study is more profound, as three radiologists independently delineated lesions while performing independent re-reads of the clinical cases, thus reflecting the true clinical decision process, unbiased by knowledge of the biopsy results and previous reports. In addition, the prospective clinical performance was compared. In the current study the variability of such independent radiologist assessment can be directly compared to CNN performance, a novel insight which is of importance to correctly put CNN performance into context. Importantly, the diagnostic performance of the CNN evaluated in this study in terms of sPC detection has been previously shown [11]. The same diagnostic performance in terms of sPC detection was achieved even if the CNN segments lesions systematically smaller than the radiologists. As such, the CNN may focus more on the lesion core, while the radiologists focus more on outlining the entire lesion. It is important to note that while differences in voxel-wise agreement between radiologists and CNN exist, the resulting diagnostic performance is of primary importance such that the clinical value of a CNN cannot be determined only from examining metrics of voxel-wise agreement such as the Dice coefficient. While Dai et al. did not provide a zone-specific analysis, we also performed separate analyses in the TZ and PZ and found that the differences in Dice coefficients between CNN and the other readers are present at the same magnitude in both zones. In addition, the agreement with the other readers did not differ significantly in the PZ or TZ for any of the readers. While independence of quantitative mADC assessment [30] from zonal lesion location has been reported previously, and the diffusion component of the mpMRI examination has become more important in the most recent PI-RADS version 2.1 update [31], our findings provide further evidence that prostate zones do not govern differences in the lesion detection task neither for radiologists nor for the examined CNN. We note that sizes of lesions detected by radiologists were larger in the TZ compared to the PZ, likely reflecting the more difficult task of identifying small lesions in the often more heterogeneous TZ. This difference was absent for CNN, indicating that CNN training did not result in the data-driven formation of a zone-specific lesion size filter representation in the network.
Lesion overlap Dice coefficient to multi-rater data as utilized in this study is surely an expensive measure as performance of segmentations by multiple radiologists is prohibitive in any large scale situation, where it is difficult enough to collect a single expert’s segmentations. As such, it is of high interest to find that radiologists evaluated by the same metric as CNN do not perform much better than 0.5. This number can be regarded as a new benchmark for inter-rater agreement for the task of lesion boundary definition that can be used to evaluate if the lesion definitions agree with manual segmentations in the same way as different manual segmentations agree with each other. It is important to note that clinical usefulness is of primary concern when developing an AI system. In end-to-end AI systems [16], all that counts is that the entire processing from raw data to the clinically essential diagnostic information is included in a single trainable algorithm. Segmentation in itself is not mandatory to arrive at a prediction of the risk of a patient to harbor sPC. However, directly comparing segmentations as in the current study is crucial for AI explainability [32], since, especially for medical applications, it is important to assert that AI systems follow generally accepted patterns in medical decision making. As one finds in this study, differences may exist and require evaluation of their potential effect on the diagnostic utility of the method.
An important finding is that the lesion agreement according to Dice coefficient showed no significant changes when focusing only on sPC-positive lesions compared to sPC-negative lesions for the human raters, while U-Net demonstrated higher Dice coefficients in sPC-positive lesions before correction for lesion size mismatch, but not afterwards. For human raters this indicates that benign findings on prostate MRI can exactly mimic the appearance of cancer such that they are visually indistinguishable. For U-Net it is possible that smaller lesion segmentations in non-sPC situations reflect a stricter focus of U-Net on the lesion core which may in fact exhibit a size difference between sPC and non-sPC lesions, while human raters include more of the full lesion which may exhibit a similar size on average.
Our study has several limitations. The histopathological standard of reference was MR-TRUS fusion biopsy and not radical prostatectomy (RP). However, the sensitivity of the extended systematic and targeted biopsy performed here has previously been shown to detect 97 % of sPC compared to RP specimen [12]. In addition, only a cohort based on biopsies can encompass all men that are important to consider in a screening setting of men with suspicion for sPC. Any RP cohort will exclude many screening-relevant men, as it only focuses on men with biopsy-proven surgery-eligible sPC. The study design was retrospective. However, all eligible patients undergoing MRI and fusion biopsy in the inclusion interval were analyzed.
#
Conclusion
We find that human experts with various levels of training do not perfectly agree with respect to their delineation of lesions that they otherwise jointly detect, reflecting the difficulty of the visual task of lesion detection in the prostate. The examined CNN had a lower Dice coefficient compared to human raters than human raters compared to one another. Importantly, the performance of CNN in the prediction of sPC is not affected by this finding and has been validated to be comparable to clinical PI-RADS interpretation before. The remaining difference is not a primary training criterion but could be incorporated in future CNN training to assure more agreement of intermediate results such as segmentations that contribute to final diagnostic assessment. It may also be the result of U-Net focusing more on the lesion core and some contribution from remaining misregistration between T2 and DWI images, which would be large enough to affect the Dice coefficient but small enough not to impede successful lesion detection. Our study provides an example of comparative performance evaluation of CNN to human operators which may be used in addition to common comparison metrics such as precision and recall in future studies.
#
#
Conflict of Interest
Patrick Schelb, Anoshirwan Andrej Tavakoli, Teeravut Tubtawee, Thomas Hielscher have nothing to declare. Jan Philipp Radtke declares payment for consultant work from Saegeling Medizintechnik, Siemens Heathineers and for development of educational presentations from Saegeling Medizintechnik. Magdalena Görtz, Viktoria Schütz, Tristan Anselm Kuder, Lars Schimmöller have nothing to declare. Albrecht Stenzinger declares: Consulting fee and payment for lectures: Astra Zeneca, BMS, Novartis, Roche, Illumina, Thermo Fisher. Travel support: Astra Zeneca, BMS, Novartis, Illumina, Thermo Fisher. Board Member: Astra Zeneca, BMS, Novartis, Thermo Fisher. Markus Hohenfellner has nothing to declare. Heinz-Peter Schlemmer declares: Consulting fee, payment for lectures and travel support: Siemens Healthineers, Curagita, Profound, Bayer Vital. David Bonekamp declares. Consulting fee or honorarium: Bayer Vital, Payment for lectures: Bayer Vital.
-
References
- 1 Radtke JP, Kuru TH, Boxler S. et al Comparative analysis of transperineal template saturation prostate biopsy versus magnetic resonance imaging targeted biopsy with magnetic resonance imaging-ultrasound fusion guidance. J Urol 2015; 193: 87-94
- 2 Siddiqui MM, Rais-Bahrami S, Truong H. et al Magnetic resonance imaging/ultrasound-fusion biopsy significantly upgrades prostate cancer versus systematic 12-core transrectal ultrasound biopsy. Eur Urol 2013; 64: 713-719
- 3 Ahmed HU, El-Shater Bosaily A, Brown LC. et al Diagnostic accuracy of multi-parametric MRI and TRUS biopsy in prostate cancer (PROMIS): a paired validating confirmatory study. Lancet (London, England) 2017; 389: 815-822
- 4 Kasivisvanathan V, Rannikko AS, Borghi M. et al MRI-Targeted or Standard Biopsy for Prostate-Cancer Diagnosis. The New England journal of medicine 2018; 378: 1767-1777
- 5 Bonekamp D, Schelb P, Wiesenfarth M. et al Histopathological to multiparametric MRI spatial mapping of extended systematic sextant and MR/TRUS-fusion-targeted biopsy of the prostate. European radiology 2018;
- 6 Stabile A, Dell'Oglio P, De Cobelli F. et al Association Between Prostate Imaging Reporting and Data System (PI-RADS) Score for the Index Lesion and Multifocal, Clinically Significant Prostate Cancer. Eur Urol Oncol 2018; 1: 29-36
- 7 Padhani AR, Weinreb J, Rosenkrantz AB. et al Prostate Imaging-Reporting and Data System Steering Committee: PI-RADS v2 Status Update and Future Directions. Eur Urol 2018; 75: 385-396
- 8 Weinreb JC, Barentsz JO, Choyke PL. et al PI-RADS Prostate Imaging – Reporting and Data System: 2015, Version 2. Eur Urol 2016; 69: 16-40
- 9 Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In, Advances in neural information processing systems 2012: 1097-1105
- 10 Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In, International Conference on Medical image computing and computer-assisted intervention Springer; 2015: 234-241
- 11 Schelb P, Kohl S, Radtke JP. et al Classification of Cancer at Prostate MRI: Deep Learning versus Clinical PI-RADS Assessment. Radiology 2019; 293: 607-617
- 12 Yoo S, Gujrathi I, Haider MA. et al Prostate Cancer Detection using Deep Convolutional Neural Networks. Scientific reports 2019; 9: 19518
- 13 Sanford T, Harmon SA, Turkbey EB. et al Deep-Learning-Based Artificial Intelligence for PI-RADS Classification to Assist Multiparametric Prostate MRI Interpretation: A Development Study. J Magn Reson Imaging 2020;
- 14 Bonekamp D, Jacobs MA, El-Khouli R. et al Advancements in MR imaging of the prostate: from diagnosis to interventions. Radiographics 2011; 31: 677-703
- 15 Campanella G, Hanna MG, Geneslaw L. et al Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med 2019; 25: 1301-1309
- 16 Wang Z, Liu C, Cheng D. et al Automated Detection of Clinically Significant Prostate Cancer in mp-MRI Images Based on an End-to-End Deep Neural Network. IEEE Trans Med Imaging 2018; 37: 1127-1139
- 17 Dice LR. Measures of the amount of ecologic association between species. Ecology 1945; 26: 297-302
- 18 Greer MD, Brown AM, Shih JH. et al Accuracy and agreement of PIRADSv2 for prostate cancer mpMRI: A multireader study. J Magn Reson Imaging 2017; 45: 579-585
- 19 Barentsz JO, Richenberg J, Clements R. et al ESUR prostate MR guidelines 2012. European radiology 2012; 22: 746-757
- 20 Rothke M, Blondin D, Schlemmer HP. et al [PI-RADS classification: structured reporting for MRI of the prostate]. Rofo 2013; 185: 253-261
- 21 Radtke JP, Schwab C, Wolf MB. et al Multiparametric Magnetic Resonance Imaging (MRI) and MRI-Transrectal Ultrasound Fusion Biopsy for Index Tumor Detection: Correlation with Radical Prostatectomy Specimen. Eur Urol 2016; 70: 846-853
- 22 Kuru TH, Wadhwa K, Chang RT. et al Definitions of terms, processes and a minimum dataset for transperineal prostate biopsies: a standardization approach of the Ginsburg Study Group for Enhanced Prostate Diagnostics. BJU Int 2013; 112: 568-577
- 23 Fritzsche KH, Neher PF, Reicht I. et al MITK diffusion imaging. Methods Inf Med 2012; 51: 441-448
- 24 Nolden M, Zelzer S, Seitel A. et al The Medical Imaging Interaction Toolkit: challenges and advances: 10 years of open-source development. Int J Comput Assist Radiol Surg 2013; 8: 607-620
- 25 Kuru TH, Wadhwa K, Chang RTM. et al Definitions of terms, processes and a minimum dataset for transperineal prostate biopsies: a standardization approach of the G insburg S tudy G roup for E nhanced P rostate D iagnostics. BJU international 2013; 112: 568-577
- 26 Egevad L, Delahunt B, Srigley JR. et al International Society of Urological Pathology (ISUP) grading of prostate cancer – An ISUP consensus on contemporary grading. APMIS 2016; 124: 433-435
- 27 Team RC. R: A language and environment for statistical computing. Vienna, Austria: 2013
- 28 Bossuyt PM, Reitsma JB, Bruns DE. et al Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Radiology 2003; 226: 24-28
- 29 Litjens G, Toth R, van de Ven W. et al Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Medical image analysis 2014; 18: 359-373
- 30 Bonekamp D, Kohl S, Wiesenfarth M. et al Radiomic Machine Learning for Characterization of Prostate Lesions with MRI: Comparison to ADC Values. Radiology 2018; 289: 128-137
- 31 Turkbey B, Rosenkrantz AB, Haider MA. et al Prostate Imaging Reporting and Data System Version 2.1: 2019 Update of Prostate Imaging Reporting and Data System Version 2. Eur Urol 2019; 76: 340-351
- 32 Gunning D. Explainable artificial intelligence (xai). Defense Advanced Research Projects Agency (DARPA), nd Web 2017; 2: 2
Correspondence
Publication History
Received: 15 July 2020
Accepted: 29 September 2020
Article published online:
19 November 2020
© 2020. Thieme. All rights reserved.
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
-
References
- 1 Radtke JP, Kuru TH, Boxler S. et al Comparative analysis of transperineal template saturation prostate biopsy versus magnetic resonance imaging targeted biopsy with magnetic resonance imaging-ultrasound fusion guidance. J Urol 2015; 193: 87-94
- 2 Siddiqui MM, Rais-Bahrami S, Truong H. et al Magnetic resonance imaging/ultrasound-fusion biopsy significantly upgrades prostate cancer versus systematic 12-core transrectal ultrasound biopsy. Eur Urol 2013; 64: 713-719
- 3 Ahmed HU, El-Shater Bosaily A, Brown LC. et al Diagnostic accuracy of multi-parametric MRI and TRUS biopsy in prostate cancer (PROMIS): a paired validating confirmatory study. Lancet (London, England) 2017; 389: 815-822
- 4 Kasivisvanathan V, Rannikko AS, Borghi M. et al MRI-Targeted or Standard Biopsy for Prostate-Cancer Diagnosis. The New England journal of medicine 2018; 378: 1767-1777
- 5 Bonekamp D, Schelb P, Wiesenfarth M. et al Histopathological to multiparametric MRI spatial mapping of extended systematic sextant and MR/TRUS-fusion-targeted biopsy of the prostate. European radiology 2018;
- 6 Stabile A, Dell'Oglio P, De Cobelli F. et al Association Between Prostate Imaging Reporting and Data System (PI-RADS) Score for the Index Lesion and Multifocal, Clinically Significant Prostate Cancer. Eur Urol Oncol 2018; 1: 29-36
- 7 Padhani AR, Weinreb J, Rosenkrantz AB. et al Prostate Imaging-Reporting and Data System Steering Committee: PI-RADS v2 Status Update and Future Directions. Eur Urol 2018; 75: 385-396
- 8 Weinreb JC, Barentsz JO, Choyke PL. et al PI-RADS Prostate Imaging – Reporting and Data System: 2015, Version 2. Eur Urol 2016; 69: 16-40
- 9 Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In, Advances in neural information processing systems 2012: 1097-1105
- 10 Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In, International Conference on Medical image computing and computer-assisted intervention Springer; 2015: 234-241
- 11 Schelb P, Kohl S, Radtke JP. et al Classification of Cancer at Prostate MRI: Deep Learning versus Clinical PI-RADS Assessment. Radiology 2019; 293: 607-617
- 12 Yoo S, Gujrathi I, Haider MA. et al Prostate Cancer Detection using Deep Convolutional Neural Networks. Scientific reports 2019; 9: 19518
- 13 Sanford T, Harmon SA, Turkbey EB. et al Deep-Learning-Based Artificial Intelligence for PI-RADS Classification to Assist Multiparametric Prostate MRI Interpretation: A Development Study. J Magn Reson Imaging 2020;
- 14 Bonekamp D, Jacobs MA, El-Khouli R. et al Advancements in MR imaging of the prostate: from diagnosis to interventions. Radiographics 2011; 31: 677-703
- 15 Campanella G, Hanna MG, Geneslaw L. et al Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med 2019; 25: 1301-1309
- 16 Wang Z, Liu C, Cheng D. et al Automated Detection of Clinically Significant Prostate Cancer in mp-MRI Images Based on an End-to-End Deep Neural Network. IEEE Trans Med Imaging 2018; 37: 1127-1139
- 17 Dice LR. Measures of the amount of ecologic association between species. Ecology 1945; 26: 297-302
- 18 Greer MD, Brown AM, Shih JH. et al Accuracy and agreement of PIRADSv2 for prostate cancer mpMRI: A multireader study. J Magn Reson Imaging 2017; 45: 579-585
- 19 Barentsz JO, Richenberg J, Clements R. et al ESUR prostate MR guidelines 2012. European radiology 2012; 22: 746-757
- 20 Rothke M, Blondin D, Schlemmer HP. et al [PI-RADS classification: structured reporting for MRI of the prostate]. Rofo 2013; 185: 253-261
- 21 Radtke JP, Schwab C, Wolf MB. et al Multiparametric Magnetic Resonance Imaging (MRI) and MRI-Transrectal Ultrasound Fusion Biopsy for Index Tumor Detection: Correlation with Radical Prostatectomy Specimen. Eur Urol 2016; 70: 846-853
- 22 Kuru TH, Wadhwa K, Chang RT. et al Definitions of terms, processes and a minimum dataset for transperineal prostate biopsies: a standardization approach of the Ginsburg Study Group for Enhanced Prostate Diagnostics. BJU Int 2013; 112: 568-577
- 23 Fritzsche KH, Neher PF, Reicht I. et al MITK diffusion imaging. Methods Inf Med 2012; 51: 441-448
- 24 Nolden M, Zelzer S, Seitel A. et al The Medical Imaging Interaction Toolkit: challenges and advances: 10 years of open-source development. Int J Comput Assist Radiol Surg 2013; 8: 607-620
- 25 Kuru TH, Wadhwa K, Chang RTM. et al Definitions of terms, processes and a minimum dataset for transperineal prostate biopsies: a standardization approach of the G insburg S tudy G roup for E nhanced P rostate D iagnostics. BJU international 2013; 112: 568-577
- 26 Egevad L, Delahunt B, Srigley JR. et al International Society of Urological Pathology (ISUP) grading of prostate cancer – An ISUP consensus on contemporary grading. APMIS 2016; 124: 433-435
- 27 Team RC. R: A language and environment for statistical computing. Vienna, Austria: 2013
- 28 Bossuyt PM, Reitsma JB, Bruns DE. et al Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Radiology 2003; 226: 24-28
- 29 Litjens G, Toth R, van de Ven W. et al Evaluation of prostate segmentation algorithms for MRI: the PROMISE12 challenge. Medical image analysis 2014; 18: 359-373
- 30 Bonekamp D, Kohl S, Wiesenfarth M. et al Radiomic Machine Learning for Characterization of Prostate Lesions with MRI: Comparison to ADC Values. Radiology 2018; 289: 128-137
- 31 Turkbey B, Rosenkrantz AB, Haider MA. et al Prostate Imaging Reporting and Data System Version 2.1: 2019 Update of Prostate Imaging Reporting and Data System Version 2. Eur Urol 2019; 76: 340-351
- 32 Gunning D. Explainable artificial intelligence (xai). Defense Advanced Research Projects Agency (DARPA), nd Web 2017; 2: 2