Ultraschall Med 2024; 45(04): 412-417
DOI: 10.1055/a-2230-2455
Original Article

Artificial intelligence for ultrasound microflow imaging in breast cancer diagnosis

Na Lae Eun
1 Radiology, Gangnam Severance Hospital, Seoul, Republic of Korea

Eunjung Lee
2 Computational Science and Engineering, Yonsei University, Seoul, Republic of Korea

3 Radiology, Bundang CHA Medical Center, Seongnam, Republic of Korea

Eun Ju Son
1 Radiology, Gangnam Severance Hospital, Seoul, Republic of Korea

Jeong-Ah Kim
1 Radiology, Gangnam Severance Hospital, Seoul, Republic of Korea

Ji Hyun Youk
4 Department of Radiology, Yonsei University College of Medicine, Seoul, Republic of Korea
1 Radiology, Gangnam Severance Hospital, Seoul, Republic of Korea

Abstract

Purpose To develop and evaluate artificial intelligence (AI) algorithms for ultrasound (US) microflow imaging (MFI) in breast cancer diagnosis.

Materials and Methods We retrospectively collected a dataset of 516 breast lesions (364 benign and 152 malignant) in 471 women who underwent B-mode US and MFI. The internal dataset was split into training (n = 410) and test (n = 106) sets for developing AI algorithms based on deep convolutional neural networks applied to MFI. The AI algorithms were trained to output a malignancy risk (0–100%). The developed AI algorithms were further validated with an independent external dataset of 264 lesions (229 benign and 35 malignant). The diagnostic performance of B-mode US, the AI algorithms, and their combinations was evaluated by calculating the area under the receiver operating characteristic curve (AUROC).

Results The AUROC of the three developed AI algorithms (0.955–0.966) was higher than that of B-mode US (0.842, P < 0.0001). The AUROC of the AI algorithms on the external validation dataset (0.892–0.920) was similar to that on the test dataset. Among the AI algorithms, no significant difference was found in any performance metric, with or without the addition of B-mode US. The combined B-mode US and AI algorithms had a higher AUROC (0.963–0.972) than B-mode US alone (P < 0.0001). Combining B-mode US and the AI algorithms significantly decreased the false-positive rate for BI-RADS category 4A lesions from 87% to 13% (P < 0.0001).

Conclusion AI-based MFI diagnosed breast cancers with better performance than B-mode US, eliminating 74% of false-positive diagnoses in BI-RADS category 4A lesions.



Introduction

Breast ultrasound (US) is an invaluable imaging modality in breast cancer diagnosis, either as a first-line examination or in addition to mammography. However, morphologic features of benign and malignant lesions can overlap in breast US, increasing the false-positive rate [1]. The high recall rate and low positive predictive value (PPV) of breast US biopsies encourage the exploration of additional techniques to reduce unnecessary recalls and biopsies [2].

Cancer development and progression can stem from angiogenesis, the proliferation of immature tumor vessels in response to hypoxia [3]. Therefore, clinical evaluation of tumor vascularity can assist in diagnosing breast cancer and selecting a management plan [4]. Color and power Doppler imaging, as widely available US techniques in addition to B-mode US, can provide a rough representation of angiogenesis through increased vascularity and irregular or penetrating vessels within breast cancer [5]. However, conventional Doppler imaging can detect tumoral vascularization only in large vessels or relatively fast blood flow [6], because the wall filters used to remove tissue motion artifacts also suppress the signals of small vessels and slow flow, whose frequency shifts overlap with those of tissue motion [7]. To overcome these limitations, a new US Doppler technique, microflow imaging (MFI), was recently developed and implemented to visualize small vessels and slow blood flow signals without using contrast agents [8]. MFI uses a multidimensional filter to selectively remove artifacts while preserving slow flow signals [8].
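To make the filtering idea concrete, the sketch below is a generic illustration of multidimensional clutter suppression, not the vendor's proprietary MFI algorithm: a singular-value decomposition of the slow-time ensemble removes the highest-energy tissue components, whereas a purely frequency-based wall filter would also discard slow-flow signals. The array shape and rank cutoff are assumptions for illustration.

```python
import numpy as np

def svd_clutter_filter(iq_ensemble, n_tissue_components=2):
    """Suppress tissue clutter in a Doppler ensemble of shape (pixels, frames).

    The dominant singular components (high-energy, slowly varying tissue)
    are zeroed out, leaving the low-energy residual that contains slow flow.
    """
    u, s, vh = np.linalg.svd(iq_ensemble, full_matrices=False)
    s_filtered = s.copy()
    s_filtered[:n_tissue_components] = 0.0  # drop dominant tissue/clutter modes
    return (u * s_filtered) @ vh            # residual slow-flow signal
```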

Although breast US is one of the most effective imaging modalities due to its widespread availability, excellent safety profile, and ease of use, it is inevitably subjective and operator-dependent, resulting in interobserver variability [9] [10]. Furthermore, assessing the complex vascular patterns and quantitative measurements of breast lesions with Doppler US can be clinically challenging. This complexity hinders the establishment of guidelines or recommendations for the optimal classification of vascularity patterns [9] [11] [12]. Artificial intelligence (AI) using deep learning-based neural networks has recently emerged to learn from trainable datasets and potentially perform various imaging tasks on medical data [11]. AI algorithms can identify complex shapes in images and quantify radiographic information that is imperceptible to humans [13]. A few attempts have been made to evaluate breast lesions by applying AI to color Doppler US [14] [15] [16] [17] [18]. However, no study has yet applied AI to MFI data. Thus, we aimed to develop and evaluate AI algorithms for US MFI in breast cancer diagnosis.



Materials and Methods

Patients and Datasets

Our Institutional Review Board approved this retrospective study, and informed consent was waived. We enrolled a total of 516 lesions (364 benign and 152 malignant) in 471 women (mean age: 49.5 ± 10.2 years; age range: 25–78 years). The lesions had been pathologically confirmed through biopsy or categorized as Breast Imaging Reporting and Data System (BI-RADS) category 2 or 3. The patients underwent breast US and MFI between November 2019 and July 2021 at a tertiary hospital (Gangnam Severance Hospital) and were followed up for a minimum of two years. Doppler US images are routinely obtained at our institution for representative breast lesions during breast US examinations. Patients were randomly selected to undergo MFI based on the availability of the US machine equipped with MFI capabilities in the examination schedule. Of these lesions, 410 (290 benign and 120 malignant) were allocated to the training set. Data augmentation was applied to the malignant training data by left-right and up-down flipping to balance the training set, as sketched below. In total, 594 US images (290 benign and 290 malignant lesions) were used as training data ([Fig. 1]). For the test set, 106 lesions (74 benign and 32 malignant) were allocated using computer-generated random numbers ([Fig. 1]). For the external test set, an independent dataset of 264 breast lesions (229 benign and 35 malignant) in 236 women (mean age: 49.0 ± 10.6 years; age range: 20–77 years) was obtained at a tertiary medical center (Bundang CHA Medical Center) from July 2019 to October 2021 under the same inclusion criteria ([Fig. 1]). A power calculation under the hypothesis that the AUC would exceed 0.7 yielded power ranging from 73% to 99%, confirming that the sample size was sufficient.
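The snippet below is a minimal sketch, not the authors' code, of the two dataset operations described above: a random lesion-level split into training and test sets and the left-right/up-down flip augmentation of malignant training images. File handling, the random seed, and the number of flipped copies per image are assumptions.

```python
import random
import numpy as np
from PIL import Image

def split_lesions(lesions, n_test=106, seed=0):
    """Randomly allocate lesions to training and test sets."""
    rng = random.Random(seed)
    shuffled = list(lesions)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # training set, test set

def flip_augment(image_path):
    """Return the original, left-right flipped, and up-down flipped images."""
    arr = np.asarray(Image.open(image_path))
    return [arr, np.fliplr(arr), np.flipud(arr)]
```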

Fig. 1 Study cohorts.


Image acquisition

US examinations were performed with one US device (EPIQ Elite; Philips Healthcare, Bothell, WA) using a 4–18-MHz linear transducer by one board-certified radiologist with 17 years of breast imaging experience. Two or more orthogonal B-mode US images were obtained for each breast lesion. After a representative nodule was selected on the B-mode images, MFI was performed with minimal transducer compression to visualize low-velocity flow. The default imaging parameters for MFI included a velocity scale of 2.5 cm/s, a wall filter of 59 Hz, and a dynamic range of 62 dB. These parameters were individually adjusted for each tumor to acquire the best images.



Development of deep convolutional neural network

Each US image was stored as a JPEG file in the picture archiving and communication system. A radiologist drew square regions of interest (ROIs) using Adobe Photoshop (ver. 23.5.0, Adobe Inc.). The ROIs covered the entire nodule and blood flow signals. Only the ROIs were used as input data, which improved algorithm performance by discarding irrelevant information during deep learning. For the ROI extraction process ([Fig. 2]), we first collected the precise position of the red ROI boundary. Next, the collected ROI boundary information was applied to a duplicate US image without the ROI border, because marking the ROI with a colored pen can alter the intensity of the US image. In this way, we could block interference from the colored ROI markings while the convolutional neural network (CNN) learned the image features. Our AI algorithms used transfer learning, adopting CNNs pre-trained on a large set of non-medical images. Fine-tuning these pre-trained networks on the included cohorts mitigated the limited number of cases while exploiting features learned from large-scale data [19]. The 17 CNNs used in this study were AlexNet, SqueezeNet, VGG16, VGG19, GoogLeNet, ResNet18, ResNet50, ResNet101, Inception-v3, DenseNet201, MobileNet-v2, Xception, ShuffleNet, DarkNet19, DarkNet53, EfficientNetB0, and InceptionResNetV2 [20] [21] [22] [23] [24] [25] [26] [27]. Based on the performance of individual CNNs and of classification ensembles evaluated on validation data, we decided to employ three distinct classification ensembles: CNNE1 (AlexNet, ResNet50, ResNet101, and Xception), CNNE2 (Inception-v3, ResNet50, ResNet101, and Xception), and CNNE3 (SqueezeNet, GoogLeNet, ResNet50, and ResNet101). To determine the composition of these ensembles, we used a 10% randomly chosen subset of the training data to validate the performance of various combinations, including single CNNs and combinations of multiple CNNs. The results indicated that combinations of four or more CNNs generally outperformed a single CNN or combinations of two or three CNNs, while combining more than four CNNs delivered performance on par with four. All three ensemble models follow the same strategy shown in [Fig. 2].
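As a conceptual illustration of this ensemble strategy (an assumed sketch, not the authors' implementation): each ImageNet-pretrained backbone is fine-tuned with a two-class output layer, and the softmax malignancy probabilities of the member networks are averaged into a single risk estimate. The backbones below mirror CNNE1 except Xception, which is not available in torchvision and is omitted here; all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_backbone(name):
    """Load an ImageNet-pretrained CNN and replace its classifier with a 2-class head."""
    if name == "alexnet":
        net = models.alexnet(weights="IMAGENET1K_V1")
        net.classifier[6] = nn.Linear(net.classifier[6].in_features, 2)
    elif name == "resnet50":
        net = models.resnet50(weights="IMAGENET1K_V1")
        net.fc = nn.Linear(net.fc.in_features, 2)
    elif name == "resnet101":
        net = models.resnet101(weights="IMAGENET1K_V1")
        net.fc = nn.Linear(net.fc.in_features, 2)
    else:
        raise ValueError(f"unsupported backbone: {name}")
    return net

class MalignancyEnsemble(nn.Module):
    """Average the softmax malignancy probabilities of several fine-tuned CNNs."""

    def __init__(self, names=("alexnet", "resnet50", "resnet101")):
        super().__init__()
        self.nets = nn.ModuleList(build_backbone(n) for n in names)

    def forward(self, x):
        # x: batch of ROI images, shape (N, 3, H, W); output: malignancy risk in [0, 1]
        probs = [torch.softmax(net(x), dim=1)[:, 1] for net in self.nets]
        return torch.stack(probs, dim=0).mean(dim=0)
```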

Fig. 2 a The image acquisition process and b the structure of CNN with fine-tuning. a In the ROI extraction, we collected location information from the color boundary and duplicated US images without ROI marking. b Then, each CNN was fine-tuned with training images, and the last fully connected layers produced two output probability values.


Statistical Analysis

We collected each breast lesion's final diagnosis, as recorded in each institution's electronic medical records. Cancer probabilities were calculated using the CNNs and presented as continuous values between 0 and 1. The diagnostic performance of B-mode US, the AI algorithms, and their combinations was calculated and compared. The area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, PPV, negative predictive value (NPV), accuracy, and F-score were determined, with the cut-off value set using the Youden index for the AI algorithms and BI-RADS category 4A for B-mode US. Because the AI algorithms (CNNE1, CNNE2, and CNNE3) are highly correlated, multivariable logistic regression analyses were performed to calculate the malignancy risk of lesions for the combinations of B-mode US and AI algorithms. To develop nomograms, the three models consisted of the significant variables with their intercepts and β coefficients from the multivariable logistic regression. The probability S was calculated as

S = exp(A) / (1 + exp(A)),
where A depends on the model as follows: A = -5.11 + 5.517 × CNNE1 + 2.611 × B-mode US (for the combination of B-mode US and CNNE1), A = -5.189 + 5.315 × CNNE2 + 2.723 × B-mode US (for the combination of B-mode US and CNNE2), and A = -5.233 + 5.841 × CNNE3 + 2.773 × B-mode US (for the combination of B-mode US and CNNE3).
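As a worked example (a hypothetical sketch using the reported coefficients, not the authors' code), the combined malignancy risk for B-mode US plus CNNE1 can be computed as follows; the binary encoding of the B-mode US term (1 for a suspicious assessment, 0 otherwise) is an assumption.

```python
import math

def combined_malignancy_risk(cnne1_prob, bmode_positive):
    """S = exp(A) / (1 + exp(A)) with the B-mode US + CNNE1 coefficients."""
    a = -5.11 + 5.517 * cnne1_prob + 2.611 * (1.0 if bmode_positive else 0.0)
    return math.exp(a) / (1.0 + math.exp(a))

# Example: an AI probability of 0.9 with a suspicious B-mode assessment
# gives A = 2.466 and S ≈ 0.92.
print(round(combined_malignancy_risk(0.9, True), 2))
```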

Statistical analysis was performed using R version 4.1.1 (http://www.R-project.org). P-values < 0.05 were considered statistically significant.
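For illustration only (a sketch in Python rather than the authors' R workflow, with placeholder labels and scores), the AUROC and the Youden-index cut-off described above could be obtained as follows.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auroc_and_youden_cutoff(y_true, y_score):
    """Return the AUROC and the threshold maximizing J = sensitivity + specificity - 1."""
    auroc = roc_auc_score(y_true, y_score)
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    best = int(np.argmax(tpr - fpr))  # Youden index J = TPR - FPR
    return auroc, thresholds[best]

# Placeholder example: 1 = malignant, 0 = benign; scores are CNN probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.6])
print(auroc_and_youden_cutoff(y_true, y_score))
```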



Results

The baseline characteristics of the datasets are summarized in Supplementary Table S1. The mean age and mean lesion size of the training, test, and external sets were 49–49.5 years and 12.8–12.9 mm, respectively. The diagnostic performance of B-mode US, the three AI algorithms (CNNE1, CNNE2, and CNNE3), and their combinations in the test set is listed in [Table 1]. All three developed algorithms showed significantly higher specificity (93.2–94.6% vs. 71.6%, P ≤ 0.0001), accuracy (92.5% vs. 79.2%, P ≤ 0.003), PPV (85.3–87.5% vs. 59.6%, P < 0.001), and AUROC (0.955–0.966 vs. 0.842, P < 0.0001) than B-mode US (Supplementary Fig. S1a). No significant difference was found in sensitivity (87.5–90.6% vs. 96.9%) or NPV (94.6–95.8% vs. 98.1%) between each AI algorithm and B-mode US. The combined B-mode US and AI algorithms also performed better than B-mode US with respect to specificity (95.9% vs. 71.6%, P < 0.0001), accuracy (94.3% vs. 79.2%, P < 0.001), PPV (90.6% vs. 59.6%, P < 0.0001), and AUROC (0.963–0.972 vs. 0.842, P < 0.0001, Supplementary Fig. S1b). Sensitivity (90.6% vs. 96.9%) and NPV (95.9% vs. 98.1%) were not significantly different between each combination and B-mode US. The F-scores of CNNE1 and CNNE2 (87.9% vs. 73.8%, P < 0.05) and of the combined B-mode US and AI algorithms (90.6% vs. 73.8%, P = 0.01) were also significantly higher than that of B-mode US ([Table 1]). There was no significant difference in any diagnostic index among the three AI algorithms and the combined B-mode US and AI algorithms.

Table 1 Diagnostic performance of B-mode ultrasound, AI algorithms, and their combinations in the internal test set.

| Variables | AUROC | Sensitivity (%) | Specificity (%) | Accuracy (%) | PPV (%) | NPV (%) | F-score (%) |
|---|---|---|---|---|---|---|---|
| B-mode US | 0.842 (0.784, 0.901) | 96.9 (90.6, 100) | 71.6 (61.5, 81.7) | 79.2 (71.8, 86.7) | 59.6 (46.5, 72.7) | 98.1 (94.4, 100) | 73.8 (63.2, 84.5) |
| CNNE1 | 0.966 (0.936, 0.997), P < 0.0001 | 90.6 (80.1, 100), P = 0.17 | 93.2 (87.6, 98.9), P = 0.0001 | 92.5 (87.4, 97.5), P = 0.002 | 85.3 (73.3, 97.3), P = 0.0002 | 95.8 (91.0, 100), P = 0.26 | 87.9 (79.3, 96.4), P = 0.047 |
| CNNE2 | 0.964 (0.932, 0.996), P < 0.0001 | 90.6 (80.1, 100), P = 0.17 | 93.2 (87.6, 98.9), P = 0.0001 | 92.5 (87.4, 97.5), P = 0.0001 | 85.3 (73.3, 97.3), P = 0.0002 | 95.8 (91.0, 100), P = 0.26 | 87.9 (79.6, 96.2), P = 0.04 |
| CNNE3 | 0.955 (0.912, 0.998), P < 0.0001 | 87.5 (75.3, 99.7), P = 0.08 | 94.6 (89.6, 99.6), P < 0.0001 | 92.5 (87.3, 97.6), P = 0.003 | 87.5 (75.9, 99.1), P = 0.0001 | 94.6 (89.1, 100), P = 0.14 | 87.5 (78.5, 96.5), P = 0.054 |
| B-mode + CNNE1 | 0.972 (0.940, 1), P < 0.0001 | 90.6 (80.1, 100), P = 0.17 | 95.9 (91.5, 100), P < 0.0001 | 94.3 (89.9, 98.8), P = 0.0001 | 90.6 (80.6, 100), P < 0.0001 | 95.9 (91.3, 100), P = 0.27 | 90.6 (83.2, 98.0), P = 0.01 |
| B-mode + CNNE2 | 0.970 (0.938, 1), P < 0.0001 | 90.6 (80.1, 100), P = 0.17 | 95.9 (91.5, 100), P < 0.0001 | 94.3 (89.9, 98.8), P = 0.0001 | 90.6 (80.6, 100), P < 0.0001 | 95.9 (91.3, 100), P = 0.27 | 90.6 (83.0, 98.2), P = 0.01 |
| B-mode + CNNE3 | 0.963 (0.918, 1), P < 0.0001 | 90.6 (79.9, 100), P = 0.16 | 95.9 (91.5, 100), P < 0.0001 | 94.3 (89.8, 98.9), P = 0.0001 | 90.6 (80.5, 100), P < 0.0001 | 95.9 (91.1, 100), P = 0.27 | 90.6 (83.0, 98.2), P = 0.01 |

AUROC: area under the receiver operating characteristic curve, PPV: positive predictive value, NPV: negative predictive value. Values in parentheses are 95% confidence intervals. P-values are for comparison with B-mode US.

The diagnostic performance of the three AI algorithms and their combinations on the external validation set is presented in Supplementary Table S2. The performance metrics of the AI algorithms on the external validation dataset were similar to those of the test dataset regarding sensitivity (81.5–88.9%), specificity (91.5–97.6%), accuracy (91.2–95.8%), PPV (57.1–81.5%), NPV (97.5–98.5%), and AUROC (0.892–0.920).

Concerning the false-positive rates in BI-RADS category 4A (Supplementary Table S3), B-mode US alone showed a false-positive rate of 87.0% (20 of 23). For the AI algorithms, the false-positive rate was 13.0% for CNNE1 and CNNE2 and 8.7% for CNNE3, significantly lower than that of B-mode US (P < 0.0001). After combining B-mode US with the AI algorithms, the false-positive rate significantly decreased to 13.0% compared with B-mode US alone (P < 0.0001) (Supplementary Table S4). Supplementary Figs. S2 and S3 show true-negative and true-positive cases of the AI algorithms in assessing BI-RADS category 4A lesions.



Discussion

Our study showed that all three AI algorithms demonstrated significantly higher specificity, accuracy, PPV, AUROC, and F-score than B-mode US for breast cancer diagnosis on MFI. The combined B-mode US and AI algorithms also outperformed B-mode US alone in the same metrics. Furthermore, the false-positive rate in BI-RADS category 4A was notably reduced when B-mode US was combined with the AI algorithms, showing their potential to improve diagnostic accuracy.

In breast US, most studies of non-contrast-enhanced microvascular flow Doppler techniques have focused on Superb Microvascular Imaging (SMI, Canon Medical), and to date no studies have been performed using MFI. In the literature, SMI showed better diagnostic performance than conventional Doppler imaging for distinguishing between benign and malignant breast lesions. For qualitative parameters, including complex vessel morphology or distribution, B-mode US combined with SMI performed significantly better than color or power Doppler US (AUROC: 0.815–0.852 vs. 0.73–0.778) [28] [29]. Similarly, a quantitative analysis of SMI, including the vascularity score, demonstrated superior performance to color Doppler US (AUROC: 0.81 vs. 0.73) [30]. Furthermore, SMI exhibited diagnostic performance comparable to contrast-enhanced US (AUROC: 0.853 vs. 0.841) [31].

However, interpreting vascular networks is complicated and challenging, especially with advanced Doppler US techniques. An inherent limitation of Doppler US is its considerable intra- and inter-reader variability, due primarily to the intricacy of vascular patterns and the limited options for quantitative evaluation. With AI's rapidly growing application in US, studies have investigated AI in Doppler US to overcome these limitations. In a previous study of deep learning models for breast B-mode US, color Doppler US, and power Doppler US, CNN classification using the combination of B-mode and color Doppler US showed significantly higher accuracy than radiologists (89.2% vs. 30%). However, no difference was found between CNNs using B-mode US alone vs. B-mode US combined with color and power Doppler US (87.9% vs. 89.2% and 88.7%, respectively) [17]. In other studies, AI with combined B-mode US and color Doppler achieved excellent diagnostic performance, with AUROCs ranging from 0.955 to 0.982 [14] [15] [16]. However, no research has yet investigated the application of AI algorithms to advanced microvascular flow Doppler US techniques.

Our results demonstrated that the use of AI in MFI achieved better performance than B-mode US (AUROC: 0.955–0.966 vs. 0.842, P < 0.0001). The performance of our AI-based MFI in classifying breast lesions was also superior to that reported for SMI without the aid of AI (AUROC: 0.815–0.912) [28] [29] [32]. Our results for AI algorithms applied to MFI were consistent with previous studies applying AI to conventional color Doppler (AUROC: 0.955–0.982) [14] [15] [16], suggesting that AI-based MFI can help differentiate breast lesions better than conventional microvascular flow Doppler examinations interpreted by radiologists. Our results also demonstrated no significant difference between the performance of the AI algorithms with and without B-mode US (AUROC: 0.955–0.966 vs. 0.963–0.972). This implies that AI algorithms for MFI alone might play a role in assessing breast lesions without the assistance of B-mode US. However, further investigation is necessary before AI algorithms are applied to B-mode US combined with microvascular flow techniques.

We found that applying AI decreased the false-positive rate in BI-RADS category 4A from 87% to 8.7–13%. In agreement with our results, a recent study using an AI system for B-mode and color Doppler images reported that 68.2% of false-positive findings in BI-RADS category 4A were obviated by hybrid AI-radiologist models [16]. Another study also suggested that applying an AI system to multimodal, multiview US images, including color Doppler, could reduce biopsies of benign lesions by 7% and increase biopsies of malignant lesions by 2% [15]. Our AI-based MFI likewise decreased false-positive diagnoses by 74% and showed the potential to enhance radiologists' decision-making using AI systems in MFI. Further research may be required to determine the advantages of AI systems for MFI in breast US.

We acknowledge several limitations of our study. First, as this is a retrospective study, selection bias may have occurred because radiologists randomly performed MFI on representative images at two different institutions. Second, this study did not consider clinicopathological factors other than the US images and the BI-RADS category, which may also have introduced selection bias. Third, controlling for the proportion of malignancies was difficult because all included cases were BI-RADS category 2 lesions or histologically confirmed lesions. To remedy this imbalance, we balanced the internal training set with left-right and up-down flipping (without otherwise affecting the training process). Finally, generalizability may be limited because we used only one vendor's MFI implementation. Additional research will be necessary to investigate whether other vendors' implementations show comparable diagnostic performance.

In conclusion, AI-based MFI diagnosed breast cancers with better performance than B-mode US and reduced false-positive diagnoses in BI-RADS category 4A lesions.



Conflict of Interest

The authors declare that they have no conflict of interest.

Supplementary Material


Correspondence

Dr. Ji Hyun Youk
Department of Radiology, Yonsei University College of Medicine
Eonju-ro Gangnam-gu 211
06273 Seoul
Korea, Republic of   

Publication History

Received: 15 June 2023

Accepted after revision: 06 December 2023

Article published online:
09 April 2024

© 2024. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

