Introduction
Colorectal cancer is the second leading cause of cancer death in the United States [1]. Colonoscopy screening is performed in many countries to prevent colorectal cancer by removal of precancerous polyps (adenomas and serrated polyps) [2] [3].
The effectiveness of colonoscopy screening in preventing colorectal cancer depends on a high adenoma detection rate (ADR), i.e., the proportion of colonoscopies in which at least one adenoma is detected [4]. There is considerable variation in ADR between individual colonoscopists. Patients examined by colonoscopists with a high ADR, compared to those examined by colonoscopists with a lower ADR, experience a lower risk of colorectal cancer after colonoscopy (so-called interval cancer) [4] [5].
In recent years, breakthroughs in machine learning have enabled the development of artificial intelligence (AI) computer systems that help colonoscopists detect polyps and adenomas, also known as computer-aided detection (CADe) [6]. While AI-based systems may improve detection rates of premalignant polyps and thus potentially reduce future cancer risk, they may also contribute to reduced focus on colonoscopist training, to distraction of endoscopists, and to overdiagnosis of polyps [7] [8].
We performed a systematic review and meta-analysis to ascertain the potential benefits and harms of the use of recently developed AI-based polyp detection systems during colonoscopy compared to colonoscopy without AI.
Methods
We performed a systematic search for prospective studies, including randomized trials, that directly compared colonoscopy using AI-based polyp detection systems with colonoscopy without such systems. The protocol for this systematic review and meta-analysis is registered with PROSPERO (CRD42020171860).
We included studies of screening, surveillance, or diagnostic colonoscopy in individuals ≥ 18 years old. We collected the following primary outcomes for all eligible studies:
ADR: the number of all complete colonoscopies (cecal intubation achieved) in which one or more adenomas were detected, divided by the total number of colonoscopies
Polyp detection rate (PDR): the number of all complete colonoscopies (cecal intubation achieved) in which one or more polyps were detected, divided by the total number of colonoscopies
Mean number of adenomas detected per colonoscopy
Mean number of polyps detected per colonoscopy
Mean number of advanced adenomas detected per colonoscopy, i.e., total number of advanced adenomas detected divided by the total number of colonoscopies performed. Advanced adenomas were defined according to current guidelines as size ≥ 10 mm, and/or with villous components > 20 %, and/or high-grade dysplasia [3] [6].
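As a minimal sketch of how the detection-rate and per-colonoscopy outcomes listed above could be computed, the following Python snippet works on a list of lesion counts, one entry per complete colonoscopy. The function names and the example counts are illustrative assumptions, not data or code from the included trials.

```python
from typing import Sequence


def detection_rate(lesions_per_colonoscopy: Sequence[int]) -> float:
    """Proportion of colonoscopies with at least one lesion (ADR or PDR)."""
    if not lesions_per_colonoscopy:
        raise ValueError("at least one colonoscopy is required")
    positive = sum(1 for n in lesions_per_colonoscopy if n >= 1)
    return positive / len(lesions_per_colonoscopy)


def mean_per_colonoscopy(lesions_per_colonoscopy: Sequence[int]) -> float:
    """Total number of lesions divided by the number of colonoscopies."""
    if not lesions_per_colonoscopy:
        raise ValueError("at least one colonoscopy is required")
    return sum(lesions_per_colonoscopy) / len(lesions_per_colonoscopy)


# Hypothetical adenoma counts for ten complete colonoscopies (illustrative only).
adenoma_counts = [0, 2, 0, 1, 0, 0, 3, 0, 0, 0]
adr = detection_rate(adenoma_counts)        # 3 of 10 colonoscopies had an adenoma
apc = mean_per_colonoscopy(adenoma_counts)  # 6 adenomas over 10 colonoscopies
print(f"ADR = {adr:.0%}, adenomas per colonoscopy = {apc:.2f}")
```

The two measures can diverge: a few colonoscopies with many adenomas raise the mean per colonoscopy without changing the ADR.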
Secondary outcomes were mean number of adenomas stratified by predefined polyp size groups; rates of false-positive and false-negative AI diagnoses; colorectal cancer detection rates; and withdrawal time during colonoscopy (time to withdraw the colonoscope from the cecum to the anus) [4].
Search strategy and study selection
A medical librarian searched MEDLINE, EMBASE, and the Cochrane Central Register of Controlled Trials (CENTRAL) (Appendix 1 s, see online-only Supplementary material). We did not apply any time or language restrictions. We also reviewed reference lists from all eligible studies for additional relevant studies. We searched ClinicalTrials.gov for ongoing studies, and contacted investigators for information when studies were relevant.
All citations were imported into Covidence (Covidence systematic review software, Veritas Health Innovation, Melbourne, Australia). Two independent reviewers (I.B., D.G.V.) screened all search results. First, the reviewers screened titles and abstracts of the retrieved citations. Secondly, full texts of all potentially relevant studies were examined and evaluated for further eligibility. Thirdly, the two reviewers compared their results, and any disagreements were discussed in order to arrive at consensus. All reasons for exclusions of full texts were recorded.
Data extraction
Two reviewers (I.B., D.G.V.) independently extracted data using a standardized form. They collected the following information from each eligible trial: study characteristics (country, study design, study period), description of participants (number of patients in control and intervention group, sex, age, indication for colonoscopy), and which AI-based system was used. Two reviewers (I.B., H.C.J.) independently extracted outcomes data using a standardized form. There were no disagreements.
Assessment of risk of bias
We followed the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach to rate the certainty of evidence for estimates derived from meta-analyses [9]. Two reviewers (I.B., D.G.V.) independently assessed risk of bias, using a modified Cochrane Collaboration tool for assessing risk of bias in randomized clinical trials, as either low or high risk of bias [10]. The assessed domains were random sequence generation (selection bias), allocation concealment (selection bias), blinding (performance and detection bias), incomplete outcome data (attrition bias), selective reporting (reporting bias), and other biases. Consensus was reached through discussion, and when needed, disagreements were resolved in the author group.
Data synthesis and statistical analysis
All analyses were based on per-protocol analyses of the trials. Summary measures were analyzed using the restricted maximum likelihood (REML) model. For dichotomous outcomes, we calculated relative risks (RRs) with 95 %CIs. For continuous outcomes, a Poisson distribution was assumed, and mean differences were calculated with 95 %CIs. Absolute risk differences are the differences in overall detection rates and in mean numbers between the non-AI and AI groups. We ascertained statistical heterogeneity among studies using the chi-squared test (threshold P = 0.10), and quantified it using the I² statistic. We planned to use funnel plots to examine the extent of publication bias for outcomes if 10 or more studies were included [11].
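The per-study quantities behind these analyses can be sketched in a few lines of Python. This is a hedged illustration only: it does not reproduce the REML pooling performed in Stata, it uses fixed inverse-variance weights to obtain Cochran's Q and I², it omits the chi-squared P value, and all function names and trial counts are hypothetical.

```python
import math
from typing import Sequence, Tuple


def log_relative_risk(e_ai: int, n_ai: int, e_ctrl: int, n_ctrl: int) -> Tuple[float, float]:
    """Return log(RR) and its approximate variance for one trial."""
    rr = (e_ai / n_ai) / (e_ctrl / n_ctrl)
    var = 1 / e_ai - 1 / n_ai + 1 / e_ctrl - 1 / n_ctrl
    return math.log(rr), var


def relative_risk_ci(e_ai: int, n_ai: int, e_ctrl: int, n_ctrl: int,
                     z: float = 1.96) -> Tuple[float, float, float]:
    """Relative risk with a 95 % CI computed on the log scale."""
    log_rr, var = log_relative_risk(e_ai, n_ai, e_ctrl, n_ctrl)
    half_width = z * math.sqrt(var)
    return math.exp(log_rr), math.exp(log_rr - half_width), math.exp(log_rr + half_width)


def poisson_mean_difference(total_ai: int, n_ai: int, total_ctrl: int, n_ctrl: int,
                            z: float = 1.96) -> Tuple[float, float, float]:
    """Difference in mean lesions per colonoscopy, assuming Poisson counts."""
    mean_ai, mean_ctrl = total_ai / n_ai, total_ctrl / n_ctrl
    # For a Poisson count the variance of the mean is mean / n.
    se = math.sqrt(mean_ai / n_ai + mean_ctrl / n_ctrl)
    diff = mean_ai - mean_ctrl
    return diff, diff - z * se, diff + z * se


def cochran_q_and_i2(effects: Sequence[float],
                     variances: Sequence[float]) -> Tuple[float, int, float]:
    """Cochran's Q, its degrees of freedom, and I² in percent."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i2


# Hypothetical trials: (events AI, n AI, events non-AI, n non-AI).
hypothetical_trials = [(150, 500, 100, 500), (90, 300, 70, 310), (60, 250, 35, 240)]
effects, variances = zip(*(log_relative_risk(*t) for t in hypothetical_trials))
q, df, i2 = cochran_q_and_i2(effects, variances)
print(f"Q = {q:.2f} on {df} df, I2 = {i2:.0f} %")
```

I² expresses the share of variability in the effect estimates that exceeds what chance alone would produce, which is why it is truncated at zero when Q falls below its degrees of freedom.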
We performed two a priori sensitivity analyses, one excluding studies where the endoscopist was blinded, and another excluding studies assessed to have a high risk of bias.
We used Stata version 16.1 (StataCorp, College Station, Texas, USA) for all data analyses. We adhered to the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) reporting standards (Appendix 2 s) [12].
Results
Our search yielded 1548 potentially relevant articles published up to February 12, 2020. Searches in reference lists of published articles did not result in additional papers. We contacted three lead investigators of two trials registered in ClinicalTrials.gov that were overdue to end recruitment [13] [14], and received a reply from one of them, who notified us that their trial was under review for publication [14]. We did not get access to these data.
After review, five prospective studies, all randomized controlled trials (RCTs), were included in meta-analyses (Fig. 1) [15] [16] [17] [18] [19]. A total of 4311 patients were included in the trials (Table 1). The trials were published in 2019 or 2020, and patients were enrolled between 2017 and 2019. All five were conducted in China. Three out of the five RCTs were single-blinded [15] [16] [17], one was non-blinded [18], and one was double-blinded [19]. Indications for colonoscopy in the trials were reported as either screening, symptomatic, or surveillance. All trials excluded patients with inflammatory bowel disease, history of colorectal cancer, history of radiotherapy and/or chemotherapy, and biopsy contraindications.
Fig. 1 Artificial intelligence for polyp detection during colonoscopy: study flowchart [12].
Table 1 Characteristics of included studies.

| Authors, year | Study design | Patients, total (AI, non-AI), n | Men/Women, n | Mean age, AI, non-AI, years | Colonoscopy indication, n (%) | Polyps, total (AI, non-AI), n | AI system | Endoscopists, n |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wang P et al. 2019 [18] | RCT, nonblinded | 1058 (522, 536) | 512/546 | 51.1, 49.9 | Screening: 84 (7.9); Symptomatic: 974 (92.1) | 767 (498, 269) | EndoScreener, Wision AI, Shanghai, China | 8 |
| Wang P et al. 2020 [19] | RCT, double-blinded, sham control | 962 (484, 478) | 495/467 | 49, 49 | Screening: 158 (16.4); Symptomatic: 804 (83.6) | 809 (501, 308) | EndoScreener; TensorRT, Nvidia, Santa Clara, California, USA | 4 |
| Gong D et al. 2020 [15] | RCT, single-blinded | 642 (324, 318) | 345/359 | 50, 49 | Health examination: 123 (17.5); Diagnostic: 545 (77.4); Surveillance: 36 (5.1) | 592* (383, 209) | ENDOANGEL, Wuhan University, Wuhan, China | 6 |
| Su JR et al. 2020 [16] | RCT, single-blinded | 623 (308, 315) | 307/316 | 50.5, 51.6 | Screening/surveillance: 216 (34.7); Symptomatic: 407 (65.3) | 273 (177, 96) | AQCS model, Jinan, China | 6 |
| Liu W et al. 2020 [17] | RCT, single-blinded | 1026 (508, 518) | 551/475 | 51, 50.1 | Screening: 66 (6.4); Symptomatic: 960 (93.6) | 734 (486, 248) | CADe system, Henan Xuanweitang Medical Information Technology, Zhengzhou City, China | Not specified |
| Total | | 4311 (2146, 2165) | 2210/2163 | | | 3175 (2045, 1130) | | |

AI, artificial intelligence; RCT, randomized controlled trial.
* Calculated from polyps detected per colonoscopy times number of colonoscopies.
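As a quick consistency check on Table 1, the sketch below re-adds the per-trial counts and derives crude polyps per colonoscopy for each arm. The variable names are illustrative assumptions. The crude ratios are simple totals divided by totals, so they are not the pooled estimates of the meta-analysis, and the Gong et al. polyp counts are themselves derived values, as noted in the table footnote.

```python
# (patients AI, patients non-AI, polyps AI, polyps non-AI) as listed in Table 1.
TABLE_1 = {
    "Wang P et al. 2019": (522, 536, 498, 269),
    "Wang P et al. 2020": (484, 478, 501, 308),
    "Gong D et al. 2020": (324, 318, 383, 209),
    "Su JR et al. 2020": (308, 315, 177, 96),
    "Liu W et al. 2020": (508, 518, 486, 248),
}

# Column-wise totals across the five trials.
patients_ai = sum(row[0] for row in TABLE_1.values())
patients_ctrl = sum(row[1] for row in TABLE_1.values())
polyps_ai = sum(row[2] for row in TABLE_1.values())
polyps_ctrl = sum(row[3] for row in TABLE_1.values())

# Crude (unweighted) mean number of polyps per colonoscopy in each arm.
crude_ai = polyps_ai / patients_ai
crude_ctrl = polyps_ctrl / patients_ctrl

print(f"Patients: {patients_ai + patients_ctrl} ({patients_ai}, {patients_ctrl})")
print(f"Polyps: {polyps_ai + polyps_ctrl} ({polyps_ai}, {polyps_ctrl})")
print(f"Crude polyps per colonoscopy: AI {crude_ai:.2f}, non-AI {crude_ctrl:.2f}")
```

The sums reproduce the Total row of the table, and the crude ratios come out at roughly 0.95 polyps per colonoscopy with AI and 0.52 without.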
AI systems used
All trials adopted convolutional neural network-based algorithms (the most commonly used model for deep learning) and were developed with input from endoscopists and modelers [15] [16] [17] [18] [19]. The five trials examined four different AI-based polyp detection systems (Table 1). Algorithms adopted in three trials [17] [18] [19] were designed to identify the location and presence of polyps, while another study [15] was focused on assuring the quality of colonoscopy by monitoring the withdrawal speed, detecting observation blind spots, and assessing bowel preparation. The latter algorithm did not directly serve to identify the polyp location; however, the improved colonoscopy quality due to the use of AI (e.g., lowered withdrawal speed, re-inspection of blind spots) could have contributed to the increased polyp detection. One study [17] used an algorithm for both polyp identification and quality assurance.
Two trials were conducted by the same research team and used the same AI-based polyp detection system (EndoScreener; Wision AI, Shanghai, China) [18] [19]. The later of these two trials also used software to reduce latency of detection (TensorRT; Nvidia, Santa Clara, California, USA) [19], and a sham system in the non-AI group (Wision AI). The sham AI system displayed alert boxes on random polyp-like and nonpolyp images without pointing at actual polyps during colonoscopy.
Another difference among these five trials related to the number of images constituting the database used for machine learning and development of the algorithms: 5545 images [18] [19], 3350 images [16], 535 videos [17], and 6378 images [15]. This difference in database size can potentially influence the performance of each AI model, since a larger number of images may contribute to higher diagnostic accuracy [7].
Risk of bias assessment
As shown in Appendix 3 s, four out of the five trials met at least one criterion for high risk of bias. Four trials had high risk of bias in terms of blinding (performance and detection bias) [15] [16] [17] [18]. One trial met none of the criteria for high risk of bias [19] and another was deemed to have high risk of bias in all categories but one [17]. Two trials had high risk of bias for incomplete data [15] [17], and one had high risk of bias for selective reporting [17].
Primary outcomes
Colonoscopy with AI-based polyp detection systems increased adenoma and polyp detection rates compared to colonoscopy without AI assistance. ADR with AI was 29.6 % (95 %CI 22.2 – 37.0) versus 19.3 % (95 %CI 12.7 – 25.9) without AI, giving an RR (risk ratio) of 1.52 (95 %CI 1.31 – 1.77; I² = 48 %) (Fig. 2b). The certainty of evidence was rated as high (Appendix 4 s). In a sensitivity analysis excluding the double-blinded RCT [19], the ADR with AI was 28.5 % (95 %CI 19.4 – 37.5) versus 17.2 % (95 %CI 10.5 – 23.9) without AI, giving an RR of 1.61 (95 %CI 1.42 – 1.82), with heterogeneity reduced from moderate to low (I² = 0 %) (Appendix 5 s).
Fig. 2 Forest plots showing colonoscopy detection rates with versus without artificial intelligence (AI) assistance for the included studies. Relative risks (risk ratios) are shown with 95 %CIs. a Polyp detection. b Adenoma detection. REML, restricted maximum likelihood.
Polyp detection rate with AI was 45.4 % (95 %CI 41.1 – 49.8) versus 30.6 % (95 %CI 26.5 – 34.6) without AI (RR 1.48 [95 %CI 1.37 – 1.60], I² = 0 %; Fig. 2a). The certainty of evidence was rated as high (Appendix 4 s).
There was no difference in the mean number of advanced adenomas per colonoscopy between AI-assisted and non-AI-assisted colonoscopy (0.03 in each group; mean difference 0 [95 %CI – 0.01 to 0.01], I² = 0 %) (Fig. 3c, Fig. 4). The mean number of polyps per colonoscopy with AI was 0.93, versus 0.51 without AI (mean difference 0.42 [95 %CI 0.33 – 0.50], I² = 65 %) (Fig. 3a). The mean number of adenomas per colonoscopy was 0.41 with AI versus 0.23 without AI (mean difference 0.18 [95 %CI 0.13 – 0.22], I² = 53 %) (Fig. 3b, Fig. 4).
Fig. 3 Forest plots showing mean number of lesions detected per colonoscopy with versus without artificial intelligence (AI) assistance for the included studies. Mean differences are shown with 95 %CIs. a Mean number of polyps per colonoscopy. b Mean number of adenomas per colonoscopy. c Mean number of advanced adenomas per colonoscopy. REML, restricted maximum likelihood.
Fig. 4 Pooled estimates for mean number of adenomas per colonoscopy and mean number of advanced adenomas per colonoscopy detected with and without artificial intelligence (AI)-based systems during colonoscopy.
None of the a priori sensitivity analyses of mean numbers detected per colonoscopy for polyps, adenomas, or advanced adenomas reduced the moderate to substantial heterogeneity. The certainty of evidence was rated high for mean number of advanced adenomas detected per colonoscopy, but downgraded to moderate, because of inconsistency, for mean numbers of polyps or adenomas detected per colonoscopy.
Secondary outcomes
Stratifying the mean number of adenomas detected per colonoscopy by polyp size showed an increase in detection of small adenomas ≤ 5 mm with AI-assisted as compared to non-AI-assisted colonoscopy (mean difference 0.15 [95 %CI 0.12 – 0.18], I² = 0 %), whereas there was no change in the detection of larger adenomas (> 5 to ≤ 10 mm, mean difference 0.03 [95 %CI 0.01 – 0.05], I² = 0 %; > 10 mm, mean difference 0.01 [95 %CI 0.00 – 0.02], I² = 0 %) (Fig. 5). The certainty of evidence was rated high.
Fig. 5 Forest plots showing mean number of adenomas detected per colonoscopy for different polyp sizes. Mean differences are shown with 95 %CIs. REML, restricted maximum likelihood.
The mean withdrawal time (excluding biopsy time) was 30 seconds longer with AI-assisted than with non-AI-assisted colonoscopy (with AI, range 6.18 – 7.03 minutes; without AI, range 5.68 – 6.07 minutes).
Measurement of false-positive and false-negative testing with AI was reported similarly in four trials [16] [17] [18] [19]. A false-negative event was defined as a polyp detected by the endoscopist, but not detected by the AI system (missed polyp). A false-positive event (false alarm) was defined as a polyp indicated by the AI system, but deemed by the endoscopist not to be a polyp. In these four trials the reported number of false-negatives was zero [16] [17] [18] [19]. The same four trials also reported false-positives [16] [17] [18] [19], ranging from 7.1 % [17] to 20.1 % [16] (mean 11.2 %). Patients with colorectal cancer were excluded in all the studies; we were therefore unable to report on colorectal cancer detection rates.
Discussion
Our systematic review and meta-analysis show that the use of AI-based polyp detection systems during colonoscopy increases adenoma and polyp detection rates, and the mean numbers of adenomas and polyps detected per colonoscopy. However, AI-assisted colonoscopy did not improve the detection of advanced adenomas (measured as mean number of advanced adenomas detected per colonoscopy) (Fig. 4). The certainty of evidence for primary outcomes such as ADR, PDR, and mean number of advanced adenomas detected per colonoscopy was rated as high, but downgraded to moderate for mean numbers of adenomas and of polyps detected per colonoscopy. The certainty of evidence for secondary outcomes, such as mean number of adenomas per colonoscopy when stratified by polyp size, was rated as high.
The ADR with AI was 29.6 % compared to 19.3 % without AI. This absolute increase of 10.3 percentage points may be clinically significant with regard to future colorectal cancer prevention. It is in line with the ADR improvements achieved through intensive endoscopist training, which have been shown to reduce colorectal cancer risk [19]. Currently, it is unknown whether AI-assisted colonoscopy is needed for all endoscopists, or only for those with suboptimal ADRs. Previous studies have shown a ceiling effect for the association between ADR and colorectal cancer risk with standard colonoscopy: the association is largest for endoscopists with low ADRs, and smaller for those with already high ADRs [4]. Unfortunately, the available trials did not provide information regarding the effect of using AI for individual endoscopists. In one of the included trials, only senior endoscopists with the same level of experience performed the colonoscopies; this study showed a somewhat smaller difference in ADR between the AI and non-AI colonoscopies than did the other studies, which included less experienced endoscopists [19].
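The arithmetic behind the absolute increase can be written out as a short Python sketch. The variable names are illustrative, and the "per 1000 colonoscopies" figure is only a re-expression of the same difference, not a result reported by the trials.

```python
# Pooled ADR estimates reported in the Results, in percent.
adr_ai = 29.6
adr_ctrl = 19.3

# Absolute difference in percentage points.
diff_pp = adr_ai - adr_ctrl

# The same difference expressed per 1000 colonoscopies.
extra_per_1000 = diff_pp * 10

# Simple ratio of the two pooled proportions.
crude_ratio = adr_ai / adr_ctrl

print(f"Absolute difference: {diff_pp:.1f} percentage points")
print(f"Additional colonoscopies with at least one adenoma per 1000: {extra_per_1000:.0f}")
print(f"Ratio of pooled ADRs: {crude_ratio:.2f}")
```

The ratio of the two pooled proportions (about 1.53) need not equal the pooled RR of 1.52, because the RR is combined across trials rather than computed from the two pooled rates.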
ADR, the strongest known procedure-related predictor for future colorectal cancer risk and thus a key quality indicator, includes both nonadvanced and advanced adenomas. Our findings reveal that AI-assisted colonoscopy improves the detection of nonadvanced adenomas; however, the mean number of advanced adenomas was identical in the two comparison groups. AI-based polyp detection systems seem to help in identifying more small polyps, while not contributing significantly to the detection of advanced adenomas (Fig. 4); the latter have the highest malignant potential and their removal, therefore, has the greatest potential to reduce cancer risk. There is no supporting evidence that detecting a greater number of small adenomas can contribute to a higher reduction in colorectal cancer [20]. Thus, AI-based polyp detection systems may lead to increased overdiagnosis of polyps, which in a colonoscopy screening setting can represent a waste of valuable time and resources [8]. Further studies are needed to disentangle the value of AI for colorectal cancer protection as compared to potentially increased resources for insignificant polyp removal.
The strengths of this review include an a priori protocol based on the Cochrane and GRADE guidelines [9] [10], a comprehensive systematic search in several databases, and a focus on high quality clinical trials.
Weaknesses of the included trials that may affect the pooled estimates include the following. All five trials were conducted in one country, China, with a lower risk of colorectal cancer and lower ADRs than western countries [21]. Therefore, the results of these trials may not be generalizable to other countries where baseline ADRs and PDRs are higher. Three of the included trials were single-blinded [15] [16] [17] and one was not blinded [18]. The lack of full blinding in these studies may have led to under- or overestimation of the effects of AI assistance in colonoscopy. However, one study was designed as a double-blinded RCT, using a sham system to keep both patients and endoscopists unaware of the random assignment [19], and the results for this trial alone were similar to the overall results including all the trials.
One study used a single endoscopy specialist for the annotation of the database underlying the algorithm of its AI-based polyp detection system [17]. The other four trials used deep learning models for which two or more experts labelled the images in the datasets and resolved disagreements by consensus, and whose models were finally validated on independent datasets [15] [16] [18] [19]. We therefore judged these trials to be less prone to bias, because the use of two image readers, plus an optional third reader in cases of disagreement, is recommended to improve the accuracy of imaging end points in clinical trials [22].
The assessment of false-positive and false-negative polyp detection with AI systems is challenging as it involves a clear definition of gold standards and biopsies of all lesions that are indicated by the AI system. The estimates derived from the studies in this review are hampered by these challenges.
All evaluations were based on per-protocol analyses of the trials included, mainly because only one study provided both intention-to-treat (ITT) and per-protocol analyses [15], while the other four studies [16] [17] [18] [19] provided data based solely on per-protocol analysis. While we consider ITT the best analysis with regard to results that occur after some follow-up, we do not think this is of importance for the purpose of the present study.
Our review shows that AI-based polyp detection systems increase ADR, PDR, and the mean numbers of polyps and of adenomas detected per colonoscopy, but shows no difference with regard to advanced adenomas. Studies investigating how AI might provide real-time identification and classification of polyps that may not need removal may counter the effect of overdiagnosis of small and diminutive polyps. At the same time, AI tools should focus on increased detection of advanced adenomas, where the potential for prevention may be largest.
Clinical trial
The protocol for this systematic review was registered on PROSPERO (CRD42020171860) and is available in full on the NIHR website (https://www.crd.york.ac.uk/prospero/).