Introduction
Colonoscopy is the gold standard for colorectal cancer diagnosis and subsequent surveillance.
The quality of colonoscopy substantially affects the efficacy of adenomatous polyp
detection and colorectal cancer diagnosis. The American Society for Gastrointestinal
Endoscopy (ASGE), British Society of Gastroenterology (BSG), European Society of Gastrointestinal
Endoscopy (ESGE), and the Canadian Association of Gastroenterology (CAG) have established baseline
quality standards for colonoscopy evaluation [1] [2] [3] [4] [5]. These metrics include
cecal intubation rate (> 90 %–95 %), withdrawal time (> 6 minutes), and adenoma detection
rate (> 15 %–25 %).
The adenoma detection rate (ADR) is a particularly well-characterized quality indicator
and is inversely related to the development of post-colonoscopy colorectal cancer
(PCCRC) [6]. Other quality metrics, including adequacy of bowel preparation and sufficient withdrawal
time, have also been associated with higher ADR and lower rates of subsequent PCCRC
[7] [8] [9] [10]. Likewise, colonoscopy completion, defined by cecal intubation, has been shown to
be negatively associated with PCCRC development [11]. However, there is considerable variability in cecal intubation rates (CIR) and
photodocumentation among healthcare practitioners and facilities. The CIR ranges from
58.8 % to 100 %, with photodocumentation varying from 6 % to 81 % [12] [13] [14] [15] [16] [17].
The variability in photodocumentation within institutions adds an additional barrier
to quality improvement in colonoscopy without an objective means to assess for cecal
intubation. At present, there are few automated means to record and confirm colonoscopy
completion to maintain quality indicator standards.
Within endoscopy, there has been a shift toward implementing artificial intelligence
(AI) to enhance diagnostics. There have been prior attempts at detecting cecal intubation
without machine learning, including edge detection by geometric shapes, intensity change,
and saturation [18] [19]. However, AI research in endoscopy has been accelerated by
machine learning. In particular, these techniques have been best described in the
computer-assisted detection of polyps (CADe), and histologic prediction of diminutive
polyps (CADx) [20] [21] [22] [23] [24] [25]. However, there have also been some studies implementing AI into quality indicators,
including bowel preparation calculation, withdrawal time, and natural language processing
in automating ADR calculation [26] [27] [28] [29] [30] [31] [32] [33]. Certain studies
have automated the detection of cecal intubation with artificial
intelligence, but these algorithms have not been assessed in suboptimal conditions
[33]. To date, no studies have evaluated colonoscopy completion under suboptimal conditions,
including variable bowel preparation. In this study, we develop a deep convolutional
neural network capable of detecting the presence of the appendiceal orifice as a marker
of cecal intubation and colonoscopy completion with variable bowel preparation.
Methods
This was a retrospective study using high-definition videos of colonoscopy procedures
conducted at St. Michael’s Hospital in Toronto, Canada, from 2015 to 2018. This study
was approved by the St. Michael’s Hospital Research Ethics Board (19-050).
Datasets and preprocessing
The image dataset was derived from videos of colonoscopy procedures recorded during
previous interventional studies conducted at St. Michael’s Hospital between 2015 and
2018 [34] [35] [36]. These videos did not have any patient identifiers and only images of bowel lumen
were extracted. We screened 144 procedures from previous studies. Videos were included
if the recorded colonoscopy was completed (beginning at the rectum, reaching the cecum,
and withdrawn back to the rectum). Videos were excluded from the study if the recorded
colonoscopy was incomplete (i.e., the cecum was not reached), or if the video itself was
incomplete (i.e., the recording did not begin at the rectum, and/or did not show withdrawal
back to the rectum after cecal intubation). A total of 35 videos were included in
this study. The videos were converted into images at 10 frames per second using Adobe
Photoshop CC 2019 (Adobe, San Jose, California, United States); a scriptable equivalent
of this sampling step is sketched below. Images were then
classified as either: (1) containing the appendiceal orifice (AO); or (2) not containing
the AO (non-AO). These images were subclassified into Boston Bowel Preparation Scale
(BBPS) scores (a commonly used evaluative tool for assessment of quality of bowel
preparation) 0, 1, 2, and 3 for a segment of bowel. Score 0 was defined as a segment
of mucosa not seen because of solid stool that could not be cleared. Score 1 was defined
as a portion of mucosa seen, but other areas of the segment not well seen because
of staining, residual stool or opaque liquid. Score 2 was defined as a minor amount
of residual staining, small fragments of stool, or opaque liquid, with the mucosa
well seen. Score 3 was defined as a well visualized segment of colon without any staining,
fragments of stool, or opaque liquid [37]. The classification process was conducted by
expert gastroenterologists (i.e., > 1000 completed procedures). The appendiceal orifice
was first located in the videos to ensure correct landmarking. Images that did not provide
information regarding the appendiceal orifice or bowel preparation, such as red-outs,
irrigation, fluid levels, biopsy forceps, and blurry images, were left unclassified. The spectrum
of non-AO images was included to simulate conditions encountered in colonoscopy.
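To make the 10-frames-per-second extraction reproducible outside Adobe Photoshop, the sampling step can be scripted. The following is a minimal sketch rather than the authors' pipeline, assuming OpenCV is available; the function name and output layout are illustrative only.

```python
# Hypothetical frame-extraction sketch: sample a colonoscopy video at ~10 fps.
# The authors used Adobe Photoshop; this OpenCV version reproduces the sampling.
import os
import cv2

def extract_frames(video_path: str, out_dir: str, target_fps: float = 10.0) -> int:
    """Save every Nth frame so the output approximates target_fps; return count."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps  # fall back if unknown
    step = max(int(round(native_fps / target_fps)), 1)    # keep every Nth frame
    saved = index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.png"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```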
In total, 13,522 images were collected from the videos. There were 6852 images in the
AO group and 6670 images in the non-AO group. The AO images included full views of the
AO, partial views of the AO, or cecal landmarks suggestive of the AO (triradiate cecal
folds). Within the dataset, 6,559 AO images and 6,663 non-AO images were utilized for
training, validation, and testing: 11,900 images for training and validation and 1,322
images for testing. We ensured that the
proportion of AO to non-AO images was consistent at each of the training, validation
and testing phases. We additionally rescaled all images to a size of 224 × 224 pixels.
In addition, to allow for greater generalizability, we applied several data augmentation
strategies to the training data. These augmentations included: resized cropping, horizontal
and/or vertical flips, random rotation up to 30 degrees, and random affine transforms
up to a factor of 10.
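In a PyTorch implementation, this augmentation pipeline maps naturally onto torchvision transforms. The sketch below is one plausible reading under stated assumptions: the paper does not specify the crop scale or how the affine "factor of 10" was parameterized, so a shear of 10 is assumed here.

```python
# Assumed torchvision translation of the described augmentations (training only).
import torchvision.transforms as T

train_transforms = T.Compose([
    T.RandomResizedCrop(224),             # resized cropping to the 224 x 224 input
    T.RandomHorizontalFlip(),             # horizontal and/or ...
    T.RandomVerticalFlip(),               # ... vertical flips
    T.RandomRotation(degrees=30),         # random rotation up to 30 degrees
    T.RandomAffine(degrees=0, shear=10),  # shear=10 assumed for "factor of 10"
    T.ToTensor(),                         # convert to a float tensor for the network
])
```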
Dense convolutional neural networks
We used DenseNet, a dense convolutional neural network architecture pre-trained
on approximately 1.2 million images with SIFT transforms from the ImageNet dataset,
as the backbone of our model [38] [39]. The DenseNet backbone connects each layer to
every other layer in a feed-forward
fashion. It has several advantages including stronger feature propagation, feature
reuse, and fewer parameters, ultimately leading to a smaller model size. In our implementation,
we adopted the DenseNet169 model architecture, but replaced the last layer with our
customized classifier for appendiceal orifice detection [38] [40]. All experiments were implemented using the PyTorch and scikit-learn libraries.
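A minimal PyTorch sketch of this setup follows. The internal layout of the customized classifier is not described in the text, so a single linear layer mapping to the two classes (AO vs. non-AO) is assumed.

```python
# Assumed model definition: ImageNet-pretrained DenseNet169 backbone with the
# final classification layer replaced by a two-class head.
import torch.nn as nn
from torchvision import models

model = models.densenet169(pretrained=True)    # backbone pre-trained on ImageNet
num_features = model.classifier.in_features    # 1664 features for DenseNet169
model.classifier = nn.Linear(num_features, 2)  # logits for AO vs. non-AO
```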
Training and testing
We used a batch size of 128 for both the training and validation datasets. We used
an Adam optimizer with an initial learning rate of 3 × 10⁻⁴ and a scheduler to decay
the learning rate of each parameter group by 0.1 every 7 epochs. We optimized our model
using cross-entropy loss as the criterion for the task, which combines a log softmax
operation with a negative log likelihood loss.
For validation of the algorithm, we used all available validation examples and verified
that the loss function continued to decrease with each training epoch.
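This configuration maps directly onto standard PyTorch components. The loop below is a sketch, assuming the model from the previous snippet and hypothetical train_loader/val_loader objects built with a batch size of 128; the epoch count is not reported and is left as an assumption.

```python
# Assumed training loop for the stated configuration: Adam at 3e-4, a step
# scheduler decaying the learning rate by 0.1 every 7 epochs, and cross-entropy
# loss (log softmax + negative log likelihood).
import torch
from torch import nn, optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=3e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

for epoch in range(num_epochs):          # num_epochs: assumed, not reported
    model.train()
    for images, labels in train_loader:  # DataLoader with batch_size=128
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                     # decay learning rate every 7 epochs

    model.eval()                         # check validation loss keeps improving
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
```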
Outcomes and statistical analysis
The primary outcome of the study was to evaluate the operating characteristics of
a deep convolutional neural network trained to detect AO vs non-AO and to assess the
performance characteristics of the algorithm with variable bowel preparation scores
[37]. The operating characteristics of interest include the detection rate of AO and
non-AO along with the overall accuracy, sensitivity, specificity, positive predictive
value, and negative predictive value of the model. The F1 score of the model was also
calculated as a metric to balance precision and recall of the deep convolutional neural
network.
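Each of these metrics can be derived from the test-set confusion matrix. The sketch below uses the scikit-learn library named under Methods; y_true, y_pred, and y_score are assumed arrays of ground-truth labels, predicted classes, and predicted AO probabilities, respectively.

```python
# Assumed metric computation from test-set predictions using scikit-learn.
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)            # detection rate of AO images
specificity = tn / (tn + fp)            # detection rate of non-AO images
ppv = tp / (tp + fp)                    # positive predictive value (precision)
npv = tn / (tn + fn)                    # negative predictive value
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = f1_score(y_true, y_pred)           # balances precision and recall
auroc = roc_auc_score(y_true, y_score)  # area under the ROC curve
```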
Results
Dataset characteristics
There were a total of 13,513 images (6847 AO images and 6666 non-AO images) extracted
from 35 colonoscopy videos. In terms of AO images, there were no additional findings
identified in the images, including no diverticula, polyps, or vascular lesions. With
respect to BBPS scores for AO images, there were 0 (0.0 %) BBPS 0 images, 2378 (34.7 %)
BBPS 1 images, 3924 (57.3 %) BBPS 2 images, and 545 (8.0 %) BBPS 3 images. Within
these images, 2378 (34.7 %) had inadequate BBPS scores (< 2), while 4469 (65.3 %)
images had adequate BBPS scores (≥ 2). Within the non-AO images, there were 5153 images
(77.3 %) that were assigned BBPS scores, and 1513 (22.7 %) BBPS-unclassifiable images.
There were 133 additional findings (2.6 %) within the images assigned BBPS scores,
all of which were polyps. In terms of BBPS unclassifiable images, there were 1023
blurry images (67.6 %), 249 images (16.5 %) of fluid levels or irrigation, 34 images
(2.2 %) of instrumentation, and 207 images (13.7 %) of red-outs. Among the images assigned
BBPS scores, there were 0 (0.0 %) BBPS 0 images, 647 (9.7 %) BBPS 1 images, 3103 (46.6 %)
BBPS 2 images, and 1403 (21.1 %) BBPS 3 images. There were
647 images (12.6 %) of inadequate BBPS scores (< 2) and 4506 images (87.4 %) with
adequate BBPS scores (≥ 2) ([Table 1]).
Table 1
Adequacy and subclassification of Boston Bowel Preparation Scores (BBPS) for AO and
non-AO images in the dataset.
| | BBPS 0 | BBPS 1 | BBPS 2 | BBPS 3 | BBPS unclassifiable | BBPS ≥ 2 | BBPS ≤ 1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AO | 0 (0.0 %) | 2378 (34.7 %) | 3924 (57.3 %) | 545 (8.0 %) | 0 (0.0 %) | 4469 (65.3 %) | 2378 (34.7 %) |
| Non-AO | 0 (0.0 %) | 647 (9.7 %) | 3103 (46.6 %) | 1403 (21.1 %) | 1513 (22.7 %) | 4506 (87.4 %) | 647 (12.6 %) |

AO, appendiceal orifice.
AO and non-AO detection
A total of 1,322 images were used for testing, composed of 656 AO (50.0 %) and 666
non-AO (50.0 %) images. The test set was representative of the proportion of the classes
during training and validation. The proposed model correctly classified appendiceal
orifice and non-appendiceal orifice images with an overall accuracy of 94 % on the test
dataset, demonstrating excellent discrimination between the two image classes ([Fig. 1]).
The AUROC for this neural network was 0.98 ([Fig. 2]). The operating characteristics of
the model can be found in [Table 2]. The F1 score of the model was 94.3 % ([Table 2]).
Fig. 1 Training and validation loss and accuracy curves for the deep convolutional neural
network.
Fig. 2 Area under receiver operator characteristic curve for the deep convolutional neural
network.
Table 2
Detection characteristics of deep convolutional neural network in AO and Non-AO images.
| | Sensitivity | Specificity | Positive predictive value | Negative predictive value | F1 score |
| --- | --- | --- | --- | --- | --- |
| Model operating characteristics | 96.5 % | 92.0 % | 92.3 % | 96.4 % | 94.3 % |

| | True positive | True negative | False positive | False negative |
| --- | --- | --- | --- | --- |
| Model test set | 633 | 613 | 53 | 23 |

AO, appendiceal orifice.
AO and non-AO detection with variable bowel preparation
With regard to the AO test set characteristics, there were 0 (0.0 %) BBPS 0 images,
355 (54.1 %) BBPS 1 images, 255 (38.9 %) BBPS 2 images, and 46 (7.0 %) BBPS 3 images.
There were 355 images (54.1 %) with inadequate BBPS scores (< 2) and 301 images (45.9 %)
with adequate BBPS scores (≥ 2). In the non-AO test group, there were 0 (0.0 %) BBPS
0 images, 81 (12.2 %) BBPS 1 images, 315 (47.3 %) BBPS 2 images, 116 (17.4 %) BBPS
3 images, and 154 (23.1 %) BBPS unclassifiable images. There were 81 images (15.8 %)
of inadequate BBPS scores (< 2) and 431 images (84.2 %) with adequate BBPS scores
(≥ 2) ([Table 3]). The performance of the algorithm with variable bowel preparation in the test set
for AO detection was 97.5 %, 95.3 % and 95.7 % in BBPS 1, 2, and 3, respectively.
When stratified for inadequate (BBPS < 2) and adequate (BBPS ≥ 2) bowel preparation,
AO detection was 97.5 % and 95.4 %, respectively. Likewise, non-AO detection for BBPS
1, 2, 3, and unclassifiable was 84.0 %, 89.8 %, 98.3 %, and 96.1 %, respectively.
In terms of inadequate (BBPS < 2) and adequate (BBPS ≥ 2) bowel preparation, non-AO
detection was 84.0 % and 92.1 %, respectively ([Table 4]). The model characteristics for BBPS1, BBPS2, and BBPS3 images, along with adequacy
of bowel preparation can be found in [Table 5].
Table 3
Adequacy and subclassification of Boston Bowel Preparation Scores (BBPS) for AO and
non-AO images in test set.
| | BBPS 0 | BBPS 1 | BBPS 2 | BBPS 3 | BBPS unclassifiable | BBPS ≥ 2 | BBPS ≤ 1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AO | 0 (0.0 %) | 355 (54.1 %) | 255 (38.9 %) | 46 (7.0 %) | 0 (0.0 %) | 301 (45.9 %) | 355 (54.1 %) |
| Non-AO | 0 (0.0 %) | 81 (12.2 %) | 315 (47.3 %) | 116 (17.4 %) | 154 (23.1 %) | 431 (84.2 %) | 81 (15.8 %) |

AO, appendiceal orifice.
Table 4
Performance of deep convolutional neural network with varying Boston Bowel Preparation
Scores (BBPS) in AO and non-AO images.
| | BBPS 0 | BBPS 1 | BBPS 2 | BBPS 3 | BBPS unclassifiable | BBPS ≥ 2 | BBPS ≤ 1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AO | N/A | 97.5 % | 95.3 % | 95.7 % | N/A | 95.4 % | 97.5 % |
| Non-AO | N/A | 84.0 % | 89.8 % | 98.3 % | 96.1 % | 92.1 % | 84.0 % |

AO, appendiceal orifice.
Table 5
Detection characteristics of deep convolutional neural network in AO and Non-AO images
stratified by Boston Bowel Preparation Scale (BBPS) scores.
| | BBPS 1 | BBPS 2 | BBPS 3 | Unclassifiable | BBPS ≥ 2 | BBPS ≤ 1 |
| --- | --- | --- | --- | --- | --- | --- |
| False negative | 9 | 12 | 2 | – | 14 | 9 |
| False positive | 13 | 32 | 2 | 6 | 34 | 13 |
| True positive | 346 | 243 | 44 | – | 287 | 346 |
| True negative | 68 | 283 | 114 | 148 | 397 | 68 |
| Sensitivity | 97.5 % | 95.3 % | 95.7 % | – | 95.4 % | 97.5 % |
| Specificity | 84.0 % | 89.8 % | 98.3 % | 96.1 % | 92.1 % | 84.0 % |
| PPV | 96.4 % | 88.4 % | 95.7 % | – | 89.4 % | 96.4 % |
| NPV | 88.3 % | 95.9 % | 98.3 % | 100.0 % | 96.6 % | 88.3 % |

AO, appendiceal orifice; PPV, positive predictive value; NPV, negative predictive value.
Discussion
We present a deep convolutional neural network with an accuracy of 94 % and an area
under the receiver operating curve of 0.98 in discriminating images of the AO from
those that do not depict the AO, with variable bowel preparation. The algorithm had
overall excellent operating characteristics in sensitivity, specificity, positive
predictive value, and negative predictive value. When assessing bowel preparation, AO detection
was > 95 % irrespective of BBPS score and adequacy of bowel preparation. However,
non-AO detection progressively improved (from 84.0 % to 98.3 %) with BBPS score and
was superior with adequate (92.1 %) compared to inadequate (84.0 %) bowel preparation.
The improving operating characteristics from non-AO BBPS 1 to BBPS 3 can be attributed
to a number of factors. For example, clearly visualizing the absence of cecal landmarks
becomes more difficult as bowel preparation worsens because of increased background
noise. In a review of the false-positive images (non-AO interpreted as
AO) in BBPS 1 and BBPS 2 classes, the majority of the images had some feature that
could be misinterpreted as part of the triradiate cecal folds. These features coupled
with increasing noise from worsening bowel preparation led to the misclassification
of these images. This was compounded by the fact that there were limited non-AO BBPS
1 images (9.7 %) for training. This contrasts with AO BBPS 3 images, for which
performance was excellent (95.7 %) despite a limited data set (8.0 %), as the presence
of cecal landmarks can be clearly identified. Of note, there were no BBPS 0 images
in the dataset. Although we do not expect our algorithm to have difficulty with this
classification given that cecal landmarks would be completely obscured, it would be
an important addition to simulate real-life colonoscopy conditions. To improve the
false-positive rate and the classification of non-AO images with lower BBPS scores, a
larger number of more varied images is required for training, validation, and testing.
Although the distribution of BBPS scores was not equal, this did not bias our algorithm
as it was trained for detection of the AO and non-AO, and not bowel preparation. Likewise,
the fluctuation in the BBPS proportions in the test set compared to the overall dataset
is attributed to random allocation that was conducted for AO and non-AO images, but
not for bowel preparation. Despite the excellent accuracy and operating characteristics
in AO detection across all bowel preparation classes, our system was only trained
and tested on 35 videos with a relatively limited number of images in our dataset.
The model, particularly for lower-BBPS non-AO images, could be improved with a larger,
balanced dataset for training and testing to enhance variability and improve generalizability.
As the model was trained and validated on static images, this algorithm's applicability
to recorded videos and real-time colonoscopy has not yet been determined and requires
further research.
In our review of the literature, existing applications of AI in gastroenterology have
focused primarily on developing computer-assisted devices for detection and pathology
prediction of polyps [21] [22] [23] [24] [25]. There is growing interest in the implementation of AI in assessing quality indicators
in colonoscopy. In particular, algorithms have been used to assess for bowel preparation
and withdrawal time [32] [33]. However, this is among the first machine learning algorithms created to assess
for cecal intubation in the presence of variable bowel preparation. The algorithm
adds to the pre-existing literature by synthesizing differing quality metric parameters
and better simulating real-world conditions in colonoscopy. Moreover, the robustness
of the algorithm is demonstrated under variable and suboptimal conditions. Given that
colonoscopy quality indicators occur along a spectrum, the ability of algorithms
to perform under variable circumstances is particularly relevant. Although the validation
of artificial intelligence algorithms in controlled environments is important, their
impact may be greater under subpar circumstances. For example, greater benefit may
be obtained from the detection of polyps in inadequate bowel preparation, or in lower-quality
colonoscopy systems with worse spatial resolution. As such, machine learning algorithms
should be evaluated in both optimal and suboptimal conditions to broaden the applicability
of their use cases and to derive maximal benefit in imperfect circumstances.
The applications of AI pertaining to colonoscopy completion are significant. Although
all major gastroenterology societies have thresholds for colonoscopy completion, there
is considerable variability in cecal intubation and photodocumentation among hospitals
and practitioners. Of concern, the rates of cecal intubation and photodocumentation
among certain providers fall short of the standards set forth by multiple gastroenterology
societies [12] [13] [14] [15] [16] [17]. The maintenance of this quality metric is
significant, as lower rates of colonoscopy
completion are associated with higher rates of PCCRC [11]. Despite this, there are no
formal auditing practices to ensure the maintenance of quality indicators in endoscopy,
as auditing is both cost- and time-intensive. One common quality improvement initiative,
providing intermittent feedback to clinicians regarding their cecal intubation rates
through report cards, has been shown to improve cecal intubation rates [41] [42].
Likewise, other studies have shown a possible association between time of day and
worsening endoscopy quality with reductions in ADR and cecal intubation rates as a
workday progresses, possibly related to practitioner fatigue [43] [44] [45]. As a result, implementing a computer-assisted device for detection of colonoscopy
completion may provide a method of quality indicator feedback by facilitating automated
documentation and objective detection of cecal intubation.
Conclusions
In summary, we successfully created an algorithm using a deep convolutional neural
network with excellent accuracy for detection of the AO under variable bowel preparation.
Moving forward, this algorithm requires a larger dataset for training, and implementation
in real-time colonoscopy to elucidate its applications more clearly. Within the domain
of quality indicators in colonoscopy, future work should evaluate the synthesis of other
AI quality metric algorithms under suboptimal conditions to derive greater benefit in
improving and maintaining colonoscopy quality.