Introduction
In recent years, advancements in image processing techniques have profoundly impacted the way gastrointestinal endoscopy is performed. High-definition and 4K video technologies now enable unprecedented visualization of vascular and mucosal details. With this enhanced image quality, the bottleneck in disease detection and diagnosis has shifted from visualization to interpretation by the endoscopist. Artificial intelligence (AI) has already shown great promise in assisting this interpretation [1] [2]. Nonetheless, the same technological progress that empowers AI also introduces potential challenges that may frustrate successful implementation in daily practice.
Current endoscopy systems offer a wide range of post-processing enhancement settings that may improve image characteristics such as color, texture, and contrast ([Fig. 1]). Endoscopists can change these settings according to their own preferences or specific situations. Enhancement settings, while often subtle to the human eye, may significantly impact the accuracy of AI predictions. [Fig. 2] exemplifies this issue. This remarkable behavior can be attributed to a phenomenon known as “domain shift.” This term describes the tendency of deep learning systems to perform well on familiar data but deteriorate quickly on data that slightly deviate from the training dataset [3] [4]. In the example shown in [Fig. 2], an AI system trained and validated solely on enhancement setting X may perform significantly less well when exposed to enhancement setting Y. In other fields of medicine, studies have shown that domain shift can significantly impact the reliability of AI systems [5] [6] [7]. The potential risk of domain shift is currently an underexplored issue for AI systems in gastrointestinal endoscopy.
Fig. 1 Examples of subtle differences in post-processing settings of endoscopy systems from different manufacturers: enhancement setting type A1 vs. A8 (left; Olympus, Tokyo, Japan), standard settings vs. tone and structure enhancement (middle; Fujifilm, Tokyo, Japan), and white-light endoscopy vs. I-SCAN 1 (right; Pentax, Tokyo, Japan; courtesy of Rehan Haidry).
Fig. 2 Two separate predictions of a computer-aided detection system for Barrett’s neoplasia. The same case is displayed twice with different levels of image enhancement. While the upper lesion is detected with high confidence, the same lesion projected with a different enhancement setting is missed.
One approach that can help mitigate the effects of domain shift is data augmentation. Data augmentation refers to the technique used to increase the diversity of a training dataset without actually collecting new data. It involves artificially creating new training images by applying transformations to existing images [8]. This can include operations such as rotating, flipping, and cropping images, adjusting contrast, or changing colors. The goal is to make AI systems more robust against new data that differ from the training dataset (i. e. domain shift).
A specialized form of data augmentation is image enhancement-based data augmentation. In this approach, the augmentations focus on transformations that closely align with the changes in post-processing enhancement settings available on endoscopy processors.
In this study, we evaluated the impact of domain shift due to post-processing enhancement settings on performance consistency of two endoscopic AI applications: computer-aided detection (CADe) of Barrett’s neoplasia and computer-aided diagnosis of colorectal polyps (CADx). We trained both CAD systems on datasets with limited variability of enhancement settings using standard data augmentation. We then tested these CAD systems across a wide range of enhancement settings resembling the heterogeneity encountered in daily clinical practice.
Subsequently, we evaluated image enhancement-based data augmentation methods to improve the performance consistency or robustness of both CAD systems against the image heterogeneity of enhancement settings.
Methods
Experimental setup
In this study we aimed to evaluate the impact of frequently used image enhancement settings on the performance of endoscopic AI systems. We investigated this based on two prevalent AI applications in endoscopy: CADe for primary detection of early Barrett’s neoplasia using white-light endoscopy, and CADx for optical diagnosis of colonic polyps using narrow-band imaging. For the current study, we redeveloped these two CAD systems based on datasets originating from published studies. First, we retrained these systems using their original datasets, which were acquired with limited variability in enhancement settings. These CAD systems were then tested across a wide range of test sets, each comprising exactly the same images, but displayed with different enhancement settings. Then, both CAD systems were retrained using specific data augmentation methods. The performance of these adjusted CAD systems was then re-evaluated on the same array of test sets.
Datasets for CAD systems
The CADe dataset consisted of Barrett’s esophagus images collected both retrospectively and prospectively for the development of a previously published CADe system by our own research group [2] [9]. All data were acquired in 11 international Barrett’s referral centers. Images were carefully collected and curated according to standardized protocols to adhere to high quality standards. The dataset contained 3339 neoplastic images and 2884 nondysplastic images originating from 637 and 269 Barrett’s patients, respectively. All images were captured with the HQ190 gastroscope and the CV-190 processor (Olympus, Tokyo, Japan) using white-light endoscopy without magnification. The vast majority of images were captured with enhancement setting A5.
The CADx dataset consisted of colonoscopy images collected prospectively for the development of a CADx system for characterization of diminutive colorectal polyps, and is publicly available [10]. Data were acquired in eight Dutch hospitals. All images were collected following a specific workflow to maintain quality consistency across centers. The dataset was separated into a neoplastic group, with adenomas and sessile serrated lesions, and a non-neoplastic group, with hyperplastic polyps. The neoplastic group comprised 2746 images from 736 patients, while the non-neoplastic set contained 542 images originating from 233 patients. Images used in the current study were obtained with 190-series endoscopes and CV-190 processors (Olympus) using narrow-band imaging without magnification. Images were captured with enhancement settings A3 and A5.
Examples of both datasets are given in [Fig. 3].
Fig. 3 Cases featured in the dataset for computer-aided detection of Barrett’s neoplasia (upper row) and computer-aided diagnosis of diminutive colorectal polyps (lower row).
Enhancement settings
Both datasets were collected using the EXERA III CV-190 processor from Olympus. This device offers multiple enhancement settings for both white-light endoscopy and narrow-band imaging. The most frequently used settings are:
enhancement type A: this setting enhances fine patterns in the image by increasing contrast and sharpness;
enhancement type B: this setting enhances even finer patterns than type A by more subtly increasing contrast and sharpness.
During an endoscopic procedure, the endoscopist can select one distinct setting and its degree of enhancement (1–8, where a higher number represents stronger enhancement). These settings, as illustrated in Fig. 1 s in the online-only Supplementary material, are typically preconfigured by the hospital’s maintenance or technical services team and remain unchanged throughout different procedures.
Conversion software
In this study, we used a proprietary software tool that was specifically developed to exactly replicate original EXERA III CV-190 images with different enhancement settings. By inputting an image and its original enhancement setting, the software can generate equivalent images at other settings. The main differences between these settings are achieved by amplifying or reducing image sharpness and contrast. Comparative examples of original and converted images are provided in Fig. 2 s. As shown, the converted images are virtually indistinguishable from original images.
Data augmentation
In this study, two different data augmentation methods were used.
Standard data augmentation: this method included geometric transformations (e. g. rotation and flipping), color transformations (e. g. contrast and saturation adjustment), and filtering (e. g. blur and sharpness). These methods are commonly employed to increase the diversity of the training dataset, helping the model to generalize better to new data.
Image enhancement data augmentation: this method comprised a limited selection of standard augmentations such as geometric transformation in combination with image enhancement-based augmentations. For image enhancement data augmentations, we used the proprietary software tool to augment the training set with images across the complete spectrum of available enhancement settings.
Examples of several data augmentations are given in Fig. 3 s.
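To make the two strategies concrete, they can be sketched as follows. This is a minimal Python illustration using Pillow; because the proprietary conversion tool is not publicly available, the enhancement-based branch only approximates the settings with hypothetical (contrast, sharpness) pairs, and all names are illustrative rather than the study's actual code.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

def standard_augment(img: Image.Image) -> Image.Image:
    """Standard data augmentation: random geometric and photometric changes."""
    if random.random() < 0.5:
        img = ImageOps.mirror(img)                 # horizontal flip
    img = img.rotate(random.uniform(-20, 20))      # small random rotation
    # saturation jitter
    return ImageEnhance.Color(img).enhance(random.uniform(0.8, 1.2))

# The proprietary conversion tool is not public; as a stand-in, each
# enhancement level is approximated by a hypothetical (contrast, sharpness)
# pair of increasing strength.
SIMULATED_SETTINGS = {f"A{i}": (0.90 + 0.03 * i, 0.8 + 0.1 * i) for i in range(1, 9)}

def enhancement_augment(img: Image.Image) -> Image.Image:
    """Image enhancement-based augmentation: sample one simulated setting."""
    contrast, sharpness = random.choice(list(SIMULATED_SETTINGS.values()))
    img = ImageEnhance.Contrast(img).enhance(contrast)
    return ImageEnhance.Sharpness(img).enhance(sharpness)
```

In practice, one of the two functions would be applied on the fly to each training image in the data-loading pipeline.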
Development of CADe and CADx systems
For every dataset (CADe and CADx), two separate CAD systems were developed. Each was trained using either standard or image enhancement-based data augmentation. During training, a patient-based random split was executed, allocating 60 %, 20 %, and 20 % for training, validation, and testing, respectively. A ResNet-50 encoder, initialized with ImageNet pre-trained weights, was employed as this is a widely accepted and commonly used architecture. For all CAD systems, the operating threshold was optimized on its corresponding validation set comprising the same enhancement setting as the training set. Further details are given in the Supplementary material (see Algorithm development).
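The patient-based split described above can be sketched as follows, a minimal example using scikit-learn's GroupShuffleSplit with patient identifiers as the grouping variable (the ResNet-50 training itself is omitted, and function names are illustrative, not the study's actual code).

```python
from sklearn.model_selection import GroupShuffleSplit

def patient_based_split(image_ids, patient_ids, seed=42):
    """Allocate images 60/20/20 to train/validation/test such that all
    images of one patient end up in the same subset."""
    outer = GroupShuffleSplit(n_splits=1, train_size=0.6, random_state=seed)
    train_idx, rest_idx = next(outer.split(image_ids, groups=patient_ids))
    # Split the remaining 40 % of patients evenly into validation and test.
    rest_groups = [patient_ids[i] for i in rest_idx]
    inner = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=seed)
    val_rel, test_rel = next(inner.split(rest_idx, groups=rest_groups))
    return (list(train_idx),
            [rest_idx[i] for i in val_rel],
            [rest_idx[i] for i in test_rel])
```

Splitting at the patient rather than the image level prevents near-duplicate images of the same lesion from leaking between training and test sets.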
Evaluation of CAD systems
To evaluate these CAD systems, we used the following test sets.
Original test set: this dataset comprised the CAD systems’ original, unaltered test set images, which were acquired using limited variability in enhancement settings.
Simulated test sets: the software tool was then used to convert the original test set to generate new test sets with different enhancement settings. As the tool offers 16 different enhancement settings (i. e. A1–A8, B1–B8), this resulted in 16 simulated test sets per CAD system.
Outcome measures
The study outcomes were:
baseline performance (sensitivity and specificity) of CADe and CADx, trained with standard or image enhancement-based data augmentation, on their original test set;
performance variability (median and range of sensitivity and specificity) of CADe and CADx, trained with standard or image enhancement-based data augmentation, across simulated test sets.
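Computed over the simulated test sets, the variability outcome reduces to the following; a minimal Python sketch with illustrative names (the study's actual evaluation code is not published).

```python
from statistics import median

def sens_spec(y_true, y_pred):
    """Sensitivity and specificity from binary labels and predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

def variability(y_true, preds_by_setting):
    """Median and range of sensitivity/specificity across simulated test
    sets, e.g. preds_by_setting = {'A1': [...], ..., 'B8': [...]}; the
    labels y_true are identical for every setting."""
    scores = [sens_spec(y_true, p) for p in preds_by_setting.values()]
    sens = [s for s, _ in scores]
    spec = [s for _, s in scores]
    return {
        "sensitivity": (median(sens), min(sens), max(sens)),
        "specificity": (median(spec), min(spec), max(spec)),
    }
```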
Post hoc analysis
The original training sets of CADe and CADx were acquired using moderate enhancement settings (A3 and A5). In a post hoc analysis, we investigated more extreme enhancement settings. We analyzed scenarios where a CAD system is trained on one end of the spectrum of enhancement settings (e. g. A1) and applied in clinical practice with settings from the opposite end (e. g. A8).
Statistical analysis
Sensitivity and specificity were chosen as the primary performance metrics in this study. Sensitivity represents the proportion of true positives among all positive cases, while specificity denotes the proportion of true negatives among all negative cases. In the CADe dataset for Barrett’s neoplasia detection, positive cases were defined as images showing neoplastic tissue, while negative cases were nondysplastic Barrett’s images. In the CADx dataset for colorectal polyp characterization, positives included images of neoplastic polyps (adenomas and sessile serrated lesions), and negatives consisted of non-neoplastic polyps (e. g. hyperplastic polyps).
We report sensitivity and specificity for the original test sets to establish baseline performance under the original enhancement settings. Confidence intervals were calculated using the Wilson method. To evaluate variability in performance across different enhancement settings, we tested the systems on an array of simulated test sets generated with varied enhancement settings. For these simulated test sets, we present the median sensitivity and specificity across all settings, alongside the full range of these values to illustrate the extent of performance variability. For a more granular view of this distribution, we provide receiver operating characteristic scatterplots displaying results for each individual simulated test set. To test whether image enhancement-based data augmentation decreased the variability of sensitivity and specificity, we performed a one-sided nonparametric Mood test, using the R package “coin” (R Foundation for Statistical Computing, Vienna, Austria).
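To make these definitions concrete: the Wilson interval has a closed form, and SciPy's scipy.stats.mood offers an analogous one-sided test of scale. The study itself used the R coin package, so this Python sketch is an approximation for illustration, not the original analysis.

```python
import math
from scipy.stats import mood, norm

def wilson_ci(successes: int, total: int, alpha: float = 0.05):
    """Wilson score confidence interval for a binomial proportion,
    e.g. sensitivity = successes / total."""
    z = norm.ppf(1 - alpha / 2)
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half

def lower_variability_p(enhanced_scores, standard_scores):
    """One-sided Mood test of scale: is the spread of per-setting scores
    smaller with enhancement-based augmentation than with standard
    augmentation? (SciPy analogue of the R 'coin' Mood test.)"""
    _, p_value = mood(enhanced_scores, standard_scores, alternative="less")
    return p_value
```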
Results
Performance of CAD systems using standard data augmentation
For Barrett’s neoplasia detection, the CADe system trained with standard data augmentation displayed a baseline performance of 90 % sensitivity and 89 % specificity on its original test set. On the simulated test sets comprising a wide variety of enhancement settings, the CADe system reached median sensitivity and specificity of 87 % and 89 %, respectively. Performance varied substantially between individual sets. Sensitivity ranged between 83 % and 92 % (Δ 9 %) and specificity ranged between 84 % and 91 % (Δ 7 %).
The CADx system for colorectal polyps, trained using standard data augmentation, reached a baseline performance of 79 % sensitivity and 63 % specificity on its original test set. For the simulated test sets, the CADx system displayed median sensitivity and specificity of 79 % and 59 %, respectively. As with the CADe system, performance varied substantially between individual sets. Sensitivity ranged from 78 % to 85 % (Δ 7 %) and specificity ranged from 45 % to 63 % (Δ 18 %).
An example case of substantial performance variability across different enhancement settings is given in [Fig. 4].
Fig. 4 Example case of performance variability of the computer-aided detection system trained with standard data augmentation and image enhancement-based data augmentation. Image enhancement-based data augmentation results in substantially more stable predictions.
Performance of CAD systems using image enhancement-based data augmentation
After retraining with image enhancement-based data augmentation, the CADe system showed a baseline performance of 90 % sensitivity and 90 % specificity on the original test set. On the simulated test sets, median sensitivity and specificity were 90 % and 90 %, respectively. Performance variability was significantly lower compared with standard data augmentation. Sensitivity ranged from 89 % to 91 % (Δ 2 %; P < 0.001), and specificity ranged from 90 % to 91 % (Δ 1 %; P = 0.003).
The retrained CADx system displayed a baseline performance of 78 % sensitivity and 63 % specificity on the original test set. On simulated test sets, the retrained CADx system reached median sensitivity and specificity of 79 % and 60 %, respectively. Performance variability was limited, with sensitivity ranging from 78 % to 80 % (Δ 2 %; P = 0.03) and specificity ranging from 55 % to 63 % (Δ 8 %; P = 0.190).
All results are summarized in [Table 1] and illustrated in [Fig. 5]. Results on individual test sets are given in Table 1 s.
Table 1
Results of computer-aided detection and diagnosis systems trained with standard and image enhancement-based data augmentation.

| AI application | Data augmentation | Original test set | Simulated test sets, median | Simulated test sets, range (min–max) | P value |
| CADe, sensitivity, % | Standard | 90 | 87 | 9 (83–92) | < 0.001 |
| CADe, sensitivity, % | Image enhancement-based | 90 | 90 | 2 (89–91) | |
| CADe, specificity, % | Standard | 89 | 89 | 7 (84–91) | 0.003 |
| CADe, specificity, % | Image enhancement-based | 90 | 90 | 1 (90–91) | |
| CADx, sensitivity, % | Standard | 79 | 79 | 7 (78–85) | 0.03 |
| CADx, sensitivity, % | Image enhancement-based | 78 | 79 | 2 (78–80) | |
| CADx, specificity, % | Standard | 63 | 59 | 18 (45–63) | 0.19 |
| CADx, specificity, % | Image enhancement-based | 63 | 60 | 8 (55–63) | |

CADe, computer-aided detection; CADx, computer-aided diagnosis.
Fig. 5 Performance variability of both computer-aided detection (CADe) and computer-aided diagnosis (CADx) using standard or image enhancement-based data augmentation. Each dot represents the performance of the respective CAD system on a test set comprising one specific enhancement setting.
Post hoc analysis
When the CADe system was trained with standard data augmentation using a different, more extreme enhancement setting (i. e. A1, A8, B1, or B8) and tested on all simulated test sets, performance variability increased even further. For instance, the CADe system trained solely on images with enhancement setting A8 displayed sensitivity scores between 66 % and 89 % (Δ 23 %). For CADx, the system trained on images with setting B1 reached specificity scores ranging between 31 % and 61 % (Δ 30 %). All post hoc results are displayed in Table 2 s, Table 3 s, and Fig. 4 s.
Discussion
Current endoscopy systems offer a wide range of post-processing enhancement settings that endoscopists may use to adjust image characteristics such as color, texture, and contrast. We investigated whether these settings, while often subtle to the human eye, affect the performance of endoscopic AI systems. We tested this on two commonly used endoscopic AI applications, detection of early Barrett’s neoplasia and characterization of diminutive colorectal polyps, and found a remarkable impact of enhancement settings on AI performance.
When we trained and tested both CAD systems on their original datasets without changing enhancement settings of the original images, the systems performed as expected. However, when we evaluated the systems’ performances on test sets with different enhancement settings, performance varied substantially. For instance, the CADe system for detection of Barrett’s neoplasia showed sensitivity varying between 83 % and 92 %: a 9 % absolute difference in neoplasia detection. For the CADx system characterizing colorectal polyps, the variability in performance was even more profound, with specificity scores differing up to 18 %. When we trained the systems with the extremes of the enhancement settings, performance variability increased even more. For example, a CADe system trained only on images with setting B1 showed specificity between 55 % and 88 % depending on the enhancement setting of the test sets.
These findings are significant, as all major manufacturers of endoscopy systems offer a variety of company-specific enhancement modes. Often, these settings are preconfigured by the hospital’s technical services team. Most endoscopists do not adjust these settings further and many may not even be aware of them. As there is no clear default mode, these settings may differ widely between hospitals or even between individual endoscopy suites. Meanwhile, AI systems are typically neither trained nor optimized to perform consistently across the full range of these settings, which may contribute to the lack of generalizability observed when AI systems are deployed outside of the centers where they were initially developed [11] [12] [13]. For example, if an AI system trained on data with high-contrast settings is deployed in a clinic where low-contrast settings are standard, it may fail to detect subtle lesions due to the lower visual contrast, potentially leading to missed diagnoses. Conversely, other enhancement settings might amplify the AI system’s sensitivity, causing it to flag non-neoplastic mucosa as suspicious more often. This could lead to a high false-positive rate, creating “alert fatigue” in clinicians.
Fortunately, there may be an effective solution to this issue. As the main problem is the lack of heterogeneity of enhancement settings in the training data, an intuitive solution is to introduce the complete spectrum of enhancement settings into the training data. Enhancement settings are based on post-processing transformations. Therefore, this does not require collection of additional real-world clinical data. Rather, the solution may be found with image enhancement-based data augmentation. Data augmentation is a standard step in training AI algorithms in endoscopy. It generally involves making slight modifications to the original images in the training set – such as rotations, color adjustments, or scaling – to artificially expand the dataset and introduce a wider variety of training examples. Image enhancement-based data augmentation involves restricting these augmentations to endoscopy-specific settings, in this case, the variability induced by enhancement settings. Using a proprietary software tool, we duplicated all images in the CAD systems’ training sets to create new images comprising all different enhancement settings. This approach proved to be successful, as both the CADe and CADx systems displayed significantly more stable performance: CADe sensitivity and specificity differences were limited to 2 % and 1 %, while CADx differences were 2 % and 8 %, respectively.
We recognize that this study has some limitations. First, this study is limited by its ex vivo design using simulated still images. Images were converted to different enhancement settings after initial acquisition and storage, using a proprietary software tool that exactly replicates these settings. Although a prospective collection of matched images with different enhancement settings would be ideal, it is impractical, if not impossible, to capture matched pairs across all enhancement settings for each image. Additionally, varying capture devices and compression techniques could introduce further variability, potentially distorting results. In contrast, our ex vivo approach allowed for a standardized, paired analysis across 16 enhancement settings. Moreover, as shown in Fig. 2 s , the converted images were nearly indistinguishable from the original data, supporting the validity of this approach. Second, despite the fact that the evaluated datasets were aimed at different endoscopic applications, the datasets originated from the same manufacturer and endoscope series. Other endoscopy systems were not investigated. Yet it is conceivable that AI systems on these platforms will suffer from similar issues. Third, the current proposed method for image enhancement-based data augmentation remains a quick fix and may not be sustainable long term. Ideally, a more generic method would be developed that is applicable to all current post-processing settings and future settings yet to be developed. Fourth, this study is specifically aimed at post-processing image enhancement. There may be other sources of domain shift that can form a similar threat to model generalizability, such as other virtual chromoendoscopy techniques, differing generations of endoscopes, and quality of the endoscopic imaging procedure.
In conclusion, this study highlights the impact of different enhancement settings on performance variability of endoscopic AI systems. This should be taken into account during future development and implementation of AI in endoscopy. Collecting more diverse data during the development phase may offer the most direct solution, but specific data augmentation techniques to simulate enhancement setting variability or generative AI solutions could be pragmatic alternatives and deserve further investigation.