DOI: 10.1055/a-2388-6084
The role of generative language systems in increasing patient awareness of colon cancer screening
Abstract
Background This study aimed to evaluate the effectiveness of ChatGPT (Chat Generative Pretrained Transformer) in answering patients' questions about colorectal cancer (CRC) screening, with the ultimate goal of enhancing patients' awareness of and adherence to national screening programs.
Methods 15 questions on CRC screening were posed to ChatGPT-4. The answers were rated by 20 gastroenterology experts and 20 nonexperts in three domains (accuracy, completeness, and comprehensibility), and by 100 patients in three dichotomous domains (completeness, comprehensibility, and trustability).
Results According to the expert rating, the mean (SD) accuracy score was 4.8 (1.1), on a scale ranging from 1 to 6. The mean (SD) scores for completeness and comprehensibility were 2.1 (0.7) and 2.8 (0.4), respectively, on scales ranging from 1 to 3. Overall, the mean (SD) accuracy (4.8 [1.1] vs. 5.6 [0.7]; P < 0.001) and completeness scores (2.1 [0.7] vs. 2.7 [0.4]; P < 0.001) were significantly lower for the experts than for the nonexperts, while comprehensibility was comparable between the two groups (2.8 [0.4] vs. 2.8 [0.3]; P = 0.55). Patients rated all responses as complete, comprehensible, and trustable in 97 %–100 % of cases.
Conclusions ChatGPT shows good performance, with the potential to enhance awareness about CRC and improve screening outcomes. Generative language systems may be further improved after proper training in accordance with scientific evidence and current guidelines.
Introduction
Colorectal cancer (CRC) remains a paramount healthcare concern globally, ranking as the third most prevalent cancer in men and the second most prevalent in women; it is the fourth leading cause of cancer fatalities worldwide [1]. Early detection and the removal of precancerous and cancerous lesions through CRC screening may significantly reduce disease incidence and mortality [2] [3] [4]. To date, numerous interventions have been proposed to increase screening uptake [5]. Among these, an electronic patient portal has been shown to increase adherence to CRC screening and improve the time to screening completion [6] [7]. Despite these efforts, adherence to CRC screening remains suboptimal, ranging from 30 % to 80 % [8] [9] [10]. This may be because people have limited knowledge of the effectiveness of screening and of the different screening techniques, or because they fear undergoing invasive tests such as colonoscopy.
Artificial intelligence (AI)-assisted chatbots such as ChatGPT (Chat Generative Pretrained Transformer; OpenAI) are emerging as a revolutionary tool with the potential to provide educational support to patients through an accessible question–answer system [11]. As such a tool could serve as a complementary resource for healthcare, it is currently being evaluated in various areas of medicine, including gastroenterology.
A recent study assessed the performance of ChatGPT in providing responses to questions from patients with nonalcoholic fatty liver disease, showing high accuracy, reliability, and comprehensibility [12]. Similar studies evaluating ChatGPT in gastroenterology have been performed in the settings of acute pancreatitis [13] and Helicobacter pylori infection [14], and for patient questions about colonoscopy [15]. In line with this research, this study aimed to evaluate the effectiveness of ChatGPT in answering patients' questions about CRC screening, with the ultimate goal of enhancing patients' awareness of and adherence to national screening programs worldwide.
Methods
The study took place from February to June 2024, across Europe and the USA. As neither patient-identifiable data nor intervention approaches were used, institutional review board approval was not required.
A working group composed of three authors (M.M., D.R., and C.H.) created a list of 15 questions on CRC screening and its diagnostic and therapeutic implications ([Table 1]). The questions focused on general screening (Q1–Q6 and Q11), endoscopic measures (Q7–Q10), and therapeutic measures (Q12–Q15). Upon reaching a consensus on the list of questions, the working group entered the queries into ChatGPT (version GPT-4.0) and recorded the corresponding answers (Appendix 1 s, see online-only Supplementary material).
Subsequently, the responses were reviewed in parallel by 20 experts and 20 nonexperts. Experts were well-recognized international gastroenterologists with extensive clinical and scientific experience in gastroenterology and CRC screening. Nonexperts were physicians who were not board-certified in gastroenterology and who lacked specific expertise in CRC screening.
Both experts and nonexperts scored each response according to three domains – accuracy, completeness, and comprehensibility – using Likert scales that ranged between 1 and 6 for accuracy, and between 1 and 3 for completeness and comprehensibility. The use of these domains to evaluate the answers of ChatGPT, as well as their definitions, was derived from a previous similar study [12].
The ChatGPT responses were also given to 100 consecutive patients, aged 50–69 years, who were participating in the Italian national CRC screening program; these patients rated each response dichotomously (yes/no) as being understandable, complete, and trustable. For this purpose, the responses were first translated into Italian by a bilingual native Italian–English speaker to keep the translation as faithful as possible to the original.
The results were analyzed, with mean (SD) reported for continuous variables and frequency and percentage for categorical variables. Comparisons of the variables were made by t test and chi-squared test as appropriate. A P value of < 0.05 was considered to indicate statistical significance.
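As an illustration of these comparisons, the sketch below applies an unpaired t test to two groups of Likert scores and a chi-squared test to a 2 × 2 table of dichotomous ratings using scipy; the arrays and counts are hypothetical placeholders, not the study data.

```python
# Minimal sketch of the statistical comparisons described above, using
# hypothetical data (the per-rater scores from the study are not reproduced here).
import numpy as np
from scipy import stats

# Hypothetical accuracy ratings on the 1-6 Likert scale, one value per rater
expert_scores = np.array([5, 4, 5, 3, 6, 5, 4, 5, 5, 4])
nonexpert_scores = np.array([6, 5, 6, 6, 5, 6, 6, 5, 6, 6])

# Unpaired t test comparing mean scores between the two groups
t_stat, p_t = stats.ttest_ind(expert_scores, nonexpert_scores)

# Chi-squared test on a 2x2 contingency table of dichotomous (yes/no) ratings
contingency = np.array([[98, 2],    # hypothetical counts, group A: yes / no
                        [196, 4]])  # hypothetical counts, group B: yes / no
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency)

print(f"t test: t = {t_stat:.2f}, P = {p_t:.3f}")
print(f"chi-squared test: chi2 = {chi2:.2f}, P = {p_chi2:.3f}")
```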
The internal consistency of the scale was assessed using the Cronbach alpha coefficient; cutoff points of < 60 %, 61 %–70 %, 71 %–80 %, 81 %–90 %, and > 90 % were considered to suggest poor, questionable, acceptable, good, and excellent reliability, respectively [16].
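For reference, the Cronbach alpha coefficient [16] for a scale of k items is the standard reliability estimate shown below, where σ²_i is the variance of the scores for item i and σ²_X is the variance of the total (summed) score:

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_{i}}{\sigma^2_{X}}\right)
```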
All statistical analyses were performed using IBM SPSS Statistics v. 29.0 for Macintosh (IBM Corp., Armonk, New York, USA).
Results
Expert assessment
According to the expert assessment, the mean (SD) accuracy score, on a scale ranging from 1 to 6 points, was 4.8 (1.1), with 10/15 questions (Q1, Q2, Q7–Q12, Q14, and Q15) receiving a mean rating ≥ 5 (“nearly all correct” judgment) ([Fig. 1a]; [Table 2]). Q12 and Q14 received the highest scores (5.2), while Q6 had the lowest score (3.9).
Scale definitions ([Table 2] and [Table 3] footnotes):
1 Accuracy (6-point Likert scale): 1, completely incorrect; 2, more incorrect than correct; 3, approximately equally correct and incorrect; 4, more correct than incorrect; 5, nearly all correct; 6, correct; ≥ 4 points was considered accurate.
2 Completeness (3-point Likert scale): 1, incomplete (addresses some aspects of the question, but significant parts are missing or incomplete); 2, adequate (addresses all aspects of the question and provides the minimum amount of information required to be considered complete); 3, comprehensive (addresses all aspects of the question and provides additional information or context beyond what was expected); ≥ 2 points was considered complete.
3 Comprehensibility (3-point Likert scale): 1, difficult to understand; 2, partially difficult to understand; 3, easy to understand; 3 points was considered comprehensible.
The mean (SD) completeness score, on a scale ranging from 1 to 3 points, was 2.1 (0.7), with all questions except Q3, Q6, and Q10 receiving a mean rating ≥ 2 (“adequate” judgment) ([Fig. 1b]). Q1 and Q14 received the highest scores (2.5 and 2.6, respectively), while Q10 received the lowest score (1.7).
Concerning comprehensibility, the mean (SD) score, on a scale ranging from 1 to 3, was 2.8 (0.4) ([Fig. 1c]). All questions received a rating ≥ 2.5 points (where 3 = “easy to understand”). Among these, Q3, Q4, and Q13 received the lowest scores (2.6).
The internal consistency of responses using Cronbach alpha coefficient was excellent for accuracy (0.92) and completeness (0.91), and acceptable for comprehensibility (0.77).
Nonexpert assessment
Nonexpert physicians comprised internal medicine doctors (55 %), general practitioners (20 %), and pulmonologists, geriatricians, nephrologists, and radiologists (25 % combined).
The mean (SD) accuracy score, on a scale of 1 to 6, was 5.6 (0.7), with all questions receiving a mean rating ≥ 5 (“nearly all correct” judgment) ([Table 3]). The mean (SD) completeness score, on a scale of 1 to 3, was 2.7 (0.4), with all questions receiving a mean rating ≥ 2 (“adequate” judgment). The mean (SD) comprehensibility score, on a scale of 1 to 3, was 2.8 (0.3), with all questions scored ≥ 2.5 points (where 3 = “easy to understand”).
The internal consistency of responses using Cronbach alpha coefficient was excellent for accuracy (0.97), and good for completeness (0.85) and comprehensibility (0.89).
Overall, in comparison with the nonexperts, the experts had significantly lower mean (SD) scores for accuracy (4.8 [1.1] vs. 5.6 [0.7]; P < 0.001) and completeness (2.1 [0.7] vs. 2.7 [0.4]; P < 0.001). In contrast, the mean (SD) comprehensibility scores were comparable for the two groups (2.8 [0.4] vs. 2.8 [0.3]; P = 0.55). The full comparisons between the experts' and nonexpertsʼ ratings for each question are shown in Table 1 s.
Patients’ assessment
The ChatGPT responses were given to 100 consecutive patients, aged between 50 and 69 years, admitted for CRC screening at the University Hospital of Enna “Kore.”
On the whole, each response received a completeness rating between 99 % and 100 %, with 10/15 questions being deemed complete by 100 % of the patients ([Table 4]). Likewise, the responses were considered comprehensible in 97 %–100 % of cases, with 12/15 questions being deemed comprehensible by 100 % of the patients. Finally, the responses were deemed trustworthy in 99 %–100 % of cases, with 14/15 questions being deemed trustworthy by 100 % of the patients.
Discussion
The results from this study show that ChatGPT performs well in providing patients with screening information, which has the potential to enhance awareness about CRC. The expert assessment showed high levels of accuracy and completeness, and an excellent level of comprehensibility. Although experts rated the accuracy as only fair in two questions (Q3, which describes the screening process, and Q6, which explains the recall strategy after a positive fecal immunochemical test [FIT]), the responses were still correct and provided a clear hierarchy and timing of the different exams. Similarly, completeness was rated as fair in three questions: Q3 and Q6, mentioned above, and Q10, which explains how to perform bowel preparation before colonoscopy. The answer to this last question was short and not exhaustive, despite bowel preparation being a key quality indicator for colonoscopy and for the detection of lesions.
As expected, the accuracy and completeness scores given by nonexperts were significantly higher, because specialists are generally more critical on topics within their own field. The comprehensibility ratings were, however, comparable, as both the experts and nonexperts are familiar with medical language.
Patients do not have the skills to judge the accuracy of responses, so they were not asked about this domain; however, when asked to comment on the completeness, comprehensibility, and trustability of the answers, they reported very high ratings. This confirms that, despite containing medical terminology, all the responses were found to be complete and easily understandable from the perspective of patients. Moreover, almost all patients expressed a high level of trust in the accuracy of the answers and, consequently, in the tool itself. This is relevant because generative language systems are usually built on statistical models that are not aimed at presenting accurate information, but rather at creating the impression of doing so by mimicking human speech or writing, which does not always succeed.
While the results of this study demonstrate the good performance of ChatGPT, it is important to note that this tool is not intended to replace medical consultations. In many instances, patients require a medical interview to address concerns and receive appropriate clarifications. Furthermore, engaging in discussion with a physician is essential to address complex questions related to concurrent medical conditions and medications, and to provide personalized care to patients.
To the best of our knowledge, this is the first study assessing the performance of a pretrained generative language system in responding to CRC screening-related inquiries that includes evaluations by both doctors and patients. The study had several strengths. The evaluation of the responses was conducted by a large number of CRC experts from both Europe and the USA, ensuring the high reliability of the gold standard used in the study. Additionally, the assessment was carried out in parallel by non-gastroenterologists, to compare the ratings of experts and nonexperts, as well as by patients, to gauge user perception. Moreover, all assessments by healthcare personnel were obtained using quantitative scales, allowing the ratings to be compared homogeneously.
This study also has some limitations. First, the assessment of ChatGPT is limited to the study setting, so it cannot be generalized beyond these 15 questions and the evaluation of the raters. Second, it must be considered that the responses generated by pretrained generative language models change based on the input provided, thereby resulting in poor reproducibility. Third, in this study, the questions were formulated by physicians, which represents a potential bias: patient-generated questions may be less focused, affecting the quality of the responses provided by the chatbot. Finally, the questionnaire was disseminated only to Italian patients, limiting the external validity of the results. For this purpose, the questionnaire was translated into Italian from the original English version, which was very similar, though not identical. Nevertheless, given the high rates of positive ratings, it is likely that the language had little influence on patients' assessments.
In conclusion, this study shows that ChatGPT performs well in responding to patient CRC screening questions, with the potential to enhance awareness about CRC and improve screening outcomes. Nonetheless, these results must be interpreted with caution as they refer to a specific setting and cannot be generalized to overall ChatGPT performance. In the future, generative language systems will need further improvement to provide medical-specific versions that are trained in accordance with up-to-date scientific evidence and current guidelines.
Conflict of Interests
Y. Mori has received consulting and speaking fees, plus an equipment loan from Olympus, and royalties from Cybernet System Corp. M. Maida, D. Ramai, M. Dinis-Ribeiro, A. Facciorusso, and C. Hassan declare that they have no conflicts of interest.
References
- 1 Fitzmaurice C, Dicker D, Pain A. et al. Global burden of disease cancer collaboration. The global burden of cancer 2013. JAMA Oncol 2015; 1: 505-527
- 2 Løberg M, Kalager M, Holme Ø. et al. Long-term colorectal-cancer mortality after adenoma removal. N Engl J Med 2014; 371: 799-807
- 3 Bretthauer M, Løberg M, Wieszczy P. et al. Effect of colonoscopy screening on risks of colorectal cancer and related death. N Engl J Med 2022; 387: 1547-1556
- 4 Hewitson P, Glasziou P, Irwig L. et al. Screening for colorectal cancer using the faecal occult blood test, Hemoccult. Cochrane Database Syst Rev 2007; 2007: CD001216
- 5 Tsipa A, O'Connor DB, Branley-Bell D. et al. Promoting colorectal cancer screening: a systematic review and meta-analysis of randomised controlled trials of interventions to increase uptake. Health Psychol Rev 2021; 15: 371-394
- 6 Hahn EE, Baecker A, Shen E. et al. A patient portal-based commitment device to improve adherence with screening for colorectal cancer: a retrospective observational study. J Gen Intern Med 2021; 36: 952-960
- 7 Goshgarian G, Sorourdi C, May FP. et al. Effect of patient portal messaging before mailing fecal immunochemical test kit on colorectal cancer screening rates: a randomized clinical trial. JAMA Netw Open 2022; 5: e2146863
- 8 Klabunde C, Blom J, Bulliard JL. et al. Participation rates for organized colorectal cancer screening programmes: an international comparison. J Med Screen 2015; 22: 119-126
- 9 McNamara D, Leen R, Seng-Lee C. et al. Sustained participation, colonoscopy uptake and adenoma detection rates over two rounds of the Tallaght-Trinity College colorectal cancer screening programme with the faecal immunological test. Eur J Gastroenterol Hepatol 2014; 26: 1415-1421
- 10 Kapidzic A, Grobbee EJ, Hol L. et al. Attendance and yield over three rounds of population-based fecal immunochemical test screening. Am J Gastroenterol 2014; 109: 1257-1264
- 11 OpenAI. ChatGPT (Mar 14 version). 2023. Available at: https://chat.openai.com (Accessed 5 February 2024)
- 12 Pugliese N, Wai-Sun Wong V, Schattenberg JM. et al. Accuracy, reliability, and comprehensibility of ChatGPT-generated medical responses for patients with nonalcoholic fatty liver disease. Clin Gastroenterol Hepatol 2024; 22: 886-889.e5
- 13 Du RC, Liu X, Lai YK. et al. Exploring the performance of ChatGPT on acute pancreatitis-related questions. J Transl Med 2024; 22: 527
- 14 Lai Y, Liao F, Zhao J. et al. Exploring the capacities of ChatGPT: a comprehensive evaluation of its accuracy and repeatability in addressing Helicobacter pylori-related queries. Helicobacter 2024; 29: e13078
- 15 Lee TC, Staller K, Botoman V. et al. ChatGPT answers common patient questions about colonoscopy. Gastroenterology 2023; 165: 509-511.e7
- 16 Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951; 16: 297-334
Publication History
Received: 14 March 2024
Accepted after revision: 14 August 2024
Accepted Manuscript online: 14 August 2024
Article published online: 23 October 2024
© 2024. Thieme. All rights reserved.
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany