DOI: 10.1055/a-2528-4299
Leveraging Guideline-Based Clinical Decision Support Systems with Large Language Models: A Case Study with Breast Cancer

Abstract
Background Multidisciplinary tumor boards (MTBs) have been established in most countries to allow experts to collaboratively determine the best treatment decisions for cancer patients. However, MTBs often face challenges such as case overload, which can compromise the quality of MTB decisions. Clinical decision support systems (CDSSs) have been introduced to assist clinicians in this process. Despite their potential, CDSSs remain underutilized in routine practice. The emergence of large language models (LLMs), such as ChatGPT, offers new opportunities to improve the efficiency and usability of traditional CDSSs.
Objectives OncoDoc2 is a guideline-based CDSS developed using a documentary approach and applied to breast cancer management. This study aims to evaluate the potential of LLMs, used as question-answering (QA) systems, to improve the usability of OncoDoc2 across different prompt engineering techniques (PETs).
Methods Data extracted from breast cancer patient summaries (BCPSs), together with questions formulated by OncoDoc2, were used to create prompts for various LLMs, and several PETs were designed and tested. Using a random sample of 200 BCPSs, LLMs and PETs were first compared on their responses to OncoDoc2 questions using standard metrics (accuracy, precision, recall, and F1 score). The best-performing LLMs and PETs were then further assessed by comparing the therapeutic recommendations generated by OncoDoc2 from LLM inputs to those obtained by MTB clinicians using OncoDoc2. Finally, the best-performing method was validated on a new random sample of 30 BCPSs.
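The first comparison step scores each LLM's answers to OncoDoc2's criteria questions against clinician-validated gold answers using standard classification metrics. A minimal sketch of that scoring step, assuming binary (yes/no) answers per criterion; function and variable names are hypothetical and this is not the authors' implementation:

```python
# Illustrative sketch (not the authors' code): scoring an LLM's yes/no answers
# to OncoDoc2 criteria questions against clinician-validated gold answers.

def score_answers(predicted, gold):
    """Return accuracy, precision, recall, and F1 for binary answers."""
    tp = sum(p and g for p, g in zip(predicted, gold))          # true positives
    fp = sum(p and not g for p, g in zip(predicted, gold))      # false positives
    fn = sum(not p and g for p, g in zip(predicted, gold))      # false negatives
    tn = sum(not p and not g for p, g in zip(predicted, gold))  # true negatives
    accuracy = (tp + tn) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical example: LLM answers to five criteria questions for one BCPS
predicted = [True, True, False, True, False]
gold      = [True, False, False, True, True]
acc, prec, rec, f1 = score_answers(predicted, gold)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# → accuracy=0.60 precision=0.67 recall=0.67 f1=0.67
```

In the study itself, these metrics are aggregated over all criteria questions across the 200 BCPSs to rank the LLM/PET combinations.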
Results The combination of the Mistral and OpenChat models under the enhanced Zero-Shot PET showed the best performance as a question-answering system, achieving a precision of 60.16%, a recall of 54.18%, an F1 score of 56.59%, and an accuracy of 75.57% on the validation set of 30 BCPSs. However, this approach yielded poor results as a CDSS: only 16.67% of the recommendations generated by OncoDoc2 from LLM inputs matched the gold standard.
Conclusion All the criteria in the OncoDoc2 decision tree are crucial for capturing the uniqueness of each patient, and any deviation from a criterion alters the recommendations generated. Despite achieving a good accuracy rate of 75.57%, LLMs still face challenges in reliably understanding complex medical contexts and in functioning effectively as CDSSs.
Keywords
clinical decision support systems - OncoDoc2 - breast cancer - large language models - question-answering systems
Publication History
Received: 25 September 2024
Accepted: 18 January 2025
Accepted Manuscript online: 29 January 2025
Article published online: 16 April 2025
© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany