Keywords
artificial intelligence - clinical decision support - clinical informatics - clinical information systems
Background and Significance
Large language models (LLMs), which are a type of artificial intelligence (AI), are designed to process and understand human language. They are typically trained on massive amounts of text. For example, ChatGPT is a prominent LLM that has garnered a great deal of public attention for its capabilities since its introduction in late 2022.[1][2][3] The possible health care applications of LLMs are numerous. Representative examples include generating clinical documentation, personalized educational materials, and original scientific manuscripts.[4][5]
One well-known limitation of ChatGPT is its tendency for “hallucination,” the generation of text that is perceived as convincing but is not accurate.[6] A second limitation is that ChatGPT can propagate bias that is intrinsic to the training data. These issues have raised concerns about the safety of LLM use in health care. Specifically, some researchers envision scenarios in which ChatGPT could provide clinical care advice that is outdated, inaccurate, or incomplete.[7][8][9][10]
Determining the best uses of LLMs in health care has been the focus of recent studies. In a prior publication, clinicians with informatics expertise evaluated LLMs for clinical decision support and concluded that they may provide valuable assistance (Liu et al, 2023).[11] However, that study did not address the experience of novice LLM users. Furthermore, few studies have investigated health care provider comfort with LLMs or used both quantitative and qualitative methods. Those studies either asked general questions about the suitability of LLMs in different health care domains[12][13] or compared LLM performance with human performance on a single health care delivery service.[14]
Presently, applications of LLMs are being developed at a rapid pace and could see widespread adoption within health care by novice and expert users alike. The most ethical and effective implementation of the technology must consider the requirements and concerns of representative stakeholders in the clinical setting. In this study, we surveyed diverse practicing clinicians about using LLMs for tasks in clinical practice, research, and education and summarized their perceptions of the potential and limitations of LLMs. Our aim was to inform the development of clinically meaningful evaluation standards that support the appropriate and ethical implementation of LLMs in clinical settings.
Methods
Study Design and Sampling
The survey instrument, which is shown in [Supplementary Appendix 1] (available in the online version only), was developed by two authors with both clinical and informatics experience (M.S., B.I.) and refined based on feedback from a third author (E.R.G.). It was implemented through Qualtrics (Qualtrics, Provo, UT), took approximately 15 minutes to complete, and gauged clinicians' perceptions on the appropriateness of using LLMs in clinical practice, research, and education.
The opening questions quantified participants' experience in clinical medicine and informatics with multiple-choice answers. Next, participants were asked whether their LLM use in health care within the past year had exceeded 50 hours. Then, a series of questions prompted participants to rate the appropriateness of LLM use for 23 different tasks in clinical practice, research, and education on a 5-point Likert scale (i.e., Highly Appropriate to Highly Inappropriate). Those tasks represented a sample of proposed LLM uses synthesized from the literature and included, but were not limited to, optimizing alerts for clinical decision support, providing a differential diagnosis, writing a discharge summary, recommending treatment options, translating radiology reports into layperson language, writing scientific manuscripts, and generating personalized study plans for students or trainees.[2][6][7][8][15][16][17][18][19][20][21][22][23][24][25]
We distributed the proposed clinical practice tasks over two questions and had one question each for research tasks and education tasks. The purpose of this section was to measure the perceived appropriateness of these tasks by category and to determine whether any individual tasks were negative or positive outliers. We hypothesized that perceptions of the strengths, limitations, and ethical concerns of LLMs could contribute to the ratings. Therefore, we included open-ended questions about each of these topics, as well as an open-ended question about other possible uses of LLMs.
Data Collection
We recruited participants with an email invitation that was sent to a listserv of clinicians at Columbia University Irving Medical Center and by word of mouth. To be eligible, participants needed to be practicing clinicians affiliated with Columbia University within the past 12 months and were able to comprehend and communicate fluently in English. Respondents were compensated with a $20 Amazon Gift Card for completing the survey.
Data Analysis
We calculated descriptive statistics on the participants and tabulated their ratings for each question. Two independent reviewers performed an inductive thematic analysis of the narrative comments. Each coded the free-text responses independently in NVivo (Version 14) and generated candidate themes. They met a total of three times and developed the themes iteratively. Once consensus was reached, the reviewers finalized the list of themes and applied it to the narrative comments. A third reviewer was available to resolve any discrepancies.
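As an illustration of the quantitative tabulation, the following is a minimal sketch in Python. It assumes a hypothetical Qualtrics CSV export with one row per participant and one column per task, and the file path, column names, and function names are illustrative assumptions rather than the scripts used in this study.

```python
import pandas as pd

# 5-point Likert levels used in the survey, from most negative to most positive.
LIKERT_ORDER = [
    "Highly Inappropriate",
    "Inappropriate",
    "Neutral",
    "Appropriate",
    "Highly Appropriate",
]

def tabulate_ratings(csv_path: str, task_columns: list[str]) -> pd.DataFrame:
    """Count how many participants chose each Likert level for each task."""
    responses = pd.read_csv(csv_path)
    counts = {
        task: responses[task].value_counts().reindex(LIKERT_ORDER, fill_value=0)
        for task in task_columns
    }
    # Rows = tasks, columns = Likert levels.
    return pd.DataFrame(counts).T.astype(int)

def collapse_ratings(counts: pd.DataFrame) -> pd.DataFrame:
    """Collapse the 5-point scale into positive, neutral, and negative totals."""
    summary = pd.DataFrame({
        "positive": counts["Appropriate"] + counts["Highly Appropriate"],
        "neutral": counts["Neutral"],
        "negative": counts["Inappropriate"] + counts["Highly Inappropriate"],
    })
    return summary.sort_values("positive", ascending=False)
```

Under these assumptions, the "positive" column of the collapsed summary corresponds to the per-task positive-rating counts reported in the Results.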
Results
We recruited practicing clinicians from internal medicine, otolaryngology, ophthalmology, pediatrics, urology, anesthesiology, neurosurgery, and general surgery. We distributed a prescreening survey to 350 clinicians, among whom 108 responded, and 30 were eligible and enrolled. All completed the survey. Their demographics are shown in [Table 1].
Table 1
Participant information
Survey characteristics | N (%)
Clinical Training
  1–2 years | 12 (40)
  3–5 years | 3 (10)
  > 5 years | 15 (50)
Informatics Training
  None | 26 (86.7)
  1–2 years | 3 (10)
  3+ years | 1 (3.3)
LLM use within the past 12 months
  < 50 hours | 28 (93.3)
  50+ hours | 2 (6.7)
Abbreviations: LLM, large language model; N, number; %, percent.
Survey Ratings
Heat maps of the ratings for clinical, research, and education tasks are shown in [Tables 2], [3], and [4]. Of the 23 tasks, 16 (69.6%) received positive ratings from at least half of the participants. The highest rated tasks were “assist with vaccine development by predicting the antigenicity of different proteins from genomic data” (25 positive ratings from 30 participants), “model the spread and transmission of an infectious disease” (25 positive ratings), “generate case studies for training purposes” (24 positive ratings), “monitor data for an emerging disease cluster” (24 positive ratings), and “generate alerts to improve compliance with clinical guidelines” (24 positive ratings). In contrast, 7 of the 23 tasks received positive ratings from fewer than half of the participants. Two of the tasks with the lowest number of positive ratings also had the highest number of negative ratings: “respond to patient questions about a radiology report” (7 positive ratings, 16 negative ratings) and “write an original scientific manuscript” (5 positive ratings, 20 negative ratings).
Table 2
Heatmap of ratings for large language model uses in clinical practice tasks (orange = lowest; yellow = highest)
Table 3
Heatmap of ratings for large language model uses in research tasks (orange = lowest; yellow = highest)
Table 4
Heatmap of ratings for large language model uses in education tasks (orange = lowest; yellow = highest)
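Heat maps of this kind could be rendered from the per-task counts table sketched above. The following is a minimal sketch using seaborn; the plotting library, the reversed yellow-orange color map (chosen only to approximate the published orange-to-yellow scheme), and the file names are assumptions, not the tools used to produce the published figures.

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_rating_heatmap(counts, title: str, outfile: str) -> None:
    """Render a tasks-by-Likert-level heat map of response counts.

    `counts` is a DataFrame with tasks as rows and Likert levels as columns,
    such as the output of the hypothetical tabulate_ratings() sketch above.
    """
    fig, ax = plt.subplots(figsize=(8, 0.4 * len(counts) + 2))
    sns.heatmap(
        counts,
        annot=True,        # print the raw count in each cell
        fmt="d",           # integer formatting for the annotations
        cmap="YlOrBr_r",   # low counts shade toward orange/brown, high toward yellow
        cbar=False,
        ax=ax,
    )
    ax.set_title(title)
    fig.tight_layout()
    fig.savefig(outfile, dpi=300)
    plt.close(fig)

# Example usage (hypothetical file and column names):
# counts = tabulate_ratings("survey_export.csv", clinical_task_columns)
# plot_rating_heatmap(counts, "LLM uses in clinical practice tasks", "table2_heatmap.png")
```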
Thematic Analysis
We received 20 open-ended responses each about LLM strengths, limitations, and ethical concerns. The responses about the limitations and ethical concerns of LLMs were very similar; therefore, we combined them for a total of 40 responses. There were 19 responses about additional uses of LLMs. The themes and corresponding examples are shown in [Table 5]. Some respondent answers addressed multiple themes and were mapped to each of them. The full responses to our open-ended questions are shown in [Supplementary Appendix 2] (available in the online version only).
Table 5
Summary of narrative comments about perceived advantages, ethical concerns, and clinical applications of large language models with representative examples
Advantages (n = 20)
- Aptitude for specific tasks (n = 10): “Ability to generate first drafts with low effort”; “It can also help students and providers come up with a differential diagnosis”; “Theoretically could reduce paperwork/administrative work”; “Ability to write code for novice programmers”
- Synthesis ability (n = 9): “Synthesize large amounts of data quickly”; “Good at synthesizing information in a clear concise fashion”
- Efficiency (n = 8): “Saves time and improves efficiency”
- Accuracy (n = 5): “Fairly accurate and provide higher quality, more personalized information than most patient-facing information available on the internet”
- Accessibility (n = 4): “Translating medical documents into plain English”
Ethical concerns (n = 40)
- False information (n = 15): “Hallucination, fabrication, reinforcement of assumptions and biases”; “With the confabulation/hallucination issue, does not allow for the uncertainty that is almost always present in medicine”; “Its propensity to make up information”
- Worsens patient care (n = 14): “This technology if unchecked at a patient care level may have serious implications of harm to patients”; “Major concern about inappropriate use by lay public to self-diagnose”; “Also worry about who gets care from health care workers versus from direct-to-patient LLM which could be less personalized, and initially less validated and trustworthy”
- Data bias (n = 12): “Results are only as good as the datasets that are fed into the LLM”; “Given the fast pace of evidence in health care, can be trained on old evidence”; “Poor data quality leads to poor answers”; “Replicates existing biases”
- Human oversight critical (n = 7): “They should not replace informed decision-making for patients or clinical decision-making for doctors completely”
- Impersonal (n = 6): “Worry about who gets care from health care workers versus from direct-to-patient LLM which could be less personalized”
- Legal concerns (n = 5): “Gray area of ethical/legal limitations”
- Privacy (n = 3): “Worries of patient confidentiality”
- Worsens clinicians (n = 2): “If we become reliant on LLMs, we may lose opportunities to practice interpreting/synthesizing data ourselves”
Recommended clinical applications for using LLMs (n = 19)
- Drafting documentation (n = 8): “Note templates/drafts, especially for routine and predictable things like procedures”
- Decision support (n = 5): “Anything providing recommendations to patients or providers”; “Flagging concerning trends (VS, laboratory values) earlier; providing guidance in managing chronic conditions”
- Patient communication (n = 4): “Drafting replies to patient messages in the outpatient inbox that are modeled off of the provider's communication style”
Abbreviation: LLM, large language model; VS, vital signs.
Discussion
LLMs promise to transform health care. A human-centered approach is critical to ensure ethical and effective implementation of this powerful technology in clinical settings. This was the first study in which practicing clinicians, most of them novice LLM users from diverse health care domains, rated tasks that may be improved by LLMs.
The fundamental theorem of biomedical informatics is user augmentation, so that “a person working in partnership with an information resource is better than that same person unassisted.”[26] Similarly, the clinicians whom we studied were supportive of having LLMs as their assistants. The tasks that leveraged LLMs in supportive roles were rated the highest. In the qualitative analysis, an emerging theme was that LLMs are highly skilled at a range of tasks; however, there were ethical concerns about using the technology. Supportive LLM roles may have been more popular because, in those scenarios, clinicians could correct for false information that the algorithms might generate.
Therefore, we expect that clinicians would prefer to have LLMs function more like trainees or physician extenders than like attending physicians. LLMs could assist clinicians by drafting notes and reports, making suggestions for patient triage, extracting patient information from charts, and identifying discrepancies from standard patient care. Since LLMs are very skilled at processing large amounts of data, they could help monitor patients in critical care and perioperative settings. Also, they could help translate medical information between languages, or from technical jargon into layperson language. The contributions of LLMs to these tasks could then be reviewed by a clinician.
However, the notion of allowing LLMs to function without supervision in clinical practice raises ethical concerns. They have a propensity to produce false information and propagate data bias, which could lead to incorrect medical decisions. Furthermore, LLMs lack human empathy, which could be a source of mistrust with patients. Instead, patients are more likely to trust medical advice from a clinician because of the human connection. Overall, we believe that clinicians would prefer to have LLMs assist them instead of replace their practice.
Our study participants were also encouraging of LLM assistance in the research and education domains. In research, the processing power of LLMs could allow them to help with a range of statistical analyses. Also, their linguistic capabilities could help translate ideas across human and programming languages. Those skills could be especially useful in large research networks, which include individuals from different countries with varying programming skills. However, having LLMs author an original manuscript instead of a researcher would raise ethical concerns similar to those of allowing LLMs to function as autonomous clinicians. The education tasks raised the fewest ethical concerns, perhaps because students have regular supervision and a smaller role in direct patient care than clinicians or researchers.
Our sampling method followed a defined set of recruitment criteria and enrolled a total of 30 practicing clinicians who completed the survey. Although a larger number of respondents would have been desirable, our sample covered a variety of clinical domains and provided valuable, original insights regarding the ethical and reliable uses of LLMs in clinical settings. Given the unusually rapid evolution of LLM technology, this early study is timely and makes meaningful contributions by including the voices of key stakeholders in implementing LLMs for clinical tasks.
A limitation of our study, and a potential source of sampling bias, is that only a relatively small number of participants from a single medical center were recruited by convenience sampling. In addition, we used self-reported data as key elements of our analysis. These data may have introduced biases due to varying accuracy in self-reports and varying awareness of the problems among reporting individuals. Despite these limitations, we developed an instrument that is capable of discerning different opinions about LLM use. We hope our findings will be taken into consideration by developers as the field continues its rapid evolution. As clinicians gain more substantial experience with this technology, subsequent studies can build on our methods to examine larger samples at multiple institutions and gain additional insights for future directions.
Future studies with a larger and more diverse sample will be warranted to ensure the generalizability of the results and allow for stratification by variables that could affect perceptions of LLM use, such as age, duration of clinical training, provider specialty, and experience with the technology. Those perceptions could be tracked longitudinally to gauge how they change over time. A more robust assessment of participants' general knowledge of LLMs and AI would also strengthen future studies. Specifically, gauging the extent to which participants understand how an AI algorithm works, predicts, learns, and generates responses would be a valuable part of such an analysis. Furthermore, comparing perceptions of different LLMs, and how LLM-generated errors compare with human errors, may provide a more balanced view of the technology.
Our study found that health care providers would prefer to have LLMs assist rather than replace them. That finding has implications for the future development and implementation of LLMs in clinical practice, research, and education. Studying active clinicians with novice LLM experience helped identify that preference. Therefore, for optimal development and implementation of LLMs in health care, continued human-centered development is critical.
Conclusion
Clinicians are generally supportive of the use of LLMs for many tasks in clinical practice, research, and education, especially where LLMs play a supportive role to humans. Continued human-centered development of the technology is critical.
Clinical Relevance Statement
We surveyed health care providers about the best uses of LLMs in health care. The clinicians whom we studied were supportive of having LLMs assist them with a range of tasks. The results of our work have implications for the implementation of LLMs in health care.
Multiple Choice Questions
- Which of the following is an ethical concern about LLM use?
  a. Efficiency
  b. Confabulation or hallucination
  c. Ability to synthesize information
  d. Capacity to make technical language accessible
  Answer: b. Confabulation or hallucination can cause LLMs to generate false information, which can lead to incorrect medical decisions. The other answer choices are advantages of the technology.
- What is the fundamental theorem of biomedical informatics?
  a. An information resource is better without assistance from a person.
  b. An information resource working in partnership with a person is better than an information resource unassisted.
  c. A person working in partnership with an information resource is better than that same person unassisted.
  d. A person is better without an information resource.
  Answer: c. The fundamental theorem of biomedical informatics states that people are more effective when partnered with an information resource. The alternatives, which describe people or information resources working without a partner, or people assisting information resources, are less effective.