Keywords
artificial intelligence - clinical decision support - clinical informatics - clinical information systems
Background and Significance
Large language models (LLMs), which are a type of artificial intelligence (AI), are designed to process and understand human language. They are typically trained on massive amounts of text. For example, ChatGPT is a prominent LLM that has garnered a great deal of public attention for its capabilities since its introduction in late 2022.[1][2][3] The possible health care applications of LLMs are numerous. Representative examples include generating clinical documentation, personalized educational materials, and original scientific manuscripts.[4][5]
One well-known limitation of ChatGPT is its tendency for “hallucination,” the generation of text that is perceived as convincing but is not accurate.[6] A second limitation is that ChatGPT can propagate bias that is intrinsic to the training data. These issues have raised concerns about the safety of LLM use in health care. Specifically, some researchers envision scenarios in which ChatGPT could provide clinical care advice that is outdated, inaccurate, or incomplete.[7][8][9][10]
Determining the best uses of LLMs in health care has been the focus of recent studies. In a prior publication, clinicians with informatics expertise evaluated LLMs for clinical decision support and concluded that they may provide valuable assistance (Liu et al, 2023).[11] However, that study did not address the experience of novice LLM users. Furthermore, few studies have investigated health care provider comfort with LLMs or used both quantitative and qualitative methods. Those studies either asked general questions about the suitability of LLMs in different health care domains[12][13] or compared LLM performance with human performance on a single health care delivery service.[14]
Presently, applications of LLMs are being developed at a rapid pace and could see widespread adoption within health care by novice and expert users alike. The most ethical and effective implementation of the technology must consider the requirements and concerns of representative stakeholders in the clinical setting. In this study, we surveyed diverse practicing clinicians about using LLMs for tasks in clinical practice, research, and education and summarized their perceptions of the potential and limitations of LLMs. Our aim was to inform the development of clinically meaningful evaluation standards that support the appropriate and ethical implementation of LLMs in clinical settings.
Methods
Study Design and Sampling
The survey instrument, which is shown in [Supplementary Appendix 1] (available in the online version only), was developed by two authors with both clinical and informatics experience (M.S., B.I.) and refined based on feedback from a third author (E.R.G.). It was implemented through Qualtrics (Qualtrics, Provo, UT), took approximately 15 minutes to complete, and gauged clinicians' perceptions on the appropriateness of using LLMs in clinical practice, research, and education.
The opening questions quantified participants' experience in clinical medicine and informatics with multiple-choice answers. Next, participants were asked whether their LLM use in health care within the past year had exceeded 50 hours. Then, a series of questions prompted participants to rate the appropriateness of LLM use for 23 different tasks in clinical practice, research, and education on a 5-point Likert scale (i.e., Highly Appropriate to Highly Inappropriate). Those tasks represented a sample of proposed LLM uses synthesized from the literature and included, but were not limited to, optimizing alerts for clinical decision support, providing a differential diagnosis, writing a discharge summary, recommending treatment options, translating radiology reports into layperson language, writing scientific manuscripts, and generating personalized study plans for students or trainees.[2][6][7][8][15][16][17][18][19][20][21][22][23][24][25]
We distributed the proposed clinical practice tasks over two questions and had one question each for research tasks and education tasks. The purpose of this section was to measure the perceived appropriateness of these tasks by category and to determine whether any individual tasks were negative or positive outliers. We hypothesized that perceptions of the strengths, limitations, and ethical concerns of LLMs could contribute to the ratings. Therefore, we included open-ended questions about each of these topics, as well as an open-ended question about other possible uses of LLMs.
Data Collection
We recruited participants with an email invitation that was sent to a listserv of clinicians at Columbia University Irving Medical Center and by word of mouth. To be eligible, participants needed to be practicing clinicians affiliated with Columbia University within the past 12 months and were able to comprehend and communicate fluently in English. Respondents were compensated with a $20 Amazon Gift Card for completing the survey.
Data Analysis
We calculated descriptive statistics on the participants and tabulated their ratings for each question. Two independent reviewers performed an inductive thematic analysis of the narrative comments. Each coded the free-text responses independently in NVivo (Version 14) and generated candidate themes. They met a total of three times and developed the themes iteratively. Once consensus was reached, the reviewers finalized the list of themes and applied it to the narrative comments. A third reviewer was available to resolve any discrepancies.
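As an illustration of the quantitative tabulation, the following is a minimal sketch in Python. It assumes a hypothetical Qualtrics CSV export with one row per participant and one column per task, and the file path, column names, and function names are illustrative assumptions rather than the scripts used in this study.

```python
import pandas as pd

# 5-point Likert levels used in the survey, from most negative to most positive.
LIKERT_ORDER = [
    "Highly Inappropriate",
    "Inappropriate",
    "Neutral",
    "Appropriate",
    "Highly Appropriate",
]

def tabulate_ratings(csv_path: str, task_columns: list[str]) -> pd.DataFrame:
    """Count how many participants chose each Likert level for each task."""
    responses = pd.read_csv(csv_path)
    counts = {
        task: responses[task].value_counts().reindex(LIKERT_ORDER, fill_value=0)
        for task in task_columns
    }
    # Rows = tasks, columns = Likert levels.
    return pd.DataFrame(counts).T.astype(int)

def collapse_ratings(counts: pd.DataFrame) -> pd.DataFrame:
    """Collapse the 5-point scale into positive, neutral, and negative totals."""
    summary = pd.DataFrame({
        "positive": counts["Appropriate"] + counts["Highly Appropriate"],
        "neutral": counts["Neutral"],
        "negative": counts["Inappropriate"] + counts["Highly Inappropriate"],
    })
    return summary.sort_values("positive", ascending=False)
```

Under these assumptions, the "positive" column of the collapsed summary corresponds to the per-task positive-rating counts reported in the Results.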
Results
We recruited practicing clinicians from internal medicine, otolaryngology, ophthalmology, pediatrics, urology, anesthesiology, neurosurgery, and general surgery. We distributed a prescreening survey to 350 clinicians, among whom 108 responded, and 30 were eligible and enrolled. All completed the survey. Their demographics are shown in [Table 1].
Table 1
Participant information
Survey characteristics | N (%)
Clinical Training
  1–2 years | 12 (40)
  3–5 years | 3 (10)
  > 5 years | 15 (50)
Informatics Training
  None | 26 (86.7)
  1–2 years | 3 (10)
  3+ years | 1 (3.3)
LLM use within the past 12 months
  < 50 hours | 28 (93.3)
  50+ hours | 2 (6.7)
Abbreviations: LLM, large language model; N, number; %, percent.
Survey Ratings
Heat maps of the ratings for clinical, research, and education tasks are shown in [Tables 2], [3], and [4]. Of the 23 tasks, 16 (69.6%) received positive ratings from at least half of the participants. The highest rated tasks were “assist with vaccine development by predicting the antigenicity of different proteins from genomic data” (25 positive ratings from 30 participants), “model the spread and transmission of an infectious disease” (25 positive ratings), “generate case studies for training purposes” (24 positive ratings), “monitor data for an emerging disease cluster” (24 positive ratings), and “generate alerts to improve compliance with clinical guidelines” (24 positive ratings). In contrast, 7 of the 23 tasks received positive ratings from fewer than half of the participants. Two of the tasks with the lowest number of positive ratings also had the highest number of negative ratings: “respond to patient questions about a radiology report” (7 positive ratings, 16 negative ratings) and “write an original scientific manuscript” (5 positive ratings, 20 negative ratings).
Table 2
Heatmap of ratings for large language model uses in clinical practice tasks (orange = lowest; yellow = highest)
Table 3
Heatmap of ratings for large language model uses in research tasks (orange = lowest; yellow = highest)
Table 4
Heatmap of ratings for large language model uses in education tasks (orange = lowest; yellow = highest)
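Heat maps of this kind could be rendered from the per-task counts table sketched above. The following is a minimal sketch using seaborn; the plotting library, the reversed yellow-orange color map (chosen only to approximate the published orange-to-yellow scheme), and the file names are assumptions, not the tools used to produce the published figures.

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_rating_heatmap(counts, title: str, outfile: str) -> None:
    """Render a tasks-by-Likert-level heat map of response counts.

    `counts` is a DataFrame with tasks as rows and Likert levels as columns,
    such as the output of the hypothetical tabulate_ratings() sketch above.
    """
    fig, ax = plt.subplots(figsize=(8, 0.4 * len(counts) + 2))
    sns.heatmap(
        counts,
        annot=True,        # print the raw count in each cell
        fmt="d",           # integer formatting for the annotations
        cmap="YlOrBr_r",   # low counts shade toward orange/brown, high toward yellow
        cbar=False,
        ax=ax,
    )
    ax.set_title(title)
    fig.tight_layout()
    fig.savefig(outfile, dpi=300)
    plt.close(fig)

# Example usage (hypothetical file and column names):
# counts = tabulate_ratings("survey_export.csv", clinical_task_columns)
# plot_rating_heatmap(counts, "LLM uses in clinical practice tasks", "table2_heatmap.png")
```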
Thematic Analysis
We received 20 open-ended responses each about LLM strengths, limitations, and ethical concerns. The responses about the limitations and ethical concerns of LLMs were very similar; therefore, we combined them for a total of 40 responses. There were 19 responses about additional uses of LLMs. The themes and corresponding examples are shown in [Table 5]. Some respondent answers addressed multiple themes and were mapped to each of them. The full responses to our open-ended questions are shown in [Supplementary Appendix 2] (available in the online version only).
Table 5
Summary of narrative comments about perceived advantages, ethical concerns, and clinical applications of large language models with representative examples
Advantages (n = 20)
- Aptitude for specific tasks (n = 10): “Ability to generate first drafts with low effort”; “It can also help students and providers come up with a differential diagnosis”; “Theoretically could reduce paperwork/administrative work”; “Ability to write code for novice programmers”
- Synthesis ability (n = 9): “Synthesize large amounts of data quickly”; “Good at synthesizing information in a clear concise fashion”
- Efficiency (n = 8): “Saves time and improves efficiency”
- Accuracy (n = 5): “Fairly accurate and provide higher quality, more personalized information than most patient-facing information available on the internet”
- Accessibility (n = 4): “Translating medical documents into plain English”
Ethical concerns (n = 40)
- False information (n = 15): “Hallucination, fabrication, reinforcement of assumptions and biases”; “With the confabulation/hallucination issue, does not allow for the uncertainty that is almost always present in medicine”; “Its propensity to make up information”
- Worsens patient care (n = 14): “This technology if unchecked at a patient care level may have serious implications of harm to patients”; “Major concern about inappropriate use by lay public to self-diagnose”; “Also worry about who gets care from health care workers versus from direct-to-patient LLM which could be less personalized, and initially less validated and trustworthy”
- Data bias (n = 12): “Results are only as good as the datasets that are fed into the LLM”; “Given the fast pace of evidence in health care, can be trained on old evidence”; “Poor data quality leads to poor answers”; “Replicates existing biases”
- Human oversight critical (n = 7): “They should not replace informed decision-making for patients or clinical decision-making for doctors completely”
- Impersonal (n = 6): “Worry about who gets care from health care workers versus from direct-to-patient LLM which could be less personalized”
- Legal concerns (n = 5): “Gray area of ethical/legal limitations”
- Privacy (n = 3): “Worries of patient confidentiality”
- Worsens clinicians (n = 2): “If we become reliant on LLMs, we may lose opportunities to practice interpreting/synthesizing data ourselves”
Recommended clinical applications for using LLMs (n = 19)
- Drafting documentation (n = 8): “Note templates/drafts, especially for routine and predictable things like procedures”
- Decision support (n = 5): “Anything providing recommendations to patients or providers”; “Flagging concerning trends (VS, laboratory values) earlier; providing guidance in managing chronic conditions”
- Patient communication (n = 4): “Drafting replies to patient messages in the outpatient inbox that are modeled off of the provider's communication style”
Abbreviation: LLM, large language model; VS, vital signs.
Discussion
LLMs promise to transform health care. A human-centered approach is critical to ensure ethical and effective implementation of this powerful technology in clinical settings. This was the first study in which practicing clinicians, most of them novice LLM users from diverse health care domains, rated tasks that may be improved by LLMs.
The fundamental theorem of biomedical informatics is user augmentation, so that “a person working in partnership with an information resource is better than that same person unassisted.”[26] Similarly, the clinicians whom we studied were supportive of having LLMs as their assistants. The tasks that leveraged LLMs in supportive roles were rated the highest. In the qualitative analysis, an emerging theme was that LLMs are highly skilled at a range of tasks; however, there were ethical concerns about using the technology. Supportive LLM roles may have been more popular because, in those scenarios, clinicians could correct for false information that the algorithms might generate.
Therefore, we expect that clinicians would prefer to have LLMs function more like trainees or physician extenders than like attending physicians. LLMs could assist clinicians by drafting notes and reports, making suggestions for patient triage, extracting patient information from charts, and identifying discrepancies from standard patient care. Since LLMs are very skilled at processing large amounts of data, they could help monitor patients in critical care and perioperative settings. Also, they could help translate medical information between languages, or from technical jargon into layperson language. The contributions of LLMs to these tasks could then be reviewed by a clinician.
However, the notion of allowing LLMs to function without supervision in clinical practice raises ethical concerns. They have a propensity to produce false information and propagate data bias, which could lead to incorrect medical decisions. Furthermore, LLMs lack human empathy, which could be a source of mistrust with patients. Instead, patients are more likely to trust medical advice from a clinician because of the human connection. Overall, we believe that clinicians would prefer to have LLMs assist them instead of replace their practice.
Our study participants were also encouraging of LLM assistance in the research and education domains. In research, the processing power of LLMs could allow them to help with a range of statistical analyses. Also, their linguistic capabilities could help translate ideas across human and programming languages. Those skills could be especially useful in large research networks, which include individuals from different countries with varying programming skills. However, having LLMs author an original manuscript instead of a researcher would raise ethical concerns similar to those of allowing LLMs to function as autonomous clinicians. The education tasks raised the fewest ethical concerns, perhaps because students have regular supervision and a smaller role in direct patient care than clinicians or researchers.
Our sampling method followed a defined set of recruitment criteria and enrolled a total of 30 practicing clinicians who completed the survey. Although a larger number of respondents would have been desirable, our sample covered a variety of clinical domains and provided valuable, original insights regarding the ethical and reliable uses of LLMs in clinical settings. Given the unusually rapid evolution of LLM technology, this early study is timely and makes meaningful contributions by including the voices of key stakeholders in implementing LLMs for clinical tasks.
A limitation of our study, and a potential source of sampling bias, is that only a relatively small number of participants from a single medical center were recruited by convenience sampling. In addition, we used self-reported data as key elements of our analysis. These data may have introduced biases due to varying accuracy in self-reports and varying awareness of the problems among reporting individuals. Despite these limitations, we developed an instrument that is capable of discerning different opinions about LLM use. We hope our findings will be taken into consideration by developers as the field continues its rapid evolution. As clinicians gain more substantial experience with this technology, subsequent studies can build on our methods to examine larger samples at multiple institutions and gain additional insights for future directions.
Future studies with a larger and more diverse sample will be warranted to ensure the generalizability of the results and allow for stratification by variables that could affect perceptions of LLM use, such as age, duration of clinical training, provider specialty, and experience with the technology. Those perceptions could be tracked longitudinally to gauge how they change over time. A more robust assessment of participants' general knowledge of LLMs and AI would also strengthen future studies. Specifically, gauging the extent to which participants understand how an AI algorithm works, predicts, learns, and generates responses would be a valuable part of such an analysis. Furthermore, comparing perceptions of different LLMs, and how LLM-generated errors compare with human errors, may provide a more balanced view of the technology.
Our study found that health care providers would prefer to have LLMs assist rather than replace them. That finding has implications for the future development and implementation of LLMs in clinical practice, research, and education. Studying active clinicians with novice LLM experience helped identify that preference. Therefore, for optimal development and implementation of LLMs in health care, continued human-centered development is critical.
Conclusion
Clinicians are generally supportive of the use of LLMs for many tasks in clinical practice, research, and education, especially where LLMs play a supportive role to humans. Continued human-centered development of the technology is critical.
Clinical Relevance Statement
We surveyed health care providers about the best uses of LLMs in health care. The clinicians whom we studied were supportive of having LLMs assist them with a range of tasks. The results of our work have implications for the implementation of LLMs in health care.
Multiple Choice Questions
- Which of the following is an ethical concern about LLM use?
  a. Efficiency
  b. Confabulation or hallucination
  c. Ability to synthesize information
  d. Capacity to make technical language accessible
  Answer: b. Confabulation or hallucination can cause LLMs to generate false information, which can lead to incorrect medical decisions. The other answer choices are advantages of the technology.
- What is the fundamental theorem of biomedical informatics?
  a. An information resource is better without assistance from a person.
  b. An information resource working in partnership with a person is better than an information resource unassisted.
  c. A person working in partnership with an information resource is better than that same person unassisted.
  d. A person is better without an information resource.
  Answer: c. The fundamental theorem of biomedical informatics states that people are more effective when partnered with an information resource. The alternatives, which describe people or information resources working without a partner, or people assisting information resources, are less effective.