Keywords AI - medical - education
Introduction
The field of medical education is continually evolving, with advancements in technology reshaping the way medical students are trained and assessed. One technological innovation that has garnered significant attention in recent years is the integration of large language models (LLMs) [1]. A significant advantage of LLMs such as ChatGPT is their ability to provide explanations for solutions, making it easier for students to understand the structure of exam questions. Learning content can be tailored to the userʼs knowledge level, and the chat function allows interactive learning.
LLMs, such as ChatGPT, are supported by deep neural networks and have been trained on vast datasets. These models have profound text analysis and generation capabilities, making them exceptionally promising tools for both medical practice and education [2] [3].
As an indispensable component of medical practice, radiology necessitates a profound comprehension of intricate imaging studies and their clinical implications. Medical students, on their journey toward becoming proficient healthcare professionals, undergo rigorous training and examinations to gain the requisite skills and knowledge. Although artificial intelligence in diagnostic radiology has primarily centered on image analysis, there has been growing enthusiasm surrounding the potential applications of LLMs, including ChatGPT, within radiology [4] [5] [6]. These applications encompass a wide spectrum, including radiology education, assistance in differential diagnoses, computer-aided diagnosis, and disease classification [5] [6] [7].
If these LLMs can demonstrate accuracy and reliability, they have the potential to serve as invaluable resources for learners, enabling rapid responses to inquiries and the simplification of intricate concepts. ChatGPT has already been investigated for its potential to streamline radiology reports and facilitate clinical decision-making [8] [9]. Furthermore, LLMs have performed commendably in a diverse array of professional examinations, even without specialized domain pretraining [10]. In the realm of medicine, they have shown convincing results on medical examinations [11] [12] [13].
The aim of this study was to explore and evaluate the performance of LLMs in radiology examinations for medical students in order to provide insight into the present capabilities and implications of LLMs.
Methods
This exploratory prospective study was carried out from August to October 2023. We obtained informed consent from the head of the institute to utilize the instituteʼs own radiology examination questions for medical students.
Multiple-Choice Question Selection and Classification
A total of 200 multiple-choice questions, each featuring four incorrect answers and one correct answer, were identified using the database of our radiology institute. These questions were originally designed for the radiology examination for medical students at our hospital. The exclusion criteria comprised questions containing images (n = 40) and questions with multiple correct answers (n = 9). After this selection process, 151 questions remained. The questions were then either submitted through OpenAIʼs API for GPT-3.5 and GPT-4 or manually pasted into the user interface (UI) of Perplexity AI (GPT-3.5 + Bing). To avoid the influence of previous responses on the modelʼs output, a new ChatGPT session was initiated for each query. All questions were asked in three separate ChatGPT sessions, and the average performance was calculated.
A simple prompt for the question was used in the following form:
Question: {question text} A: {answer A} B: {answer B} C: {answer C} D: {answer D} E: {answer E}
For the initial prompt we used:
You are an expert radiologist. Answer the following multiple-choice question in the form: <Single letter (answer)> <Text explaining the reason>
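The prompt template above can be reproduced programmatically; a minimal sketch is shown below. The question text and answer options are illustrative placeholders, not items from the actual exam, and the helper name `build_prompt` is our own.

```python
# Sketch of the prompt construction described above. The system prompt is
# taken verbatim from the study; the example question is hypothetical.
SYSTEM_PROMPT = (
    "You are an expert radiologist. Answer the following multiple-choice "
    "question in the form: <Single letter (answer)> <Text explaining the reason>"
)

def build_prompt(question, options):
    """Format one multiple-choice question as a single user prompt,
    following the template 'Question: ... A: ... B: ... E: ...'."""
    lines = [f"Question: {question}"]
    lines += [f"{letter}: {text}" for letter, text in sorted(options.items())]
    return " ".join(lines)

# Illustrative usage (placeholder content):
example = build_prompt(
    "Which modality is first-line for suspected acute cholecystitis?",
    {"A": "CT", "B": "Ultrasound", "C": "MRI", "D": "PET", "E": "Radiograph"},
)
```

In the study, each prompt was sent in a fresh session (via the OpenAI API or the Perplexity AI UI) so that earlier answers could not influence the modelʼs output.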
The outputs were restructured and combined for statistical analysis. A passing score was considered to be 60% or above. Additionally, the questions were categorized by type as either lower- or higher-order thinking questions, along with their subject matter, as detailed in [Table 1]. Lower-order thinking encompasses tasks related to remembering and basic understanding, while higher-order thinking involves the application, analysis, and evaluation of concepts. Both the higher-order and the lower-order thinking questions were further subclassified by type (clinical management, description of imaging findings, diagnosis, comprehension, knowledge). Each question underwent independent classification by two radiologists. A flowchart of the study design is displayed in [Fig. 1] [14].
Table 1 Performance of LLMs and medical students stratified by question type and topic.

| Category | Total (N) | Students (N / %) | GPT-3.5 (N / %) | GPT-3.5 + Bing (N / %) | GPT-4 (N / %) |
|---|---|---|---|---|---|
| All questions | 151 | 115.3 / 76.3 | 103 / 68.2 | 108 / 71.5 | 134 / 88.7 |
| Bone | 26 | 19.4 / 74.4 | 17 / 65.4 | 18 / 69.2 | 23 / 88.5 |
| Breast | 5 | 3.7 / 73.2 | 2 / 40 | 2 / 40 | 4 / 80 |
| Cardiovascular | 13 | 9.9 / 76.9 | 11 / 84.6 | 10 / 76.9 | 12 / 92.3 |
| Chest | 19 | 13.6 / 71.7 | 14 / 73.7 | 14 / 73.7 | 17 / 89.5 |
| Gastrointestinal | 16 | 12.8 / 80 | 11 / 68.8 | 12 / 75 | 15 / 93.8 |
| Genitourinary | 6 | 4.2 / 70 | 4 / 66.7 | 5 / 83.3 | 5 / 83.3 |
| Head and neck | 39 | 31.1 / 79.8 | 28 / 71.8 | 29 / 74.4 | 34 / 87.2 |
| Physics | 11 | 8.5 / 77.4 | 7 / 63.6 | 7 / 63.6 | 10 / 90.9 |
| Systemic | 16 | 12 / 75.1 | 9 / 56.3 | 11 / 68.8 | 14 / 87.5 |
| Clinical management | 37 | 29.1 / 78.7 | 26 / 70.3 | 27 / 72.9 | 33 / 89.2 |
| Description of imaging findings | 27 | 20.2 / 74.8 | 16 / 59.3 | 21 / 77.8 | 23 / 85.2 |
| Diagnosis | 23 | 17.2 / 74.7 | 16 / 69.6 | 14 / 60.8 | 22 / 95.7 |
| Comprehension | 28 | 21.3 / 76.3 | 21 / 75 | 23 / 82.1 | 25 / 89.3 |
| Knowledge | 36 | 27.5 / 76.3 | 24 / 66.7 | 23 / 63.9 | 31 / 86.1 |
| Higher-order | 87 | 66.5 / 76.4 | 58 / 66.7 | 62 / 71.3 | 78 / 89.7 |
| Lower-order | 64 | 48.8 / 76.3 | 45 / 70.3 | 46 / 71.9 | 56 / 87.5 |
Fig. 1 Flowchart of the study design. From our initial 200 exam questions, 151 remained after excluding questions with images and questions with more than one correct answer. The questions were then either submitted via OpenAIʼs API for GPT-3.5 and GPT-4 or manually pasted into the UI of Perplexity AI (GPT-3.5 + Bing). The outputs were restructured and combined for statistical analysis. Abbreviations: MC: multiple choice; API: application programming interface; UI: user interface.
Large language models (LLMs)
ChatGPT (August 3, 2023 version, OpenAI) and Perplexity AI were used in this study. There are two versions of ChatGPT: ChatGPT, which is based on GPT-3.5, and ChatGPT Plus, which utilizes the more advanced GPT-4. In this study, we used the two underlying LLMs directly via the OpenAI API. No specialized radiology-specific pretraining was conducted for either of these models. It is important to highlight that GPT-3.5 and GPT-4, being self-contained LLMs, lack the capability to access the internet or external databases for information retrieval. In contrast, Perplexity AI (GPT-3.5 + Bing) has the capacity to search the internet.
Medical students
The study included a cohort of 621 medical students who were in their first clinical semester, typically corresponding to their third year in medical school.
Prior to entering the clinical phase, the students completed two years of preclinical education, which included foundational courses in anatomy, physiology, biochemistry, pathology, and basic medical sciences. At the time of the study, the students had completed an introductory course in radiology. However, their exposure to advanced radiological topics was limited compared to more senior students and residents.
Statistical analysis
Statistical analysis was performed using Python (version 3.11). The McNemar test was used to determine the statistical significance of differences in the performance of the LLMs. This was also done for subgroups by question type and topic. For overall model performance, we report the standard accuracy score.
To quantify the comparative performance of the LLMs and the medical students, we performed an odds ratio analysis. For each comparison, we set up a 2×2 contingency table summarizing the number of correct and incorrect answers for the two groups being compared. We then calculated p-values using Fisherʼs exact test. A p-value of less than 0.05 was considered statistically significant. No correction for guessing was performed, since the passing score of our exam already accounts for guessing.
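The comparisons described above can be sketched in Python with SciPy. The answer vectors below are small illustrative placeholders (1 = correct, 0 = incorrect), not the actual study data; the exact McNemar test is implemented here via a binomial test on the discordant pairs, which is one standard formulation.

```python
# Hedged sketch of the statistical comparisons: McNemar's exact test for
# two models answering the same questions, and Fisher's exact test / odds
# ratio for aggregate correct-vs-incorrect counts. Data are illustrative.
from scipy.stats import binomtest, fisher_exact

gpt4_correct  = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # placeholder results
gpt35_correct = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]  # placeholder results

# McNemar's test depends only on discordant pairs: questions where
# exactly one of the two models answered correctly.
only_gpt4  = sum(a == 1 and b == 0 for a, b in zip(gpt4_correct, gpt35_correct))
only_gpt35 = sum(a == 0 and b == 1 for a, b in zip(gpt4_correct, gpt35_correct))
mcnemar_p = binomtest(min(only_gpt4, only_gpt35),
                      only_gpt4 + only_gpt35, 0.5).pvalue

# 2x2 contingency table: rows = group, columns = correct / incorrect.
table = [[sum(gpt4_correct),  len(gpt4_correct)  - sum(gpt4_correct)],
         [sum(gpt35_correct), len(gpt35_correct) - sum(gpt35_correct)]]
odds_ratio, fisher_p = fisher_exact(table)
```

The same contingency-table construction applies to each group pair (e.g., GPT-4 vs. the medical students) and to each question-type or topic subgroup.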
Results
Overall performance
The overall accuracy of GPT-3.5 for all 151 questions was 67.6%. In contrast, GPT-4 achieved significantly higher accuracy than GPT-3.5, with an overall accuracy of 88.1% (p < 0.001). No significant difference was observed between GPT-3.5 + Bing and GPT-3.5 (p = 0.44). In comparison, the overall accuracy of the medical students was 76%. All LLMs would have passed the radiology exam for medical students at our university. [Table 1] shows the overall performance of the LLMs as well as the performance stratified by question type and topic, and [Fig. 2] shows a question that was answered correctly by all LLMs.
Fig. 2 GPT-3.5/4.0 and Perplexity AI responses to one of the questions. All picked the correct answer (option B). A: GPT-3.5; B: Perplexity AI; C: GPT-4.
Performance by topic
Among the subgroups, GPT-4 exhibited the highest performance in the gastrointestinal category, correctly answering 15 out of 16 questions (93.75% accuracy). Compared with GPT-3.5 and Perplexity AI, GPT-4 demonstrated significantly superior performance on questions related to bone diseases (p = 0.03). However, subgroup analysis revealed no noteworthy variations in performance across the remaining subspecialty groups.
Questions answered incorrectly by all models
A total of seven questions were answered incorrectly by all models (Table S1). Among these, two questions pertained to the use of contrast agents in patients with renal insufficiency, while another related to MRI angiography in patients with a pacemaker.
The remaining questions that stumped all models demanded a nuanced understanding of specific details or specialized knowledge. For instance, one question pertained to renal scintigraphy, where the correct response hinged on the knowledge that Tc 99m-MAG3 is primarily secreted by proximal renal tubules and, therefore, cannot be used to estimate glomerular filtration rate. [Fig. 3] illustrates a question that was answered incorrectly by all LLMs.
Fig. 3 Response to a question answered incorrectly. Note that large language models (LLMs) frequently use assertive language in their responses, even when those responses are incorrect. Abbreviations: LLM: large language model.
Performance by question type
GPT-4 demonstrated significantly superior performance in both lower-order and higher-order questions when compared to GPT-3.5 and Perplexity AI (p = 0.01 and p < 0.001, respectively).
GPT-4 achieved the best performance across all topics and categories compared to medical students, GPT-3.5, and Perplexity AI ([Fig. 4]).
Fig. 4 Performance comparison across medical topics: medical students vs. GPT models.
Within the subgroups, GPT-4 exhibited its highest performance when responding to higher-order questions related to diagnosis. It provided correct answers for 22 out of 23 questions in this category, achieving an accuracy of 95.65%.
In contrast, GPT-3.5 and Perplexity AI exhibited their highest performance in the lower-order subgroup comprehension, with accuracies of 75.00% and 82.41%, respectively ([Table 1]). Perplexity AI demonstrated the weakest performance in the higher-order category diagnosis (60.9%) and in the lower-order category knowledge (63.9%), while GPT-3.5 performed weakest in the higher-order category description of imaging findings (59.3%) and the lower-order category knowledge (66.7%). The average medical student achieved similar performance on lower-order questions (76.27%) and higher-order questions (76.39%). The performance of the average student was relatively stable across all subgroups, with the highest accuracy on questions related to clinical management (78.7%) and the lowest on diagnosis (74.7%) ([Table 1], [Fig. 5], [Fig. 6]).
Fig. 5 Performance comparison in higher- and lower-order tasks: medical students vs. GPT models.
Fig. 6 Performance heatmap across medical topics and cognitive functions.
Odds ratio analysis
The odds ratio analysis confirmed that the overall performance of GPT-4 was significantly superior to that of GPT-3.5, Perplexity AI, and the medical students. The improved performance was particularly notable for higher-order questions, where GPT-4 showed the greatest improvement over the other GPT models and the students. For example, GPT-4 was 4.3 times more likely to correctly answer higher-order thinking questions than GPT-3.5 (p < 0.001). For lower-order thinking questions, GPT-4 still performed better, but the difference compared to the medical students was not statistically significant ([Table 2]).
Table 2 Odds ratio analysis of the performance of LLMs and medical students.

| Comparison | Odds ratio | p-value |
|---|---|---|
| All questions | | |
| GPT-4 vs. GPT-3.5 | 3.4 | 0.00002 |
| GPT-4 vs. medical students | 2.2 | 0.006 |
| GPT-4 vs. Perplexity AI | 2.9 | 0.0003 |
| Higher-order | | |
| GPT-4 vs. GPT-3.5 | 4.3 | 0.0004 |
| GPT-4 vs. medical students | 2.7 | 0.03 |
| GPT-4 vs. Perplexity AI | 3.5 | 0.004 |
| Lower-order | | |
| GPT-4 vs. GPT-3.5 | 2.9 | 0.03 |
| GPT-4 vs. Perplexity AI | 2.7 | 0.047 |
| GPT-4 vs. medical students | 2.2 | 0.11 |
Discussion
The integration of LLMs into various domains has increased remarkably in recent years, with applications ranging from natural language processing to medical diagnostics. In the field of medical education, LLMs have shown immense potential to assist and enhance the learning experience for students, particularly in radiology – a discipline that demands profound understanding of complex medical concepts and terminology.
The present study provides several key findings for understanding how advancements in LLM technology can impact medical education. First, in this exploratory prospective study, all LLMs would have passed the exam. Second, GPT-4 exhibited significantly better performance than its predecessor GPT-3.5, as well as Perplexity AI and the medical students, answering 88% of the questions correctly. Third, GPT-4 maintained the best performance across all topics and categories compared to the medical students, GPT-3.5, and Perplexity AI. Fourth, the performance improvement was particularly pronounced for higher-order questions, where GPT-4 demonstrated the most significant improvement over the other GPT models and the students. Fifth, GPT-4 demonstrated the highest performance in the gastrointestinal category, with an accuracy of 93.75%. The prevalence of gastrointestinal content in training datasets may have contributed to the modelʼs enhanced performance in this domain.
Despite the ability of Perplexity AI to search the internet, it demonstrated the weakest performance in the knowledge category. Internet searches can yield information from a wide range of sources, including those that are not peer-reviewed or scientifically accurate. Without a sophisticated mechanism to filter and prioritize high-quality, reliable sources, the model might incorporate inaccurate or outdated information. GPT-4ʼs superior performance may be attributed to advanced model enhancements, including a deeper architecture and more extensive training.
ChatGPT has demonstrated good performance in a wide range of professional examinations, including those in the medical field, even without specialized domain pretraining [10] [11] [12] [13]. For instance, it was applied to the USMLE, where ChatGPT achieved accuracy rates exceeding 50% across all examinations and surpassing 60% in certain analyses [11].
Despite the absence of radiology-specific training, ChatGPT performed commendably. When new LLMs with radiology-specific pretraining and the ability to process images become publicly available, it will be interesting to see what results can be achieved.
As LLM technology continues to advance, radiologists will need to gain comprehensive understanding of the performance and reliability of these models and of their evolving role in radiology. The development of applications built on LLMs holds promise for further enhancing radiological practice and education, ultimately benefiting both current and future healthcare professionals. However, ChatGPT is designed to discern patterns and associations among words within its training data. Consequently, we anticipate limitations in cases requiring understanding of the context of specialized technical language or specific details and specialized knowledge, such as radiological terminology used in imaging descriptions, calculations, and classification systems.
Furthermore, ChatGPT consistently employs confident language in its responses, even when those responses are incorrect. This tendency is a well-documented limitation of LLMs [15]. Even when the most probable available option is incorrect, ChatGPT tends to generate responses that sound convincingly human-like. Interestingly, increased human likeness in chatbots is associated with a higher level of trust [16]. Consequently, ChatGPTʼs inclination to produce plausible yet erroneous responses presents a significant concern when it serves as the sole source of information [17]. This concern is particularly critical for individuals who may lack the expertise to discern inaccuracies in its assertions, notably novices. As a result, this behavior currently restricts the practicality of employing ChatGPT in medical education.
To prevent a future where LLMs influence the outcome of medical and radiological exams, several measures can be taken. These include designing exam questions that necessitate critical thinking and the application of knowledge rather than mere recall, integrating practical components or simulations that cannot be easily answered by LLMs, ensuring robust exam proctoring and monitoring procedures to detect any suspicious behavior, and continually updating exam formats and content to stay ahead of potential cheating methods involving LLMs. Additionally, emphasizing the importance of genuine learning and skill acquisition can help maintain the integrity of medical exams amidst technological advancements.
Furthermore, we identified inconsistencies in ChatGPTʼs responses. In a subsequent evaluation, GPT-3.5 yielded different answers for five questions, while GPT-4 provided six different answers, but there were no significant differences in accuracy between the two models. These inconsistencies can be partially mitigated by adjusting parameters such as temperature, top-k, and top-p settings. Temperature controls the randomness of the modelʼs responses; a lower temperature makes the output more focused and deterministic, while a higher temperature increases variability. Top-k limits the model to considering only the top k most likely next words, thus reducing the chance of less probable words being selected. Top-p adjusts the probability mass, allowing the model to consider the smallest possible set of words whose cumulative probability exceeds a certain threshold p, thereby balancing diversity and coherence.
However, these adjustments cannot be made directly through the web interface but can be made, for instance, in the OpenAI playground. Without a nuanced understanding of the influence of these parameters, there is a risk of overestimating or underestimating LLM capabilities, potentially leading to misleading conclusions about their effectiveness in educational settings. Moreover, the variability introduced by different parameter settings may result in significant fluctuations in LLM performance, challenging the generalizability of findings to real-world applications. Future research should prioritize comprehensive analyses of the impact of LLM settings on responses to radiology exam questions to ensure accurate assessments and to optimize LLM configurations for educational use in specialized fields.
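The effect of these sampling parameters can be illustrated on a toy next-token distribution. The probabilities below are invented for illustration and do not come from any model; the helper `sample_filter` is our own sketch, assuming the common formulation in which temperature rescales log-probabilities before top-k and top-p (nucleus) filtering.

```python
# Illustrative sketch: how temperature, top-k, and top-p reshape a
# next-token probability distribution. Toy values, not model outputs.
import math

def sample_filter(probs, temperature=1.0, top_k=None, top_p=None):
    """Return the renormalized candidate distribution after applying
    temperature scaling, then top-k, then top-p (nucleus) filtering."""
    # Temperature: rescale log-probabilities; T < 1 sharpens, T > 1 flattens.
    scaled = {t: math.exp(math.log(p) / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    ranked = sorted(((t, p / total) for t, p in scaled.items()),
                    key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]      # keep only the k most likely tokens
    if top_p is not None:
        kept, cum = [], 0.0
        for t, p in ranked:          # smallest set whose cumulative prob >= top_p
            kept.append((t, p))
            cum += p
            if cum >= top_p:
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {t: p / total for t, p in ranked}

dist = {"A": 0.6, "B": 0.25, "C": 0.1, "D": 0.05}
greedyish = sample_filter(dist, temperature=0.5)  # sharpens toward "A"
nucleus = sample_filter(dist, top_p=0.8)          # keeps only "A" and "B"
```

A low temperature thus pushes the model toward the same (most probable) answer on every run, which is one way the response inconsistencies we observed could be reduced.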
Furthermore, it is essential to acknowledge certain limitations. First, we excluded questions containing images, which are typically integral to a radiology examination, due to ChatGPTʼs inability to process visual content at the time of this study. To thoroughly assess the performance of the LLMs presented in a real-world scenario, including all question types, further studies are necessary.
Second, the pass/fail threshold we applied is an approximation, as a passing score of 60% or above is normally standard for all written components, including those featuring image-based questions. Furthermore, the relatively small number of questions in each subgroup of this exploratory study limited the statistical power available for subgroup analyses.
In conclusion, our study underscores the potential of LLMs like ChatGPT as a new and readily accessible knowledge source for medical students. Even without radiology-specific pretraining, ChatGPT demonstrated remarkable performance, achieving a passing grade on a radiology examination for medical students that did not include images. The model excelled with respect to higher-order as well as lower-order thinking questions. It is crucial for radiologists to be aware of ChatGPTʼs limitations, including its tendency to confidently generate inaccurate responses. Presently, it cannot be solely relied upon for clinical practice or educational purposes. However, ChatGPT presents an exciting opportunity as a new and readily accessible knowledge source for medical students, offering them a valuable tool to supplement their learning and understanding of radiology concepts.
Declarations
We disclose that the manuscript was proofread by ChatGPT. All sections proofread by ChatGPT were meticulously reviewed. Additionally, we adhered to data protection regulations, ensuring that only anonymized data was uploaded.
Statistical analysis was performed using Python (version 3.11). ChatGPT was utilized to understand and debug the Python code and to adjust the graphics ([Fig. 4], [Fig. 5], [Fig. 6]); the diagrams themselves were created with Python code.
Informed Consent: Not applicable.
Data availability statement: All data and information used are included in this manuscript.