DOI: 10.1055/a-2594-7085
Evaluating the Diagnostic Accuracy of ChatGPT-4.0 for Classifying Multimodal Musculoskeletal Masses: A Comparative Study with Human Raters
Bewertung der diagnostischen Genauigkeit von ChatGPT-4.0 bei der Klassifikation multimodaler muskuloskelettaler Läsionen: eine vergleichende Studie mit menschlichen Auswertern
Abstract
Purpose
Novel artificial intelligence tools have the potential to significantly enhance productivity in medicine, while also maintaining or even improving treatment quality. In this study, we aimed to evaluate the current capability of ChatGPT-4.0 to accurately interpret multimodal musculoskeletal tumor cases.
Materials and Methods
We created 25 cases, each containing images from X-ray, computed tomography, magnetic resonance imaging, or scintigraphy. ChatGPT-4.0 was tasked with classifying each case using a six-option, two-choice question, where both a primary and a secondary diagnosis were allowed. For performance evaluation, human raters also assessed the same cases.
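The scoring scheme described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the `accuracy` helper and the case data below are hypothetical, showing only how a six-option, two-choice answer can be scored with and without credit for the secondary diagnosis.

```python
# Illustrative sketch (not the study's actual code) of scoring a
# six-option, two-choice question where a primary and a secondary
# diagnosis are allowed per case.

def accuracy(cases, count_secondary):
    """Fraction of cases whose ground truth matches the primary
    diagnosis, or optionally the secondary diagnosis as well."""
    hits = 0
    for truth, primary, secondary in cases:
        if truth == primary or (count_secondary and truth == secondary):
            hits += 1
    return hits / len(cases)

# Hypothetical answers: (ground truth, primary guess, secondary guess)
cases = [
    ("osteosarcoma", "osteosarcoma", "metastasis"),
    ("enchondroma", "chondrosarcoma", "enchondroma"),
    ("metastasis", "lymphoma", "osteomyelitis"),
]

print(accuracy(cases, count_secondary=False))  # primary diagnosis only
print(accuracy(cases, count_secondary=True))   # primary or secondary
```

Counting the secondary diagnosis can only raise the score, which mirrors why the performance gap narrows in the second evaluation setting.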
Results
When only the primary diagnosis was taken into account, human raters were nearly twice as accurate as ChatGPT-4.0 (87% vs. 44%). However, when secondary diagnoses were also considered, the performance gap shrank substantially (accuracy: 94% vs. 71%). A power analysis based on Cohen's w confirmed the adequacy of the sample size (n = 25).
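A power check of this kind can be sketched with statsmodels, which the study's tooling suggests. The null and observed proportions below are illustrative placeholders, not the study's actual inputs; only the sample size (n = 25) is taken from the text.

```python
# Hedged sketch of a Cohen's-w power analysis for a chi-square
# goodness-of-fit test. The proportions are illustrative, not the
# study's data; n = 25 matches the reported sample size.
from statsmodels.stats.gof import chisquare_effectsize
from statsmodels.stats.power import GofChisquarePower

# Cohen's w between a hypothetical null distribution (chance-level
# correct/incorrect split) and a hypothetical observed split.
p_null = [0.5, 0.5]
p_observed = [0.87, 0.13]
w = chisquare_effectsize(p_null, p_observed)

# Achieved power at the study's sample size of 25 cases, alpha = 0.05,
# two bins (correct vs. incorrect), i.e. one degree of freedom.
power = GofChisquarePower().solve_power(
    effect_size=w, nobs=25, alpha=0.05, n_bins=2
)
print(f"Cohen's w = {w:.2f}, achieved power = {power:.2f}")
```

With a large effect size, even 25 cases yield power well above the conventional 0.8 threshold, which is consistent with the adequacy claim above.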
Conclusion and Key Points
The tested artificial intelligence tool demonstrated lower performance than human raters. Considering factors such as speed, constant availability, and potential future improvements, it appears plausible that artificial intelligence tools could serve as valuable assistance systems for doctors in future clinical settings.
Key Points
- ChatGPT-4.0 classifies musculoskeletal cases using multimodal imaging inputs.
- Human raters outperform AI in primary diagnosis accuracy by a factor of nearly two.
- Including secondary diagnoses improves AI performance and narrows the gap.
- AI demonstrates potential as an assistive tool in future radiological workflows.
- Power analysis confirms the robustness of the study findings at the current sample size.
Citation Format
Bosbach WA, Schoeni L, Beisbart C et al. Evaluating the Diagnostic Accuracy of ChatGPT-4.0 for Classifying Multimodal Musculoskeletal Masses: A Comparative Study with Human Raters. Rofo 2025; DOI: 10.1055/a-2594-7085
Keywords
Clinical Decision Support - Diagnostic Accuracy - Artificial Intelligence - Musculoskeletal Tumors
Publication History
Received: 09 January 2025
Accepted after revision: 18 April 2025
Article published online: 03 June 2025
© 2025. Thieme. All rights reserved.
Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany