DOI: 10.1055/s-0044-1787974
Radiologic Decision-Making for Imaging in Pulmonary Embolism: Accuracy and Reliability of Large Language Models—Bing, Claude, ChatGPT, and Perplexity
Abstract
Background Artificial intelligence chatbots have demonstrated potential to enhance clinical decision-making and streamline health care workflows, potentially alleviating administrative burdens. However, the contribution of AI chatbots to radiologic decision-making for clinical scenarios remains insufficiently explored. This study evaluates the accuracy and reliability of four prominent Large Language Models (LLMs)—Microsoft Bing, Claude, ChatGPT 3.5, and Perplexity—in offering clinical decision support for initial imaging for suspected pulmonary embolism (PE).
Methods Open-ended (OE) and select-all-that-apply (SATA) questions were crafted, covering four variants of case scenarios of PE in line with the American College of Radiology Appropriateness Criteria. These questions were presented to the LLMs by three radiologists from different geographical regions and practice settings. The responses were evaluated based on established scoring criteria, with a maximum achievable score of 2 points for OE responses and 1 point for each correct answer in SATA questions. To enable comparative analysis, scores were normalized (score divided by the maximum achievable score).
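The normalization step described above can be sketched as follows (a minimal illustration; the particular raw scores shown are hypothetical, not data from the study):

```python
def normalize(score: float, max_score: float) -> float:
    """Normalize a raw score to the 0-1 range by dividing by the
    maximum achievable score, as described in the Methods."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return score / max_score

# Hypothetical OE response scored 1.5 out of a maximum of 2 points
oe_normalized = normalize(1.5, 2)    # 0.75
# Hypothetical SATA question with 4 correct options, of which 3 were identified
sata_normalized = normalize(3, 4)    # 0.75
```

Normalizing both question types onto the same 0 to 1 scale is what makes the OE and SATA accuracies directly comparable in the Results.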
Results In OE questions, Perplexity achieved the highest accuracy (0.83), while Claude had the lowest (0.58), with Bing and ChatGPT each scoring 0.75. For SATA questions, Bing led with an accuracy of 0.96, Perplexity was the lowest at 0.56, and both Claude and ChatGPT scored 0.6. Overall, OE questions saw higher scores (0.73) compared with SATA (0.68). There was poor agreement among radiologists' scores for OE questions (Intraclass Correlation Coefficient [ICC] = −0.067, p = 0.54), whereas agreement was strong for SATA questions (ICC = 0.875, p < 0.001).
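The interrater agreement statistic used above can be reproduced from first principles. Below is a sketch of ICC(2,1) (two-way random effects, absolute agreement, single rater, per Shrout and Fleiss); the score matrix is a hypothetical example of four case scenarios rated by three radiologists, not the study's data:

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random-effects, absolute-agreement, single-rater
    intraclass correlation. `ratings` has shape (n_targets, k_raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-scenario means
    col_means = ratings.mean(axis=0)   # per-rater means
    # Mean squares from the two-way ANOVA decomposition
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between targets
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
    ss_total = np.sum((ratings - grand) ** 2)
    mse = (ss_total - msr * (n - 1) - msc * (k - 1)) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical normalized scores: 4 scenarios (rows) x 3 radiologists (columns)
scores = np.array([
    [0.75, 0.75, 1.00],
    [0.50, 0.50, 0.50],
    [1.00, 0.75, 1.00],
    [0.25, 0.50, 0.25],
])
print(round(icc2_1(scores), 3))  # → 0.816
```

An ICC near 1 indicates the raters score each scenario consistently; the negative OE value reported above means rater disagreement exceeded what chance alone would produce.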
Conclusion The study revealed variations in accuracy across LLMs for both OE and SATA questions. Perplexity showed superior performance in OE questions, while Bing excelled in SATA questions. OE queries yielded better overall results. The current inconsistencies in LLM accuracy highlight the importance of further refinement before these tools can be reliably integrated into clinical practice, with a need for additional LLM fine-tuning and judicious selection by radiologists to achieve consistent and reliable support for decision-making.
Keywords
large language model - American College of Radiology Appropriateness Criteria - pulmonary embolism - Bing - ChatGPT - Claude - Perplexity
Publication History
Article published online:
04 July 2024
© 2024. Indian Radiological Association. This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Thieme Medical and Scientific Publishers Pvt. Ltd.
A-12, 2nd Floor, Sector 2, Noida-201301 UP, India