DOI: 10.1055/s-0044-1789618
Comparative Evaluation of Large Language Models for Translating Radiology Reports into Hindi
Funding: None.
Abstract
Objective The aim of this study was to compare the performance of four publicly available large language models (LLMs)—GPT-4o, GPT-4, Gemini, and Claude Opus—in translating radiology reports into simple Hindi.
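As a rough illustration of the translation step, the sketch below sends one invented report impression to a single model through the OpenAI Python client. The model name, prompt wording, and example impression are assumptions for illustration only and do not reproduce the prompts used in the study.

```python
# Illustrative only: the study's actual prompts are not reproduced here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical report impression, not taken from the study data.
impression = (
    "Ill-defined hypodense lesion in segment VII of the liver, "
    "suspicious for metastasis."
)

# Hypothetical prompt; the study compared two differently worded prompts.
prompt = (
    "Translate the following CT report impression into simple Hindi "
    "that a patient without medical training can understand:\n\n" + impression
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```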
Materials and Methods In this retrospective study, 100 computed tomography (CT) report impressions were gathered from a tertiary care cancer center. Reference translations of these impressions into simple Hindi were prepared by a bilingual radiology staff member in consultation with a radiologist. Two distinct prompts were used to assess each LLM's ability to translate the impressions into simple Hindi. The translated reports were reviewed by a radiologist for instances of misinterpretation, omission, and addition of fictitious information. Translation quality was measured with the Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit ORdering (METEOR), Translation Edit Rate (TER), and character F-score (CHRF) metrics, as illustrated in the sketch below. Statistical analyses were performed to compare LLM performance across prompts.
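For readers unfamiliar with these metrics, a minimal sketch follows that scores one candidate translation against one reference using the sacrebleu and nltk packages. The Hindi sentences are invented placeholders; the study's own scoring pipeline is not described at this level of detail.

```python
# Sketch of the four automatic metrics, assuming sacrebleu and nltk are
# installed (pip install sacrebleu nltk). Strings are placeholders, not study data.
from sacrebleu.metrics import BLEU, CHRF, TER
from nltk.translate.meteor_score import meteor_score  # may need nltk.download('wordnet')

reference = "यकृत में एक संदिग्ध गांठ दिखाई दी है।"  # human reference translation
candidate = "यकृत में एक संदिग्ध गांठ दिखी है।"      # LLM-generated translation

hyps = [candidate]
refs = [[reference]]  # sacrebleu expects one list of references per reference stream

# effective_order avoids zero scores on single short sentences in this toy example
bleu = BLEU(effective_order=True).corpus_score(hyps, refs).score  # higher is better
chrf = CHRF().corpus_score(hyps, refs).score                      # higher is better
ter = TER().corpus_score(hyps, refs).score                        # lower is better

# nltk's METEOR works on pre-tokenized text; simple whitespace split shown here
meteor = meteor_score([reference.split()], candidate.split())

print(f"BLEU={bleu:.1f}  CHRF={chrf:.1f}  TER={ter:.1f}  METEOR={meteor:.3f}")
```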
Results Radiologist evaluation of the 800 LLM-generated translated report impressions (100 impressions × 4 models × 2 prompts) found nine instances of misinterpretation and two instances of omission of information. For prompt 1, Gemini outperformed the other models in BLEU (p < 0.001) and METEOR (p = 0.001) scores and was superior to GPT-4o and GPT-4 in TER and CHRF (p < 0.001), but comparable to Claude (p = 0.501 for TER; p = 0.90 for CHRF). For prompt 2, GPT-4o outperformed all other models on all metrics (p < 0.001). Prompt 2 yielded better BLEU, METEOR, and CHRF scores (p < 0.001), while prompt 1 yielded a better TER score (p < 0.001).
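The abstract does not state which statistical tests produced these p-values. The sketch below shows one plausible approach, comparing paired per-report scores from two models with a Wilcoxon signed-rank test; the score arrays are synthetic stand-ins for the study's data.

```python
# Hypothetical paired comparison of two models' per-report metric scores;
# the study's exact statistical tests are not specified in the abstract.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Placeholder sentence-level BLEU scores for 100 report impressions per model.
scores_gemini = rng.uniform(20, 60, size=100)
scores_gpt4o = scores_gemini + rng.normal(2.0, 5.0, size=100)

stat, p = wilcoxon(scores_gpt4o, scores_gemini)  # paired, nonparametric
print(f"Wilcoxon statistic={stat:.1f}, p={p:.4f}")
```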
Conclusion While each LLM's effectiveness varied with prompt wording, all models demonstrated potential in translating and simplifying radiology report impressions.
Ethical Approval and Patient Consent
Ethical approval was obtained from the institutional review board. The requirement for patient consent was waived by the ethics committee as it was not applicable to this study.
Publication History
Article published online:
04 September 2024
© 2024. Indian Radiological Association. This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)