CC BY 4.0 · Methods Inf Med
DOI: 10.1055/a-2521-4372
Original Article

Deciphering Abbreviations in Malaysian Clinical Notes Using Machine Learning

Ismat Mohd Sulaiman (1, 2), Sameem Abdul Kareem (3), Abdul Aziz Latip (4)

1   Health Informatics Centre, Planning Division, Ministry of Health Malaysia, Putrajaya, Malaysia
2   Academy of Sciences, Kuala Lumpur, Malaysia
3   Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Wilayah Persekutuan, Malaysia
4   MIMOS Berhad, Kuala Lumpur, Malaysia

Abstract

Objective This study presents the first Malaysian machine learning model to detect and disambiguate abbreviations in clinical notes. The model is designed to be incorporated into MyHarmony, a natural language processing system that extracts clinical information for health care management. It relies on word embeddings so that it remains feasible within the constraints of low-resource settings, where it is intended for secondary analysis rather than real-time use.

Methods A Malaysian clinical embedding, based on the Word2Vec model, was developed using 29,895 electronic discharge summaries. The embedding was compared against a conventional rule-based approach and a FastText embedding on two tasks: abbreviation detection and abbreviation disambiguation. Machine learning classifiers were applied to assess performance.
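
As a rough illustration of how an embedding can support disambiguation, the sketch below averages the vectors of the words around an abbreviation and picks the sense whose centroid is closest by cosine similarity. It is not the authors' pipeline: the nearest-centroid rule stands in for the machine learning classifiers named above, and the three-dimensional vectors, the abbreviation "ca", its two senses, and all function names are hypothetical.

```python
import math

# Toy 3-dimensional vectors (hypothetical values, not from the paper's embedding).
EMBEDDING = {
    "biopsy": [0.8, 0.2, 0.1],
    "tumour": [0.9, 0.1, 0.0],
    "metastasis": [0.9, 0.0, 0.2],
    "serum": [0.1, 0.9, 0.0],
    "level": [0.0, 0.8, 0.2],
    "low": [0.2, 0.7, 0.1],
}

# Hypothetical sense centroids for the abbreviation "ca".
SENSE_CENTROIDS = {
    "cancer": [0.85, 0.15, 0.05],
    "calcium": [0.05, 0.85, 0.10],
}


def context_vector(tokens, index, window=2):
    """Average the vectors of known words within `window` tokens of `index`."""
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    vectors = [
        EMBEDDING[tok]
        for i, tok in enumerate(tokens[lo:hi], start=lo)
        if i != index and tok in EMBEDDING
    ]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def disambiguate(tokens, index, centroids=SENSE_CENTROIDS):
    """Return the sense whose centroid is most similar to the context vector."""
    ctx = context_vector(tokens, index)
    return max(centroids, key=lambda sense: cosine(ctx, centroids[sense]))


print(disambiguate(["biopsy", "confirmed", "ca", "with", "metastasis"], 2))
print(disambiguate(["serum", "ca", "level", "low"], 1))
```

In a setup closer to the one described, the context vector would presumably be passed as a feature vector to a trained classifier rather than compared against hand-set centroids.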

Results The Malaysian clinical word embedding covered 7 million word tokens and a vocabulary of 24,352 unique words, with 100-dimensional vectors. For abbreviation detection, the Decision Tree classifier augmented with the Malaysian clinical embedding showed the best performance (F-score of 0.9519). For abbreviation disambiguation, the classifier with the Malaysian clinical embedding had the best performance for most of the abbreviations (F-score of 0.9903).
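
For readers unfamiliar with the metric, the sketch below shows how an F-score can be computed for a binary detection task. The abstract does not say which variant was reported, so this assumes the balanced F1 (the harmonic mean of precision and recall) on the positive class. The labels are invented for illustration and the function name is hypothetical.

```python
def f_score(gold, predicted, positive=1):
    """Balanced F1 on the `positive` class (assumed variant, see lead-in)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented labels: 1 = token is an abbreviation, 0 = it is not.
gold = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 1]
print(f_score(gold, predicted))  # 3 TP, 1 FP, 1 FN
```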

Conclusion Despite its smaller vocabulary and lower dimensionality, our local clinical word embedding performed better than the larger nonclinical FastText embedding. Word embeddings combined with simple machine learning algorithms can decipher abbreviations well. This approach also requires fewer computational resources and is suitable for implementation in low-resource settings such as Malaysia. Integrating this model into MyHarmony will improve recognition of clinical terms, thus improving the information generated for monitoring Malaysian health care services and for policymaking.



Publication History

Received: 30 August 2024

Accepted: 15 January 2025

Accepted Manuscript online:
22 January 2025

Article published online:
11 February 2025

© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

 