CC BY-NC-ND 4.0 · Methods Inf Med 2022; 61(S 02): e51-e63
DOI: 10.1055/a-1862-0421
Original Article

A Systematic Approach to Configuring MetaMap for Optimal Performance

Xia Jing
1   Department of Public Health Sciences, College of Behavioral, Social and Health Sciences, Clemson University, Clemson, South Carolina, United States
Akash Indani
2   School of Computing, College of Engineering, Computing and Applied Sciences, Clemson University, Clemson, South Carolina, United States
Nina Hubig
2   School of Computing, College of Engineering, Computing and Applied Sciences, Clemson University, Clemson, South Carolina, United States
Hua Min
3   Department of Health Administration and Policy, College of Health and Human Services, George Mason University, Fairfax, Virginia, United States
Yang Gong
4   School of Biomedical Informatics, The University of Texas Health Sciences Center at Houston, Houston, Texas, United States
James J. Cimino
5   Informatics Institute, The University of Alabama at Birmingham, Birmingham, Alabama, United States
Dean F. Sittig
4   School of Biomedical Informatics, The University of Texas Health Sciences Center at Houston, Houston, Texas, United States
Lior Rennert
1   Department of Public Health Sciences, College of Behavioral, Social and Health Sciences, Clemson University, Clemson, South Carolina, United States
David Robinson
6   Independent Consultant, Cumbria, United Kingdom
Paul Biondich
7   Department of Pediatrics, Clem McDonald Biomedical Informatics Center, Regenstrief Institute, Indiana University School of Medicine, Indianapolis, Indiana, United States
Adam Wright
8   Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, United States
Christian Nøhr
9   Department of Planning, Faculty of Engineering, Aalborg University, Aalborg, Denmark
Timothy Law
10   Ohio Musculoskeletal and Neurologic Institute, Ohio University, Athens, Ohio, United States
Arild Faxvaag
11   Department of Neuromedicine and Movement Science, Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology, Trondheim, Norway
Ronald Gimbel
1   Department of Public Health Sciences, College of Behavioral, Social and Health Sciences, Clemson University, Clemson, South Carolina, United States
Funding This work is supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award No. R01GM138589 and partially under P20 GM121342. We acknowledge Clemson University for the generous allotment of computing time on the Palmetto Cluster.

Abstract

Background MetaMap is a valuable tool for processing biomedical texts to identify concepts. Although MetaMap is highly configurable, configuration decisions are not straightforward.

Objective To develop a systematic, data-driven methodology for configuring MetaMap for optimal performance.

Methods MetaMap, a word2vec model, and a phrase model were used to build a pipeline. For unsupervised training, the phrase and word2vec models used abstracts related to clinical decision support as input. During testing, MetaMap was configured with the default options, with one behavior option, and with two behavior options. For each configuration, cosine and soft cosine similarity scores between identified entities and gold-standard terms were computed for 40 annotated abstracts (422 sentences). The similarity scores were used to calculate and compare the overall percentages of exact matches, similar matches, and missing gold-standard terms among the abstracts for each configuration. The results were manually spot-checked. The precision, recall, and F-measure (β = 1) were calculated.
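
The difference between the two similarity scores named above can be illustrated with a minimal sketch. It is not the authors' reference code: the three-term vocabulary, the term-term similarity matrix `SIM`, and the two bag-of-words vectors are hypothetical values chosen for illustration. In the pipeline described above, term similarities would presumably be derived from the trained word2vec embeddings.

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two equal-length term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def soft_cosine(a, b, sim):
    """Soft cosine similarity; sim[i][j] is the similarity of vocabulary terms i and j."""
    def bilinear(u, v):
        # u^T * sim * v, written out as an explicit double sum
        return sum(u[i] * sim[i][j] * v[j]
                   for i in range(len(u)) for j in range(len(v)))

    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0


# Hypothetical vocabulary: ["alert", "reminder", "dose"].
# The similarity values are made up; they are NOT taken from the study.
SIM = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
entity = [1, 0, 1]  # identified entity: "alert dose"
gold = [0, 1, 1]    # gold-standard term: "reminder dose"

print("cosine:     ", round(cosine(entity, gold), 3))
print("soft cosine:", round(soft_cosine(entity, gold, SIM), 3))
```

In this toy case, plain cosine credits only the shared term, whereas soft cosine also credits the related pair "alert" and "reminder", so it scores the two phrases as closer. With an identity matrix in place of `SIM`, the two measures coincide.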

Results The percentages of exact matches and missing gold-standard terms were 0.6–0.79 and 0.09–0.3 for one behavior option, and 0.56–0.8 and 0.09–0.3 for two behavior options, respectively. The percentages of exact matches and missing terms for soft cosine similarity scores exceeded those for cosine similarity scores. The average precision, recall, and F-measure were 0.59, 0.82, and 0.68 for exact matches, and 1.00, 0.53, and 0.69 for missing terms, respectively.
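
For readers who want to recompute such metrics, the standard definitions of precision, recall, and F-measure can be sketched as follows. The function name `precision_recall_f` and the counts in the usage example are hypothetical and are not taken from the study.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from true-positive, false-positive,
    and false-negative counts. With beta = 1 this is the balanced F-measure."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    f = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, f


# Hypothetical counts for one abstract (illustrative only).
p, r, f = precision_recall_f(tp=8, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

Because the reported figures are averages, substituting the average precision and recall into this formula need not reproduce the average F-measure exactly.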

Conclusion We demonstrated a systematic approach that provides objective and accurate evidence to guide MetaMap configuration for optimal performance. Combining objective evidence with the current practice of relying on principles, experience, and intuition outperforms either strategy alone when configuring MetaMap. Our methodology, reference code, measurements, results, and workflow are valuable references for optimizing and configuring MetaMap.


Publication History

Accepted Manuscript online: 25 May 2022
Article published online: 19 September 2022

© 2022. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

 