Methods Inf Med 2009; 48(06): 546-551
DOI: 10.3414/ME0615
Original Articles
Schattauer GmbH

Automated Identification of Diagnosis and Co-morbidity in Clinical Records

C. Cano (1), A. Blanco (1), L. Peshkin (2)
1   University of Granada, Department of Computer Science and A. I., Granada, Spain
2   Harvard University, Department of Systems Biology, Cambridge, MA, USA

Publication History

received: 14 November 2008

accepted: 20 May 2009

Publication Date: 17 January 2018 (online)

Summary

Objectives: Automated understanding of clinical records is a challenging task that involves various legal and technical difficulties. Clinical free text is inherently redundant, unstructured, and full of acronyms, abbreviations and domain-specific language, which makes it hard to mine automatically. Much effort in the field focuses on creating specialized ontologies, lexicons and heuristics based on expert knowledge of the domain. However, such ad hoc solutions generalize poorly across diseases or diagnoses. This paper presents a successful approach to rapid prototyping of a diagnosis classifier based on a popular computational linguistics platform.

Methods: The corpus consists of several hundred full-length discharge summaries provided by Partners Healthcare. The goal is to identify a diagnosis and assign co-morbidity. Our approach is based on the rapid implementation of a logistic regression classifier using an existing toolkit, LingPipe (http://alias-i.com/lingpipe). We implement and compare three different classifiers. The baseline approach uses character 5-grams as features. The second approach uses a bag-of-words representation enriched with a small additional set of features. The third approach reduces the feature set to the most informative features according to their information content.
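The two feature ideas named in the Methods can be illustrated with a minimal sketch. It is not the authors' LingPipe implementation: the function names, the lowercasing and whitespace normalization, and the use of information gain as the "information content" score are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=5):
    """Count overlapping character n-grams (n=5 mirrors the baseline)."""
    # Assumption: lowercase and collapse whitespace runs before slicing.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def binary_entropy(positives, total):
    """Entropy in bits of a binary label with `positives` out of `total`."""
    if total == 0 or positives == 0 or positives == total:
        return 0.0
    p = positives / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(docs, labels, feature):
    """Information gain of a feature's presence with respect to 0/1 labels.

    docs is a list of feature collections (sets or Counters);
    labels is a parallel list of 0/1 class labels.
    """
    n = len(docs)
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    prior = binary_entropy(sum(labels), n)
    conditional = (
        len(present) / n * binary_entropy(sum(present), len(present))
        + len(absent) / n * binary_entropy(sum(absent), len(absent))
    )
    return prior - conditional


def top_features(docs, labels, k):
    """Return the k features with the highest information gain."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary, key=lambda f: information_gain(docs, labels, f),
                    reverse=True)
    return ranked[:k]
```

In such a setup, each discharge summary would be turned into a feature collection (for example with `char_ngrams`), and `top_features` would pick the subset handed to the classifier.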

Results: The proposed systems achieve high performance on the task (average micro-averaged F-measure of 0.92). We discuss the relative merits of the three classifiers. Supplementary material with detailed results is available at: http://decsai.ugr.es/~ccano/LR/supplementary_material/
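For readers unfamiliar with the metric, a micro-averaged F-measure pools true positives, false positives and false negatives over all classes before computing precision and recall. The sketch below shows the standard definition; the function name and input layout are illustrative, and the counts in the usage line are made up rather than taken from the paper.

```python
def micro_f1(counts):
    """Micro-averaged F1 from a list of (tp, fp, fn) tuples, one per class."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    if tp == 0:
        # No correct predictions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-class counts, for illustration only.
score = micro_f1([(8, 2, 1), (1, 0, 1)])
```

Because the counts are pooled, frequent classes dominate the micro-average, whereas a macro-average would weight every class equally.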

Conclusions: We show that our methodology for rapid prototyping of a domain-unaware system is effective for building an accurate classifier for clinical records.

 