Appl Clin Inform
DOI: 10.1055/a-2491-3872
Research Article

Extracting International Classification of Diseases Codes from Clinical Documentation using Large Language Models

Ashley Simmons, Kullaya Takkavatakarn, Megan McDougal, Brian Dilcher¹, Jami Pincavitch², Lukas Meadows, Justin Kauffman, Eyal Klang, Rebecca Wig, Gordon Stephen Smith³, Ali Soroush, Robert Freeman, Donald Apakama, Alexander Charney, Roopa Kohli-Seth⁴,⁵, Girish Nadkarni, Ankit Sakhuja

Author Affiliations
1 Emergency Medicine, West Virginia University School of Medicine, Morgantown, United States
2 Orthopaedics, West Virginia University Health Sciences Center, Morgantown, United States
3 Epidemiology, West Virginia University Health Sciences Center, Morgantown, United States
4 Surgery - Division of Critical Care, Mount Sinai Hospital / Icahn School of Medicine at Mount Sinai, New York, United States
5 Mount Sinai School of Medicine, New York, United States
Supported by: National Institute of Diabetes and Digestive and Kidney Diseases K08DK131286

Background: Large language models (LLMs) have shown promise in various professional fields, including medicine and law. However, their performance in highly specialized tasks, such as extracting ICD-10-CM codes from patient notes, remains underexplored.

Objective: The primary objective was to evaluate and compare the performance of ICD-10-CM code extraction by different LLMs with that of a human coder.

Methods: We evaluated the performance of six LLMs (GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b) in extracting ICD-10-CM codes against a human coder. We used deidentified inpatient notes from American Health Information Management Association (AHIMA) VLab authentic patient cases for this study. We calculated percent agreement and Cohen's kappa values to assess the agreement between each LLM and the human coder. We then identified reasons for discrepancies in code extraction by the LLMs in a 10% random subset.

Results: Among 50 inpatient notes, the human coder extracted 165 unique ICD-10-CM codes. The LLMs extracted significantly more unique ICD-10-CM codes than the human coder, with Llama 2-70b extracting the most (658) and Gemini Advanced the fewest (221). GPT-4 achieved the highest percent agreement with the human coder at 15.2%, followed by Claude 3 (12.7%) and GPT-3.5 (12.4%). Cohen's kappa values indicated minimal to no agreement, ranging from -0.02 to 0.01. When the comparison was restricted to the primary diagnosis, Claude 3 achieved the highest percent agreement (26%) and kappa value (0.25). Reasons for discrepancies varied among the LLMs and included extraction of codes for diagnoses not confirmed by providers (60% with GPT-4), extraction of non-specific codes (25% with GPT-3.5), extraction of codes for signs and symptoms despite the presence of a more specific diagnosis (22% with Claude 2.1), and hallucinations (35% with Claude 2.1).

Conclusions: Current LLMs perform poorly at extracting ICD-10-CM codes from inpatient notes when compared against a human coder.
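The abstract reports percent agreement and Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance, without detailing the computation. Below is a minimal Python sketch of one plausible construction, assuming per-note code sets from each rater, a pooled overlap definition of percent agreement, and scikit-learn's cohen_kappa_score applied to binary presence/absence indicators; the function name compare_coders and both metric definitions are illustrative assumptions, and the paper's exact formulas may differ.

    from sklearn.metrics import cohen_kappa_score

    def compare_coders(notes_human, notes_llm):
        """Compare per-note ICD-10-CM code sets from a human coder and an LLM.

        notes_human and notes_llm are parallel lists of sets of codes,
        one set per inpatient note. Both metric definitions here are
        illustrative assumptions, not the paper's published formulas.
        """
        # Pooled percent agreement: codes both raters extracted, as a
        # share of all distinct codes extracted by either rater.
        human_all = set().union(*notes_human)
        llm_all = set().union(*notes_llm)
        percent_agreement = 100 * len(human_all & llm_all) / len(human_all | llm_all)

        # Cohen's kappa over binary presence/absence of every candidate
        # code in every note, pooled across the corpus.
        universe = sorted(human_all | llm_all)
        y_human = [code in note for note in notes_human for code in universe]
        y_llm = [code in note for note in notes_llm for code in universe]
        kappa = cohen_kappa_score(y_human, y_llm)
        return percent_agreement, kappa

    # Toy example: two notes with a mix of matching and non-matching codes.
    human = [{"J18.9", "N17.9"}, {"I10"}]
    llm = [{"J18.9", "R50.9"}, {"I10", "E11.9"}]
    print(compare_coders(human, llm))  # 40.0% agreement on this toy data

Pooling the binary indicators across all notes before computing kappa avoids the degenerate per-note case in which one rater assigns every candidate code, where kappa is undefined.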



Publication History

Received: 06 June 2024

Accepted after revision: 27 November 2024

Accepted Manuscript online: 28 November 2024

© Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany