Thromb Haemost
DOI: 10.1055/a-2796-1975
Original Article

Optimizing the Accuracy of Natural Language Processing Tools for Pulmonary Embolism Detection Through Integration with Claims Data: The PE-EHR+ Study

Authors

  • Sina Rashedi

    1   Thrombosis Research Group, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
  • Syed Bukhari

    2   Division of Cardiology, Johns Hopkins University, Baltimore, United States (Ringgold ID: RIN1466)
  • Darsiya Krishnathasan

    1   Thrombosis Research Group, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
  • Candrika D Khairani

    1   Thrombosis Research Group, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
  • Antoine Bejjani

    1   Thrombosis Research Group, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
    3   Department of Internal Medicine, University of Pittsburgh Medical Center Health System, Pittsburgh, United States (Ringgold ID: RIN6595)
  • Mariana B. Pfeferman

    1   Thrombosis Research Group, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
  • Julia Malejczyk

    1   Thrombosis Research Group, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
  • Mehrdad Zarghami

    4   Medicine, Jamaica Hospital Medical Center, Jamaica, United States (Ringgold ID: RIN4132)
  • Eric Secemsky

    5   Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital Department of Medicine, Boston, United States (Ringgold ID: RIN370908)
    6   Department of Medicine, Beth Israel Deaconess Medical Center Richard A and Susan F Smith Center for Outcomes Research in Cardiology, Boston, United States (Ringgold ID: RIN569904)
    7   Harvard Medical School, Boston, United States (Ringgold ID: RIN1811)
  • Farbod N. Rahaghi

    8   Division of Pulmonary and Critical Care, Brigham and Women's Hospital Department of Medicine, Boston, United States (Ringgold ID: RIN370908)
  • Mohamad Hussain

    9   Division of Vascular and Endovascular Surgery and the Center for Surgery and Public Health, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
  • Hamid Mojibian

    10   Department of Radiology and Biomedical Imaging, Yale University, New Haven, United States (Ringgold ID: RIN5755)
  • Samuel Goldhaber

    1   Thrombosis Research Group, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
    11   Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
  • David Jiménez

    12   Respiratory Division, Medicine Department, Ramón y Cajal Hospital, IRYCIS and Alcalá de Henares University, Madrid, Spain
    13   Medicine Department, Universidad de Alcalá, Alcala de Henares, Spain (Ringgold ID: RIN16720)
    14   CIBER Enfermedades Respiratorias (CIBERES), Instituto de Salud Carlos III, Madrid, Spain (Ringgold ID: RIN38176)
  • Manuel Monreal

    15   Catedra de Enfermedad Tromboembolica, Universidad Catolica San Antonio de Murcia Facultad de Ciencias de la Salud, Barcelona, Spain (Ringgold ID: RIN334252)
  • Richard Yang

    16   Yale School of Medicine Center for Outcomes Research & Evaluation, New Haven, United States (Ringgold ID: RIN458548)
  • Li Zhou

    16   Yale School of Medicine Center for Outcomes Research & Evaluation, New Haven, United States (Ringgold ID: RIN458548)
  • Gregory Piazza

    1   Thrombosis Research Group, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
    11   Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
  • Harlan Krumholz

    16   Yale School of Medicine Center for Outcomes Research & Evaluation, New Haven, United States (Ringgold ID: RIN458548)
    17   Section of Cardiovascular Medicine, Yale School of Medicine Department of Internal Medicine, New Haven, United States (Ringgold ID: RIN156178)
    18   Department of Health Policy and Management, Yale School of Public Health Department of Health Policy & Management, New Haven, United States (Ringgold ID: RIN196228)
  • Liqin Wang

    5   Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital Department of Medicine, Boston, United States (Ringgold ID: RIN370908)
  • Behnood Bikdeli

    1   Thrombosis Research Group, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
    11   Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, United States (Ringgold ID: RIN1861)
    19   Yale School of Public Health Department of Health Policy & Management, New Haven, United States (Ringgold ID: RIN196228)

Supported by: American Heart Association 938814

Background Rule-based natural language processing (NLP) tools can identify pulmonary embolism (PE) from radiology reports. However, their external validity remains uncertain.

Methods In this cross-sectional study, 1,712 hospitalized patients (with and without PE) at Mass General Brigham (MGB) hospitals (2016–2021) were analyzed. Two previously published NLP algorithms were applied to radiology reports to identify PE. Chart review by two physicians was the reference standard. We tested three approaches: (A) NLP applied to all patients; (B) NLP limited to radiology reports of patients with principal or secondary International Classification of Diseases 10th revision (ICD-10) PE discharge codes; and (C) NLP applied to patients with PE discharge codes or a Present-on-Admission (POA) indicator (“Y”) for PE. In Approaches B and C, all other patients were assumed PE-negative to minimize NLP false positives. Weighted estimates were derived from the MGB hospitalized cohort (n=381,642) to calculate F1 scores, defined as the harmonic mean of sensitivity and positive predictive value (PPV).

Results In Approach A, both NLP tools showed high sensitivity (82.5%, 93.0%) and specificity (98.9%, 98.7%) but low PPV (60.3%, 59.6%). Approach B improved PPV (95.2%, 94.9%) but reduced sensitivity (74.1%, 76.2%), while Approach C preserved both high sensitivity (82.5%, 93.0%) and PPV (95.6%, 95.8%). Approach C demonstrated the best performance, yielding significantly higher F1 scores for both NLP tools (88.6%, 94.4%) compared with Approach A (69.7%, 72.6%) and Approach B (83.3%, 84.5%) (P<0.001).

Conclusion The accuracy of PE detection improves when rule-based NLP algorithms are operationalized using administrative claims data in addition to radiology reports.
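
As an illustration only, the following Python sketch shows how an F1 score follows from sensitivity and PPV as their harmonic mean, and how claims-based gating of an NLP result might be expressed. It is not the study's code: the function names, the argument names, and the encoding of the POA indicator as the string "Y" are assumptions made for this example, and the weighting to the full MGB cohort is not reproduced.

```python
def f1_from_sensitivity_ppv(sensitivity, ppv):
    """Harmonic mean of sensitivity and PPV (both on the same scale, e.g. percent)."""
    if sensitivity + ppv == 0:
        return 0.0
    return 2 * sensitivity * ppv / (sensitivity + ppv)


def gated_nlp_positive(nlp_positive, has_pe_discharge_code, poa_indicator=None, approach="C"):
    """Hypothetical gating rule: outside the eligible group, patients are assumed PE-negative.

    Approach A: the NLP result is used for every patient.
    Approach B: the NLP result counts only with a PE discharge code.
    Approach C: the NLP result counts with a PE discharge code or a POA indicator of "Y".
    """
    if approach == "A":
        eligible = True
    elif approach == "B":
        eligible = has_pe_discharge_code
    elif approach == "C":
        eligible = has_pe_discharge_code or poa_indicator == "Y"
    else:
        raise ValueError(f"unknown approach: {approach}")
    return bool(eligible and nlp_positive)


# Sensitivity and PPV pairs (percent) as reported in the abstract for the two NLP tools.
reported = {
    "A": [(82.5, 60.3), (93.0, 59.6)],
    "B": [(74.1, 95.2), (76.2, 94.9)],
    "C": [(82.5, 95.6), (93.0, 95.8)],
}

for approach, pairs in reported.items():
    scores = [round(f1_from_sensitivity_ppv(s, p), 1) for s, p in pairs]
    print(approach, scores)
```

Applied to the reported sensitivity and PPV pairs, the harmonic mean gives values that round to the F1 scores stated in the Results.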



Publication History

Received: 20 August 2025

Accepted after revision: 23 January 2026

Accepted Manuscript online: 28 January 2026

© . Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany