Keywords
nursing notes - i2b2 - discharge summaries - de-identification - transformers - NLP
Background and Significance
The National Institutes of Health promotes the sharing of scientific data to accelerate discovery, enable external validation of research results, increase accessibility to high-value datasets, and promote data reuse and replicability.[1] Meeting these expectations while protecting patient privacy requires a deeper understanding of the types and distributions of identifiable data within different unstructured note types in electronic health records (EHRs).
The widespread adoption of EHRs and advanced health information technologies in the United States has led to the sharing of increasing volumes of complex health data from various sources, and some of these data have become openly available to drive reproducible, open science.[2] Most EHRs contain both structured data (well-formatted tables) and unstructured data (free text, images, etc.), providing valuable information about patients and populations to facilitate knowledge discovery and the development of innovative clinical tools.[3] Roughly 80% of the health data stored in EHRs remain unstructured.[4] Compared with structured EHR data, these unstructured data are understudied because of methodological challenges in processing and analyzing narrative text that only recent methods have begun to address.[5] Sharing of unstructured data (i.e., narrative notes) enables comprehensive analysis of current research findings and methods to ensure reproducibility, reliability, and accountability.[6] Our team has demonstrated the unique value and predictive signals embedded within nursing notes,[7] [8] [9] but nursing notes are vastly understudied compared with physician notes, possibly in part due to the lack of access to curated, de-identified datasets of nursing notes. For privacy protection, de-identification of clinical notes per the Health Insurance Portability and Accountability Act (HIPAA) is required before making a dataset openly available or directly sharing datasets across health care institutions and organizations.[10] [11] De-identification involves removing all protected health information (PHI) from health records to protect patient privacy while enabling secondary use of the data.[11] Under the HIPAA specification, two methods for de-identifying PHI are suggested: "Expert Determination," in which domain experts certify that the risk of any anticipated recipient using the available information to identify an individual is small, and the "Safe Harbor" method, under which clinical records are considered de-identified when all 18 types of PHI ([Table 1]) are removed. Biometric identifiers and photographs were not applicable in our study. Practical and efficient methods to reliably remove all PHI types from notes are currently lacking.
Table 1
Types of PHI defined by the HIPAA Safe Harbor legislation (items 16 and 17 were not applicable in our dataset; our study focused on the remaining PHI categories)

PHI category and examples (if applicable):
1. Names: patient and family member names, health care providers, etc.
2. All geographic subdivisions smaller than a state: city, street address, county, zip code, etc.
3. Dates
4. Telephone numbers
5. Fax numbers
6. Email addresses
7. Social Security numbers
8. Medical record numbers
9. Health plan numbers
10. Account numbers
11. Certificate or license numbers
12. Vehicle identification
13. Device identification or serial numbers
14. Universal resource locators (URLs)
15. Internet Protocol addresses
16. Biometric identifiers
17. Full-face photographs and comparable images
18. Any other unique identifying number, characteristic, or code
Abbreviation: PHI, protected health information.
Automated methods for "Safe Harbor" de-identification are preferred given the large volume of notes. Currently, there are two main approaches to automated de-identification of clinical notes: (1) rule-based/pattern-matching approaches[12] [13] [14] [15] [16] [17] that apply specialized dictionaries and complex hand-crafted regular expression rules, and (2) machine learning (ML) techniques (support vector machines, decision trees, conditional random fields, etc.).[18] [19] [20] [21] [22] ML techniques generally outperform rule-based methods but involve extensive feature engineering and time-consuming annotation.[23] [24] Recently, some studies have tackled this problem using artificial neural networks (long short-term memory architectures, bidirectional transformer models).[23] [25] [26] Other studies[27] [28] [29] have explored hybrid models (combinations of ML and rule-based methods), reporting high average recall and accuracy (above 90%) but demanding an exhaustive feature extraction process and raising questions about their generalizability to external sources.[30] However, to make data publicly accessible, all PHI must be removed as mandated by law, which essentially requires perfect performance from a system.
Despite the substantial advances made so far, most research in this domain has neglected the distribution of PHI characteristics across note types, which can inform the selection of suitable de-identification algorithms. Clinical notes serve different purposes (e.g., billing, decision-making, care assessment)[31] and exhibit varying content and structure, which implies the need for tailored de-identification algorithms and evaluation criteria. Much of the literature[19] [21] [29] leveraged specific corpora, such as i2b2 (Informatics for Integrating Biology and the Bedside) discharge summaries,[32] for training and evaluation but overlooked the diversity of EHR note types. This may hinder insights into algorithm adaptability across note types and datasets. Understanding PHI patterns across note types, and examining whether the frequencies of different PHI types are correlated within the same note type, may aid the selection and refinement of de-identification models as well as the development of evaluation schemas. In the domain of natural language processing (NLP), pretrained transformer models have emerged as foundational tools, achieving state-of-the-art performance on various NLP tasks.[33] One widely recognized transformer is RoBERTa, which was trained on an extensive text corpus (over 160 GB) and exhibits significant improvements on several benchmarks.[33] [34] Additionally, ClinicalBERT, trained on clinical documentation, stands out as a promising representation of clinical notes.[35] Our study aims to examine the generalizability of pretrained transformer models to inpatient nursing notes and to explore the variability of PHI distribution between i2b2 discharge summaries[32] and narrative inpatient nursing notes, with implications for optimizing de-identification processes in nursing notes.
Objectives
In this study, we aim to address the following questions: (1) What is the generalizability of current pretrained transformer models across different datasets? (2) How does the distribution of PHI in nursing notes differ from that in nonnursing clinical notes? (3) How does this distribution variability impact downstream analysis?
Methods
The study entails the following steps, shown in [Fig. 1]: data preprocessing, pretrained model implementation, evaluation, and statistical analysis. The details of each stage are discussed in the subsections below.
Fig. 1 Overview of the study.
Data Source
The main data source consists of 1,334 raw inpatient nursing notes covering acute care and intensive care units within the cardiac, surgical, and neuro specialties, documented between October 2021 and April 2022 at an academic medical center in the Northeastern United States as part of the CONCERN study.[9] This study was approved by the Institutional Review Board. The raw nursing notes, comprising both structured and unstructured sections, were processed to extract only the narrative components of interest to our study; details are described in a separate publication from our team.[36] Additionally, we included the i2b2 2014 discharge summaries for comparison; access can be requested at https://www.i2b2.org/NLP/DataSets/Main.php. These datasets were manually de-identified and validated by the i2b2 team, and annotated PHI entities were replaced with realistic surrogates.
Data Preprocessing
One fundamental preprocessing step for NLP tasks is tokenization, which breaks texts into subunits for input into transformer models. In this step, the raw narrative texts were split into sentences by leveraging the spaCy model for biomedical text processing ("en_core_sci_lg"). Additional custom clinical rules were applied during tokenization to reduce common missing-space errors and to adapt to common abbreviations (e.g., R.N., Dr., q.n.) in clinical notes.
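For illustration, a minimal sketch of this preprocessing step using spaCy is shown below; the abbreviation rules are illustrative examples rather than the study's exact rule set, and the "en_core_sci_lg" model must be installed separately (via scispaCy).

```python
import spacy
from spacy.symbols import ORTH

# Load the scispaCy biomedical model used for sentence segmentation
# (requires scispacy and the en_core_sci_lg model package to be installed).
nlp = spacy.load("en_core_sci_lg")

# Illustrative custom rules: keep common clinical abbreviations as single
# tokens so their periods are not treated as sentence boundaries.
for abbrev in ["R.N.", "Dr.", "q.n."]:
    nlp.tokenizer.add_special_case(abbrev, [{ORTH: abbrev}])

def split_into_sentences(note_text):
    """Split a narrative note into sentences for downstream tokenization."""
    doc = nlp(note_text)
    return [sent.text.strip() for sent in doc.sents]

print(split_into_sentences("Pt seen by Dr. Smith, R.N. notified. Plan reviewed q.n."))
```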
Model Implementation
In this study, we deployed two pretrained transformers (RoBERTa,[34] ClinicalBERT[35]) fine-tuned for de-identification tasks using the i2b2 2014 discharge summaries.[32] The fine-tuned models were developed by Kailas et al and can be downloaded from Hugging Face: https://huggingface.co/obi. We followed their training instructions[37] and retrained the base models on the i2b2 2014 training set.[32] The training details are summarized in [Table 2]. The trained models can then be used to classify tokens into PHI entities. The steps and additional details of model implementation are shown in [Fig. 2].
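As a minimal sketch of how such a fine-tuned model could be loaded and applied with the Hugging Face transformers library, consider the example below; the exact model identifier (here assumed to be obi/deid_roberta_i2b2 based on the cited repository) and the emitted label names depend on the specific release used.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Assumed model id from the cited Hugging Face organization (https://huggingface.co/obi);
# the ClinicalBERT variant can be swapped in analogously.
MODEL_ID = "obi/deid_roberta_i2b2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)

# Aggregate sub-word pieces back into word-level PHI spans.
deid_tagger = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

sentence = "Seen by Dr. James Smith on 10/21/2021, call back at 617-555-0199."
for entity in deid_tagger(sentence):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```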
Fig. 2 Illustration of the de-identification process. The raw nursing note is split into sentences and tokenized, as described in the "Data Preprocessing" section. The input sequence contains a sentence with a maximum length of 128 tokens, plus 32 tokens on each side of the sentence for contextual enrichment, and is passed into the pretrained transformers (RoBERTa, ClinicalBERT) fine-tuned on i2b2 2014 discharge summaries.[32] RoBERTa and ClinicalBERT are both variants of the BERT (Bidirectional Encoder Representations from Transformers) model.[33] [34] [35] RoBERTa (Robustly Optimized BERT Pretraining Approach)[34] was trained on much larger datasets (over 160 GB of uncompressed text from websites) than the BERT base model, and ClinicalBERT[35] was trained on MIMIC-III clinical texts. The two models were fine-tuned on i2b2 discharge summaries for the de-identification task, which is treated as named entity recognition. The input sequence is passed into the pretrained de-identification models, which output a PHI label for each token, where the label "O" represents non-PHI tokens. PHI, protected health information.
Table 2
Training hyperparameters for RoBERTa and ClinicalBERT models
| Hyperparameter | RoBERTa | ClinicalBERT |
| --- | --- | --- |
| Input length | 128 tokens | 128 tokens |
| Batch size | 32 | 32 |
| Optimizer | AdamW | AdamW |
| Learning rate | 4e-5 | 5e-5 |
| Dropout | 0.1 | 0.1 |
Evaluation and Statistical Analysis
Evaluations were performed on the i2b2 2014 test set[32] and on our inpatient nursing notes. We manually reviewed all inpatient nursing notes alongside the model outputs. The predicted PHI labels were grouped into the following 10 categories: (1) Date, (2) Hospital Name, (3) Location (including all addresses), (4) Patient or Care Partner Name, (5) Staff Name, (6) Identification Number, (7) Phone Number, (8) Age, (9) Organization Name, and (10) Email Address. To evaluate the models under the same standard, we annotated our dataset following the i2b2 2014 annotation guideline, which is consistent with HIPAA standards plus additional specifications for the Age and Date categories. The Date category covers any calendar dates, years, holidays, and days of the week, including seasons ("Fall'02"). The Age category includes any ages mentioned, whereas HIPAA requires only ages 90 and above to be annotated. The Hospital Name category pertains to health care organizations. The Location category refers to any addresses or components of an address (street, zip code). The Patient Name category refers to patients' names or their family members' names, while the Staff Name category covers all names of hospital employees. The Identification Number category includes all medical record numbers, Social Security numbers, provider numbers, etc. The Organization Name category encompasses all non-health care delivery organizations (e.g., a patient's employer). The remaining PHI categories are self-explanatory and follow HIPAA's descriptions.
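As an illustration of this grouping step, the sketch below maps model-emitted entity labels to the 10 study categories; the label names shown are hypothetical stand-ins for the i2b2-style tags a given model may emit, not the exact label set used in the study.

```python
# Illustrative mapping from i2b2-2014-style entity labels to the ten PHI
# categories used in this study; the exact labels emitted by a given model
# may differ, so this table would need to be adapted accordingly.
LABEL_TO_CATEGORY = {
    "PATIENT": "Patient or Care Partner Name",
    "STAFF": "Staff Name",
    "HOSP": "Hospital Name",
    "ORGANIZATION": "Organization Name",
    "LOC": "Location",
    "DATE": "Date",
    "AGE": "Age",
    "PHONE": "Phone Number",
    "EMAIL": "Email Address",
    "ID": "Identification Number",
    "O": None,  # non-PHI tokens
}

def to_category(model_label):
    """Collapse a model-emitted label (possibly BIO-prefixed) to a study category."""
    label = model_label.removeprefix("B-").removeprefix("I-")
    return LABEL_TO_CATEGORY.get(label)
```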
Each model-identified PHI token was labeled as TP (true positive, if correctly categorized), FP (false positive, if misidentified or misclassified), or FN (false negative, for a missed PHI token). We computed the performance metrics (precision, recall, and F1 score) at the token level to evaluate the generalizability of the pretrained de-identification models. In this task, it is inconsequential whether PHI was detected as separate tokens or combined; for example, the tokens "James" and "Smith" are treated the same as the single entity "James Smith" if their PHI labels are identical. A one-sided Mann–Whitney test was applied to compare the distributions of PHI in discharge summaries with those in inpatient nursing notes. To mitigate sample size bias, we randomly sampled inpatient nursing notes to match the size of the i2b2 discharge summary test set. Lastly, we conducted an error analysis on the model with the highest F1 score, with error cases categorized by shared characteristics and patterns.
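A minimal sketch of the token-level scoring and the one-sided Mann–Whitney comparison is shown below, assuming gold and predicted annotations are available as token-position-to-category maps and that per-note PHI counts are available as lists; this is a simplified illustration, not the study's exact evaluation code.

```python
import random
from scipy.stats import mannwhitneyu

def token_level_scores(gold, predicted):
    """Compute token-level precision, recall, and F1.

    `gold` and `predicted` map token positions to PHI categories (non-PHI
    tokens are absent). A predicted token is a TP only when its category
    matches the gold category; otherwise it is an FP. Gold PHI tokens with
    no prediction at all are FNs (missed PHI).
    """
    tp = sum(1 for pos, cat in predicted.items() if gold.get(pos) == cat)
    fp = len(predicted) - tp
    fn = sum(1 for pos in gold if pos not in predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def compare_phi_counts(discharge_counts, nursing_counts, seed=0):
    """One-sided Mann-Whitney U test: are per-note PHI counts greater in
    discharge summaries? Nursing notes are randomly down-sampled to match
    the discharge-summary sample size, mirroring the study design."""
    random.seed(seed)
    sampled = random.sample(nursing_counts, k=len(discharge_counts))
    return mannwhitneyu(discharge_counts, sampled, alternative="greater")
```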
Results
Data Description
The dataset consisted of 1,334 raw nursing notes with a total of 711,829 whitespace-separated tokens. The median number of PHI instances per nursing note was 1, with an interquartile range (IQR) of 3. Of these 1,334 nursing notes, 38.8% (n = 519) did not contain any of the PHI specified under the HIPAA standard. Among the remaining notes, the median number of PHI instances was 2 (IQR: 4). A total of 3,336 PHI instances were found, grouped into 10 categories (see the "Methods" section).
Evaluation
Macro-averaged F1 scores across PHI categories are reported in [Table 3]. For the PHI binary task, which classifies tokens as either PHI or non-PHI, the RoBERTa model achieved an F1 score of 0.932, approximately 4.6% lower than the baseline model evaluated on the i2b2 test set. In our dataset, 160 instances were misclassified as PHI by the RoBERTa model, and the optimal model (RoBERTa) failed to capture 57 PHI instances. Additionally, in the PHI token classification task, the best F1 score was 0.887 measured across all PHI categories, whereas on the i2b2 test set the model achieved an F1 score of 0.956. The total number of error cases found in our nursing notes was 542, of which 89.5% were either misclassifications of PHI into the wrong category or captures of non-PHI tokens. In general, the RoBERTa model outperformed the ClinicalBERT model on our dataset across most PHI categories. Precision and recall were generally higher for the RoBERTa model across both datasets (i2b2, nursing notes) ([Table 4]).
Table 3
Trained RoBERTa and ClinicalBERT models evaluated on the i2b2 test dataset and inpatient nursing notes (F1-measure)

| Model | F1-measure in PHI binary task (PHI vs. non-PHI) | Macro-averaged F1-measure over all PHI categories |
| --- | --- | --- |
| RoBERTa (i2b2 test) | 0.978 | 0.956 |
| RoBERTa (inpatient nursing notes) | 0.932 | 0.887 |
| ClinicalBERT (i2b2 test) | 0.963 | 0.820 |
| ClinicalBERT (inpatient nursing notes) | 0.834 | 0.615 |
Abbreviation: PHI, protected health information.
Table 4
Precision and recall across PHI categories evaluated in inpatient nursing notes
| PHI category | Model | Precision | Recall |
| --- | --- | --- | --- |
| Patient or care partner name | RoBERTa | **0.902** | **0.997** |
| | ClinicalBERT | 0.766 | 0.957 |
| Staff name | RoBERTa | **0.951** | **0.973** |
| | ClinicalBERT | 0.812 | 0.905 |
| Age | RoBERTa | **0.998** | 0.928 |
| | ClinicalBERT | 0.964 | **0.949** |
| Date | RoBERTa | **0.992** | **0.997** |
| | ClinicalBERT | 0.964 | 0.967 |
| Phone number | RoBERTa | **0.922** | **0.941** |
| | ClinicalBERT | 0.367 | 0.333 |
| Email address | RoBERTa | **1.0** | **1.0** |
| | ClinicalBERT | 0.0 | 0.0 |
| Location | RoBERTa | **0.903** | **0.984** |
| | ClinicalBERT | 0.708 | 0.903 |
| Hospital[a] name | RoBERTa | **0.666** | **0.981** |
| | ClinicalBERT | 0.323 | 0.824 |
| Identification number | RoBERTa | **0.118** | **0.966** |
| | ClinicalBERT | 0.105 | 0.917 |
| Organization[b] name | RoBERTa | **0.785** | **0.944** |
| | ClinicalBERT | 0.323 | 0.759 |
Abbreviation: PHI, protected health information.
Note: Higher precision and recall are highlighted in bold.
a All health care delivery organizations.
b Non-health care organizations.
Statistical Analysis
The PHI distributions in the i2b2 discharge summaries and the inpatient nursing notes are summarized in [Table 5], which lists the count for each PHI category along with its proportion of all PHI in the respective dataset. The dominant PHI category in both i2b2 discharge summaries and inpatient nursing notes was Date, followed by health care provider or hospital staff names. Email Address had the lowest number of occurrences in both data sources. The proportions of the Location and Hospital categories were similar across both datasets. The table suggests different levels of PHI exposure across clinical documentation types.
Table 5
Summary of PHI entities across datasets
| PHI category | i2b2 discharge summaries[32] (n = 514) | Inpatient nursing notes (n = 1,334) |
| --- | --- | --- |
| All PHI | 11,283 | 3,336 |
| Names: patient, family member names | 879 (7.79%) | 424 (12.71%) |
| Names: HCPs or any hospital/organization staff names | 2,004 (17.76%) | 476 (14.27%) |
| Ages (all ages mentioned) | 764 (6.77%) | 459 (13.76%) |
| Dates (all calendar dates, years, holidays, etc.) | 4,980 (44.14%) | 890 (26.68%) |
| Contact: phone/fax numbers | 217 (1.92%) | 347 (10.40%) |
| Contact: email | 1 (0.01%) | 7 (0.2%) |
| Locations (street, city, state, zip code, etc.) | 856 (7.59%) | 329 (9.86%) |
| Hospitals (hospital names/abbreviations, pharmacy names, etc.) | 875 (7.76%) | 277 (8.30%) |
| IDs (SSNs, driver licenses, MRNs, translator/nurse/doctor IDs) | 625 (5.54%) | 40 (1.20%) |
| Organizations (non-health care organizations) | 82 (0.73%) | 87 (2.61%) |
Abbreviations: HCP, health care provider; PHI, protected health information.
The frequency correlations between PHI groups are visualized in [Fig. 3], where stronger correlations between variables are shown in darker colors. There are some strong pair-wise correlations (>0.5) between the frequency of patients' or their care partners' (e.g., friend, family) names and phone/fax numbers within our dataset. Therefore, when the name of a patient or care partner is detected in a note, there is a strong possibility that a phone number is also included in the same note. In contrast, no similar frequency correlation was observed in the i2b2 discharge summaries.
Fig. 3 Correlation frequency pair plot for PHI types. PHI, protected health information.
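The pair-wise frequency correlation underlying such a plot can be computed from per-note PHI counts; the sketch below is a minimal illustration using pandas with made-up counts, not the study's analysis code.

```python
import pandas as pd

# Assumed layout: one row per note, one column per PHI category, and values
# equal to the number of PHI instances of that category found in the note.
phi_counts = pd.DataFrame(
    {
        "Patient or Care Partner Name": [1, 0, 2, 0],
        "Phone Number": [1, 0, 1, 0],
        "Date": [3, 1, 0, 2],
    }
)

# Pair-wise correlation between per-note category frequencies; values above
# roughly 0.5 would correspond to the darker cells in a plot like Fig. 3.
correlation_matrix = phi_counts.corr()
print(correlation_matrix.round(2))
```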
The pattern visualization and statistical summary of PHI distributions in the i2b2 corpus and inpatient nursing notes are shown in [Fig. 4]. The distribution heatmaps (A, B) highlight a consistently higher sparsity in the types of PHI found in inpatient nursing notes compared with discharge summaries. In the i2b2 corpus, each note contains a substantial number of PHI instances (median: 18, IQR: 14) and PHI types (median: 6, IQR: 1), encompassing at least two distinct PHI categories (e.g., Date, Staff Name), whereas inpatient nursing notes contain few PHI instances (median: 1, IQR: 3) and a limited number of PHI types (median: 1, IQR: 3). The results of the one-sided Mann–Whitney test for each PHI category are shown in [Table 6]. Overall, the i2b2 2014 discharge summaries contained a statistically greater number of PHI instances in 9 out of 10 PHI categories compared with inpatient nursing notes; no statistically significant difference was found for the Email category.
Table 6
One-sided statistical testing result, displaying the average counts per note in i2b2 2014 discharge summaries (n = 514) and randomly sampled inpatient nursing notes (n = 514)
| PHI category | i2b2 2014[32] | Inpatient nursing notes | p-Value |
| --- | --- | --- | --- |
| Date | 9.689 | 0.667 | p < 0.001[a] |
| Hospital name | 1.702 | 0.208 | p < 0.001[a] |
| Location | 1.665 | 0.247 | p < 0.001[a] |
| Patient or care partner names | 1.71 | 0.319 | p < 0.001[a] |
| Staff name | 3.90 | 0.357 | p < 0.001[a] |
| Identification number | 1.216 | 0.030 | p < 0.001[a] |
| Phone number | 0.422 | 0.260 | p < 0.001[a] |
| Age | 1.487 | 0.344 | p < 0.001[a] |
| Organization name | 0.160 | 0.065 | p < 0.001[a] |
| Email address | 0.005 | 0.0019 | 0.5 |
Abbreviation: PHI, protected health information.
a The p-value is below the significance level (0.01), indicating that a statistically significant difference was detected.
Fig. 4 Pattern comparison of PHI distributions in discharge summaries and inpatient nursing notes. Visualization of the PHI distribution for discharge summaries (A) and inpatient nursing notes (B). The plots (C and D) display the median (shown as a circle) and the ranges (min, max) of PHI counts for discharge summaries (C) and inpatient nursing notes (D). PHI, protected health information.
Qualitative Error Analysis
To further understand the cases that the model (RoBERTa) failed to classify correctly, we grouped the error cases by common characteristics (see [Table 7]). The total number of error cases was 542 across the 1,334 inpatient nursing notes, and around 89% of the errors involved either labeling non-PHI tokens as PHI or misclassifying PHI tokens into incorrect categories. Among false-negative errors, ages were the most common PHI that the model failed to capture (n = 33), attributable to missing spaces between tokens or tokenization errors; of these age error cases, two age entities were above 89. The primary factors contributing to misidentification were parsing and tokenization errors. As for false-positive cases, approximately 33% (n = 160) involved mistagging non-PHI tokens due to failure to recognize clinical domain terms or common word usage in the medical context. For instance, the term "G2P1002" in the sentence "the patient G2P1002 (2 pregnancies, 1 full-term birth, 2 kids at home = twins) with recent admission for possible ectopic pregnancy" was mistagged as the ID category; however, this clinical term refers to the patient's pregnancy and birth history and number of current children. The model also showed a strong tendency to misclassify PHI tokens into incorrect categories (n = 485, 89.5%), especially labeling phone/fax numbers as IDs (n = 212, 39.1%).
Table 7
Error analysis of the optimal model in inpatient narrative nursing notes
| Errors | Category and potential reasons | Number of cases |
| --- | --- | --- |
| False negative (n = 57) | Age: missing space/tokenization errors | 33 |
| | Hospital: hospital name abbreviations not recognized | 4 |
| | Hospital: undetermined[a] | 2 |
| | Phone/fax numbers: parsing errors | 4 |
| | Phone/fax numbers: undetermined[a] | 2 |
| | Staff: undetermined[a] | 4 |
| | Staff: punctuation/tokenization errors | 1 |
| | Other PHI (location, date, organization, ID) | 7 |
| False positive (n = 485) | Tagging non-PHI tokens as PHI: clinical terms/acronyms not recognized | 136 |
| | Tagging non-PHI tokens as PHI: undetermined[a] | 24 |
| | Classifying PHI tokens into an incorrect PHI category: phone/fax numbers and IDs confused | 212 |
| | Classifying PHI tokens into an incorrect PHI category: care partners of patients misclassified as staff | 35 |
| | Classifying PHI tokens into an incorrect PHI category: organizations misclassified into the hospital PHI group | 32 |
| | Classifying PHI tokens into an incorrect PHI category: staff names and hospital names confused | 12 |
| | Classifying PHI tokens into an incorrect PHI category: other scenarios | 34 |
Abbreviation: PHI, protected health information.
a No explicit reason why the algorithm misclassifies entities.
Discussion
This study leveraged two pretrained transformer models fine-tuned on i2b2 discharge summaries to evaluate their generalizability to narrative inpatient nursing notes and to examine the disparity in PHI distributions between discharge summaries and nursing notes. The optimal model achieved an F1 score of 0.887 across PHI categories and 0.932 in the PHI binary task. The error analysis showed that the algorithm failed to recognize 57 PHI instances; of these false-negative cases, age PHI accounted for 57.9% (n = 33). Improving spelling and token-spacing correction could potentially enhance PHI detection, particularly for patient ages. However, it is important to note that under the HIPAA regulation, only ages above 89 are considered PHI, and among the age error cases (n = 33) we detected, only two instances met this criterion; the evaluation outcomes may therefore differ if only ages above 89 are considered PHI.
Additionally, we observed a large percentage of errors in which PHI was labeled with the wrong category, such as predicting phone numbers as IDs or vice versa. Despite these errors, the overall message expressed by the de-identified data may still be preserved and remain accurate. For instance, if the original text were "the patient 123-123-1234 was diagnosed with diabetes" and the de-identified text were "the patient <ID> was diagnosed with diabetes," the central semantic meaning of the sentence would not be altered even though the phone number was mislabeled as an ID. Another major issue is the model's failure to comprehend clinical terms and acronyms, suggesting that relevant medical resources such as the Unified Medical Language System could be integrated into the process. In practical applications, while striving to balance precision and recall, we should prioritize minimizing the risk of PHI exposure.
Another objective of this study was to examine the variability in PHI distribution between discharge summaries and narrative nursing notes. Few studies[38] to our knowledge have investigated the generalizability of a trained de-identification algorithm to inpatient narrative nursing notes. We grouped PHI into 10 categories as outlined in the "Methods" section. We demonstrated that while nursing notes and discharge summaries contain similar PHI types (patient names, addresses, contact information, etc.), discharge summaries contain a significantly higher number of PHI instances per note than narrative nursing notes. Inpatient nursing notes offer more concise, daily patient updates that facilitate professional communication, whereas discharge summaries tend to follow a standardized format to ensure clarity for nonprofessionals.[39] Thus, there may be standards regarding which PHI data must be included in discharge summaries and where they appear within the note, such as discharge/visit dates at the top and doctors' names signed at the end to confirm the discharge. In contrast, PHI in narrative nursing notes showed greater variability and related to a broader set of factors about the patient's personal life. Additionally, compared with the reported results from MIMIC II (Medical Information Mart for Intensive Care II) nursing progress notes,[13] we also observed differences in some PHI distributions (e.g., locations, IDs), suggesting that nursing notes vary across clinical units and institutions and that more representative samples are required when training de-identification algorithms. The variability observed in PHI distribution among data sources and sites has important implications for designing and evaluating automated de-identification systems for nursing notes, which currently remain understudied.
One of the state-of-the-art de-identification algorithms, Philter (Protected Health Information filter), leverages overlapping pipelines of multiple methods, including pattern matching, statistical modeling, and blocked lists.[34] This approach achieved an impressive recall of 99.92% on the i2b2 2014 discharge summaries but came with a compromise in precision (78.58%). Another study utilized XLNet (a Transformer-XL-based model) fine-tuned on i2b2 discharge summaries and reported an F1 score of 0.96. To publish clinical notes, the law mandates that the dataset be certified as free of any information that could reveal an individual's identity, which essentially requires perfect performance from a fully automated de-identification system. Yet, optimization of model architectures such as the transformer models used for de-identification rarely achieves perfection when transferring to different data sources.
On March 14, 2023, OpenAI released GPT-4, a significant advancement in large language models.[40] Although models like RoBERTa and ClinicalBERT can attain accuracy rates over 90%, they typically demand intensive coding effort and significant time; GPT instead provides a rapid and more accessible approach.[41] One research team achieved 99% accuracy using GPT on the i2b2 dataset with optimized prompt engineering.[41] However, real-world constraints limit the use of GPT for de-identification of clinical notes. Currently, only synthetic medical data can be passed to GPT due to privacy concerns, and the lack of synthetic nursing notes also impedes evaluation of GPT on the de-identification task. It is critical to prioritize security and privacy when handling medical data with GPT, underlining the need for relevant mechanisms and regulations.
Alternatively, a more applicable and efficient approach could involve layering simple processes (pattern searching) on top of an automated system to ensure complete de-identification before sharing a dataset. Another approach that could be layered on top of an automated de-identification system is Hiding In Plain Sight (HIPS),[42] which addresses the residual identifier problem. Conventional de-identification algorithms replace PHI with entity placeholders, whereas HIPS replaces the detected identifiers with realistic but synthetic surrogates, making any residual PHI difficult to distinguish from the surrogates.[42] [43] The HIPS technique could serve as a safety net to effectively reduce PHI exposure. Fully automated de-identification algorithms require substantial resources, but semi-automated methods (de-identification ML models followed by light human inspection) combined with the HIPS technique offer an effective and feasible alternative.
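A minimal sketch of how a HIPS-style surrogate substitution and a simple pattern-matching safety net could be layered on top of model output is shown below; the surrogate pools, span format, and regular expression are illustrative assumptions rather than a reference implementation.

```python
import random
import re

# Illustrative surrogate pools; a production HIPS implementation would draw
# from much larger, demographically plausible lists and keep date offsets
# consistent within a record.
SURROGATES = {
    "Staff Name": ["Dr. Alex Morgan", "Dr. Jamie Lee"],
    "Date": ["03/14/2019", "07/02/2018"],
    "Phone Number": ["555-012-3456"],
}

def hips_replace(text, spans):
    """Replace detected PHI spans (start, end, category) with realistic surrogates."""
    # Replace from the end of the text so earlier offsets stay valid.
    for start, end, category in sorted(spans, reverse=True):
        surrogate = random.choice(SURROGATES.get(category, ["[REDACTED]"]))
        text = text[:start] + surrogate + text[end:]
    return text

# Simple pattern-matching safety net for residual identifiers the model missed
# (here: phone-number-like digit patterns).
PHONE_PATTERN = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def residual_scrub(text):
    return PHONE_PATTERN.sub(random.choice(SURROGATES["Phone Number"]), text)

note = "Call Dr. Smith at 617-555-0199 on 10/21/2021."
deidentified = hips_replace(note, [(5, 14, "Staff Name"), (34, 44, "Date")])
print(residual_scrub(deidentified))
```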
Our study has several limitations. The data included in our study are from academic medical centers, possibly limiting generalizability to nonacademic settings. Further exploration of PHI distributions across diverse nursing notes and clinical units is warranted. Despite rigorous de-identification following the HIPAA regulations, there remains a 0.04% risk of re-identification using basic demographics (gender, race).[44] Thus, future studies should reassess privacy protection standards to safeguard individuals' identities.
Conclusion
To conclude, we evaluated two de-identification transformers (RoBERTa, ClinicalBERT) on nursing notes and compared the PHI content of nursing notes with that of discharge summaries. Discharge summaries contained more PHI instances and types, while narrative nursing notes exhibited high variability in the types of PHI present. Openly sharing datasets is an important part of open science, yet techniques for effectively removing all PHI from nursing notes need further exploration. Understanding the PHI distribution will help select appropriate algorithms and develop a customized evaluation schema for PHI.
Clinical Relevance Statement
The study focused on evaluating the generalizability of state-of-the-art models to a different clinical data source and examining the PHI pattern distributions in discharge summaries and inpatient nursing notes. The knowledge gained from this study can enhance both the design and selection of algorithms and offer insights toward developing appropriate evaluation schemas for PHI.
Multiple Choice Questions
- According to HIPAA regulation, which of the following data elements can be retained in the raw text during the de-identification process?
  a. Patient's residential address including specific street name, zip code, and country
  b. Patient's first and last name
  c. Patient's contact email
  d. Patient's description of a common medical condition

  Correct answer: The correct answer is option d. Only data elements that entail no direct identifiers or any information that could identify specific individuals can be retained in their original form. Choices a, b, and c are directly linked to an individual's identity; as part of the regulation, such entities (residential addresses, names, and contact information) are either removed from the text or replaced by corresponding surrogates. Choice d, referring to the description of a common medical condition, does not disclose any specific individual or small group of people.
- What is the primary purpose of clinical data de-identification?
  a. To optimize the clinical workflows
  b. To improve the accuracy of disease diagnoses
  c. To protect patient privacy
  d. To reduce data size for storage

  Correct answer: The correct answer is option c. The primary goal of clinical data de-identification is the protection of patients' health care data, ensuring data security and privacy. It is achieved by removing or anonymizing all protected health information elements. Only choice c is consistent with the primary purpose of clinical data de-identification.
- Which of the following statements is true about clinical data that have been de-identified automatically by a system with a previously reported accuracy of 99%?
  a. They can only be shared directly with some affiliated organizations without any privacy consideration.
  b. They require further human validation and examination of any re-identification risks.
  c. They are exempt from data protection regulations.
  d. They can be released publicly with no restrictions.

  Correct answer: The correct answer is option b. Even if the data have been de-identified by a robust system, we still need to perform validation to ensure the removal of all PHI before the data can be shared with any organization. Additionally, we need to check for remaining pieces of information that could be combined and potentially used to re-identify individuals.