Keywords
data quality - electronic health record - secondary use - real-world data
Background and Significance
Medical real-world data (RWD) is increasingly used for research studies. Electronic medical records (EMRs) and claims data are major sources, with more focused data repositories such as tumor registries and clinical departmental systems also being mined. It is important that users of RWD are aware of the limitations of a dataset that may not be explicitly apparent. For instance, given the various sources of patient medications (inpatient and outpatient hospital-based dispensing, community pharmacies), the completeness of a dataset's medication data may not be evident.[1]
Other RWD limitations arise from patient privacy regulations.[2] Access to these data is governed by the Health Insurance Portability and Accountability Act (HIPAA), which requires either that research access be approved by an institutional review board (IRB) or that the data be deidentified per the HIPAA privacy rule.[3] [4] [5] To be deidentified, all protected health information (PHI) must be removed from a dataset. PHI includes direct identifiers such as name, address, social security number, and medical record number. In addition, dates may be specified only by year. Compliance with this restriction on dates poses both technical and analytical problems. Database date formats typically require specifying a month and day as well as the year. If this technical problem is addressed by setting all dates to the same arbitrary day and month, such as January 1, then all temporal relationships among observations within a given year are destroyed (as they would be if only the year were specified).
As an alternative to deidentified data, HIPAA also provides for “limited datasets” (LDS) in which direct identifiers (such as name and address) cannot be included, but actual complete dates for patient encounters may be used. IRB approval must be obtained for deployment of an LDS, and justification for complete dates must be provided. Many institutions employ LDS in their research clinical data repositories. However, even though LDS allow full actual dates, many health care organizations (HCOs) still alter these dates as an additional measure to increase the protection of patient privacy, retaining temporal relationships but not actual temporal values.
Shifting of these encounter dates became a common method employed by these institutions. This approach aims to preserve the temporal integrity of a patient's chart by consistently shifting all the dates in the chart by the same randomly chosen number of days.[6] [7] While the number used to shift dates varies randomly among patients, it is constant within a single patient's data. So, for a specific patient, a date-shifted prostate biopsy precedes an initial radiation treatment by the same number of days as in the actual data, even though the shifted dates differ from the actual dates. Typically, dates are shifted randomly anywhere from ± 7 to ± 365 days. This algorithm of shifting dates by a constant random value for a given patient, but varying the constant value between patients, up to some maximum number of days, is the common method used across multiple institutions. Only one variation was observed, in which a site selected only random multiples of 7 days to shift dates. This preserved the day of the week, which is potentially of value for certain studies. No evidence of other shifting algorithms was observed in the sites we evaluated.
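The common algorithm described above can be illustrated with a minimal sketch. The function name, its parameters, and the example dates are hypothetical and are not taken from any institution's implementation; in this sketch an offset of zero is also possible.

```python
import random
from datetime import date, timedelta


def shift_patient_dates(records, max_shift_days=365, multiple_of=1, seed=None):
    """Shift every date of a patient by one random, patient-specific offset.

    records: dict mapping a patient id to a list of datetime.date values
    max_shift_days: largest absolute offset allowed (hypothetical parameter)
    multiple_of: 7 restricts offsets to whole weeks, preserving day of week
    """
    rng = random.Random(seed)
    max_steps = max_shift_days // multiple_of
    shifted = {}
    for patient_id, dates in records.items():
        # One offset per patient, reused for all of that patient's dates.
        offset = timedelta(days=rng.randint(-max_steps, max_steps) * multiple_of)
        shifted[patient_id] = [d + offset for d in dates]
    return shifted


# Hypothetical charts: two dated events per patient.
records = {
    "patient_a": [date(2019, 3, 4), date(2019, 3, 25)],
    "patient_b": [date(2019, 6, 10), date(2019, 7, 1)],
}
shifted = shift_patient_dates(records, max_shift_days=365, seed=0)
weekly = shift_patient_dates(records, max_shift_days=364, multiple_of=7, seed=1)
```

Within each patient the interval between the two events is unchanged after shifting, and with `multiple_of=7` the day of the week is unchanged as well.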
The primary motivation for date shifting in the United States is HIPAA's base requirement that dates have only the year specified. Adherence to this for encounter dates is clearly unsuitable for many studies. Researchers and IRB panels may view limited datasets, in which encounter dates can have their actual values, as posing too much of a risk to patient privacy. Some researchers and panel members may not even be aware of the possibility of using limited datasets. And in some instances, date shifting may be seen as added assurance of maintaining patient privacy, although the costs to studies where actual dates are necessary are likely not considered.
It should be noted that the ability of date shifting to protect patient privacy has been questioned.[8] [9] Dates can be inferred, and patients identified by observing clinical content or family relationships. A malevolent attempt to uncover patient identities, or even one motivated by curiosity, is now punishable by law and is grounds for job dismissal at most institutions. Of course, it can be argued that obscuring dates certainly protects against technically unsophisticated individuals attempting to identify patients and is definitely of value when a data breach has occurred.
While the efficacy of date shifting to protect patient privacy is defensible, there are situations in which date shifting significantly detracts from the usefulness of RWD, even making it altogether unusable for particular research purposes. These are instances where having access to actual dates is more important than a temporal relationship among a patient's clinical encounters. An example of a need for actual dates is a study requiring an accurate number of patients initially diagnosed with liver cancer in the calendar year 2017. It cannot be assumed that random date shifting results in an equal number of patients being shifted in and out of the date range being studied. Another example is a study that needs to know the number of coronavirus disease 2019 (COVID-19) patients diagnosed within a specific time period, given the peculiarities of timelines associated with viral variant emergence and approvals of treatments and vaccines.[10] [11] The fact that temporal relationships within a patient's data have been preserved is irrelevant. These studies will not yield valid results if the dates within the real-world dataset have been shifted.
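A small worked example may make the calendar-year problem concrete. The dates and offsets below are invented purely for illustration; they show how a count of first diagnoses in 2017 can change once each patient's date is moved by its own offset.

```python
from datetime import date, timedelta


def count_in_year(diagnosis_dates, year):
    """Count first-diagnosis dates that fall in the given calendar year."""
    return sum(1 for d in diagnosis_dates if d.year == year)


# Hypothetical first-diagnosis dates, one per patient.
actual = [date(2016, 12, 20), date(2017, 1, 5), date(2017, 12, 28)]
# Hypothetical per-patient offsets in days, as a date-shifting step might assign.
offsets = [-40, -30, 15]
shifted = [d + timedelta(days=o) for d, o in zip(actual, offsets)]

actual_2017 = count_in_year(actual, 2017)    # two patients diagnosed in 2017
shifted_2017 = count_in_year(shifted, 2017)  # none remain in 2017 after shifting
```

Here both patients truly diagnosed in 2017 are moved out of the year and none is moved in, so the shifted count understates the actual one.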
Objectives
Regardless of whether date-shifted RWD is appropriate for a given study, researchers should be aware of whether the dates of a dataset being used have been altered. Perhaps surprisingly, the authors' experience with datasets from multiple institutions has been that this information is not easy to obtain. Not only is it not always readily available, but it may not even be known by the current owners of the dataset. Because it is not uncommon for reported date shifting status to be unreliable or difficult to obtain, in this study we set out to develop analytic metrics to detect the presence of date shifting and estimate its maximum magnitude in a given dataset. We have defined and evaluated the reliability of several such metrics. We propose that this methodology be used by data analysts to ascertain whether a dataset they are working with contains actual or altered dates. Knowing this is of crucial importance when a dataset is being used for such use cases as infectious disease research. This methodology is meant to inform an analyst of the presence as well as an approximate magnitude of a date shift. It is not intended to, nor is it capable of, deriving actual dates in date-shifted data.
Methods
Feasibility of Detection
Given that certain studies require actual dates, it is crucial to know whether a particular dataset contains actual or shifted dates before utilizing it. This might be learned from a dataset's provenance, if available. But even if it is, it should be verified. A reliable method for detecting possible shifted dates, therefore, is required prior to any analysis of data.
Our first approach to creating a date shifting detection method was checking a dataset to see whether a specific occurrence of medical significance occurred on the expected date. A recent event presents a useful example: on March 19, 2020, then President Trump declared hydroxychloroquine—an anti-malarial drug treatment—a “game changer” in the fight against COVID-19.[12] Many datasets in the United States subsequently showed a pronounced spike in the volume of hydroxychloroquine use ([Fig. 1]). This suggested that “sharp” temporal events—hydroxychloroquine spike and others like it—can serve as markers to detect the presence of date shifting. In data where dates have been shifted, a sharp spike may disappear, or appear in the wrong month, indicating that date shifting has occurred.
Fig. 1 Patients treated with hydroxychloroquine show a spike at a large academic institution in 2020.
However, the hydroxychloroquine spike, while a great illustration of a useful temporal marker, has a limited applicability; only datasets with medication information and preferably well-represented outpatient coverage would be expected to demonstrate this feature. Of course, the dataset also must cover early 2020 to observe this event.
Using other possible sentinel dates entails similar potential pitfalls. The date of the chosen event may not be as unique as thought. For example, taking the earliest instance of International Classification of Diseases (ICD)-10-CM coding as an indication of the start of the use of ICD-10 as opposed to ICD-9 ignores institutions that delayed implementation of the new coding after its official start. Also, it would not work at all for datasets in which all prior ICD-9 coding was translated to its ICD-10 equivalent.
In addition to a singular event indicating date shifting, several other temporal features which could give an indication of date shifting were studied. An obvious candidate was the occurrence of seasonal medical events, such as influenza diagnosis or heat stroke. One would expect surges in these diagnoses during appropriate months (fall and winter or summer, respectively, for these examples). This approach of looking for increased numbers of certain medical diagnoses appearing in expected months led to yet another method: the day-of-the-week differences in the occurrence of medical procedures that would not typically occur on weekends, such as routine physicals and elective medical procedures.
Synthetic Models
To detect potential shifts of different magnitudes, we wanted to simulate how various temporal “markers” would appear when dates were shifted by various amounts. We hypothesized that small magnitude date shifts would obscure high-frequency (e.g., weekly) temporal events, whereas large magnitude shifts would be necessary to obscure events that happen only infrequently (e.g., once a year). We studied three date patterns:
- A pattern that occurs with high frequency is the drop in volume of observations during weekends. For example, in the United States, it is very unlikely that an annual physical would be scheduled on a Sunday.
- A seasonal pattern where a disease is more prominent in the summer or winter.
- A one-time drop, as in the case of patients postponing elective care at the beginning of the COVID-19 pandemic.
For each of these scenarios, we modeled the behavior of these patterns when shifting dates by various amplitudes (specifically, the maximum amplitude in the common date shifting algorithm as described above) using synthetic data designed to exhibit the given pattern. This allowed us to calibrate and understand, for example, how much shift is needed to obscure a weekday/weekend or a seasonal or yearly pattern. We later compared this information with the patterns observed in real-world datasets and judged the possible presence of each degree of shift. In [Fig. 2] we see the pattern observed for an encounter, such as a checkup, which normally occurs only Monday through Friday. When the synthetic data are not shifted (i.e., number of days shifted = 0) we see almost no counts on Sunday and Saturday, with any small counts attributable most likely to data entry error or special screening events. As the number of days shifted assumes greater values (± 3, ± 7, ± 30, ± 90, ± 365), we see a smoothing of the encounter counts over the days of the week such that there is little difference between the weekend and weekdays. To an analyst unaware of a potential date shift, this would erroneously imply that checkups are as likely to occur on the weekend as during the week.
Fig. 2 Weekly pattern using synthetic data. The number above each graph indicates the number of days shifted.
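The weekly synthetic model can be approximated with a short simulation. This is a sketch under stated assumptions, not the authors' model: the synthetic visits are drawn uniformly from weekdays in an arbitrary two-year window, offsets are drawn uniformly from the shift range, and all function names are hypothetical.

```python
import random
from collections import Counter
from datetime import date, timedelta


def synthetic_weekday_visits(n_patients, visits_per_patient, rng):
    """Generate per-patient visit dates that fall only on Monday through Friday."""
    start = date(2018, 1, 1)  # arbitrary start of a two-year window
    patients = []
    for _ in range(n_patients):
        visits = []
        while len(visits) < visits_per_patient:
            d = start + timedelta(days=rng.randrange(730))
            if d.weekday() < 5:  # 0=Monday ... 4=Friday
                visits.append(d)
        patients.append(visits)
    return patients


def weekend_share(patients, max_shift, rng):
    """Apply one random offset in [-max_shift, max_shift] per patient and
    return the fraction of visits landing on Saturday or Sunday."""
    counts = Counter()
    for visits in patients:
        offset = timedelta(days=rng.randint(-max_shift, max_shift))
        for d in visits:
            counts[(d + offset).weekday()] += 1
    return (counts[5] + counts[6]) / sum(counts.values())


rng = random.Random(42)
patients = synthetic_weekday_visits(2000, 3, rng)
# Same maximum shift values as the panels described for Fig. 2.
shares = {m: weekend_share(patients, m, rng) for m in (0, 3, 7, 30, 90, 365)}
```

With no shift the weekend share is zero by construction; as the maximum shift grows, the share moves toward the value expected if visits were spread evenly over all seven days.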
[Fig. 3] shows the distortion seen when data with a one-time drop caused by a sentinel event in certain encounters are date shifted. In the synthetic dataset an abrupt drop in months 16 through 18 is totally obscured as shifting values are increased.
Fig. 3 One-time drop caused by a sentinel event using synthetic data. The number above each graph indicates the number of days shifted.
In [Fig. 4] we see the effect of date shifting on synthetic data for seasonal encounters such as influenza. Once again, increasing the number of days shifted obscures peaks actually observed during certain seasons.
Fig. 4 Yearly pattern of seasonal events using synthetic data. The number above each graph indicates the number of days shifted.
Our experimentation with synthetic models confirmed that date shifting affects the appearance of expected temporal patterns such as volume fluctuations between weekdays and weekends, significant events such as the COVID-19 pandemic, and seasonal disease patterns. Furthermore, the models demonstrated that the maximum magnitude of date shift affects these patterns in a predictable manner, obscuring them in proportion to the magnitude of the shift.
Real-World Data
The TriNetX Global Network of data from health care organizations was utilized for this study.[13] The patient data available are from EMR systems, tumor registries, and departmental systems (e.g., pathology). We hypothesized that observations of the following medical encounters for the health care organizations in our study would exhibit seasonal patterns when date shifting was not employed. The ICD-10-CM codes for these encounters are noted in parentheses:
Sunburn, heatstroke, URI, and influenza all should have a strong yearly pattern, meaning a few months of very high volume, a few months of a shoulder season, and otherwise relatively low volume. If this pattern is not observed, it can be assumed the HCO is shifting, perhaps by as many as 365 days. To evaluate whether this pattern was present we considered only sites with at least 500 observations of each of these specific diagnoses and determined the distribution of these encounters by computing monthly sums (ignoring the year) and then the median of the monthly sums. If a site exhibited at least 2 months with the number of encounters for a diagnosis exceeding 1.5 times the median value, then we concluded the data were not shifted. Correspondingly, if the monthly encounter numbers were more uniformly distributed, the site was deemed to have provided date-shifted data.
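The seasonal rule just described (at least 500 observations, monthly sums ignoring the year, and at least 2 months above 1.5 times the median) can be sketched as follows. The function name, its parameters, and the test data are hypothetical; the thresholds are those stated in the text.

```python
from collections import Counter
from datetime import date
from statistics import median


def has_seasonal_pattern(encounter_dates, min_obs=500, ratio=1.5, min_months=2):
    """Return True if a seasonal peak is present, False if the months look
    uniform, or None if the site has too few observations to evaluate."""
    if len(encounter_dates) < min_obs:
        return None
    monthly = Counter(d.month for d in encounter_dates)  # year is ignored
    sums = [monthly.get(m, 0) for m in range(1, 13)]
    threshold = ratio * median(sums)
    peak_months = sum(1 for s in sums if s > threshold)
    return peak_months >= min_months


def make_dates(per_month):
    """Build invented encounter dates from a {month: count} mapping."""
    return [date(2019, m, 1 + i % 28) for m, n in per_month.items() for i in range(n)]


peaked = make_dates({m: (300 if m in (1, 2) else 20) for m in range(1, 13)})
flat = make_dates({m: 60 for m in range(1, 13)})
```

Under this rule `has_seasonal_pattern(peaked)` is true (interpreted as not shifted) and `has_seasonal_pattern(flat)` is false (interpreted as date shifted).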
We further hypothesized that nonacute visits, such as routine physical encounters or encounters for dermatitis, would follow a Monday–Friday occurrence, with infrequent weekend occurrences in unshifted data:
- Routine physical (Z00)
- Dermatitis (L20–L30)
Routine checkups and treatment for dermatitis would have likely been postponed during March and April 2020 due to COVID-19. Once again, only sites with at least 500 observations for routine checkups and/or dermatitis treatment were considered. We computed the median of monthly encounter sums and concluded that any site having a decrease in such encounters of at least 80% in March 2020 and/or at least 50% in April 2020 had not date shifted their data. Otherwise, a shift of 30 days or more would be assumed. We selected these thresholds because we hypothesized that these values would capture substantial changes and not minor variation.
In addition, a previous study[10] by our group showed decreases in breast and colorectal cancer screenings of 89% and 84% for April 2020.
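The sentinel-event rule for March and April 2020 might be sketched as below. The text does not say which months enter the median, so this sketch assumes every calendar month between the first and last encounter in the data; that choice, the function name, and the test data are assumptions.

```python
from collections import Counter
from datetime import date
from statistics import median


def shows_covid_drop(encounter_dates, min_obs=500, march_drop=0.80, april_drop=0.50):
    """Return True if March and/or April 2020 volume fell far enough below the
    median monthly volume, False if not, None if the site cannot be evaluated."""
    if len(encounter_dates) < min_obs:
        return None
    monthly = Counter((d.year, d.month) for d in encounter_dates)
    # Assumption: the median covers every calendar month in the data span.
    first, last = min(encounter_dates), max(encounter_dates)
    months = []
    y, m = first.year, first.month
    while (y, m) <= (last.year, last.month):
        months.append((y, m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    if (2020, 3) not in months or (2020, 4) not in months:
        return None  # the data do not cover the sentinel period
    baseline = median(monthly.get(key, 0) for key in months)
    if baseline == 0:
        return None

    def drop(key):
        # Fractional decrease of one month relative to the median month.
        return 1 - monthly.get(key, 0) / baseline

    return drop((2020, 3)) >= march_drop or drop((2020, 4)) >= april_drop


def monthly_dates(counts):
    """Build invented encounter dates from a {(year, month): count} mapping."""
    return [date(y, m, 1 + i % 28) for (y, m), n in counts.items() for i in range(n)]


flat = {(y, m): 50 for y in (2019, 2020) for m in range(1, 13)}
dipped = dict(flat)
dipped[(2020, 3)] = 5
dipped[(2020, 4)] = 20
```

A true result corresponds to the text's conclusion that the site has not shifted its dates; a false result corresponds to an assumed shift of 30 days or more.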
Similarly, we expect checkups and dermatitis to occur mainly on weekdays and not weekends. The weekday distribution of routine physicals and dermatitis treatments was computed and then the median of the weekday encounter sums obtained. If the sums for Saturday or Sunday exceeded 0.25 times the median, we concluded a shift of at least 7 days. [Fig. 5] summarizes the decision tree applied to the encounter data from the sites in our study.
Fig. 5 Outline of the procedure used to detect the presence and degree of date shifting in the datasets studied.
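The day-of-week test can be sketched in a few lines. One point is an assumption: the median here is taken over all seven day-of-week sums, since the text does not say whether Saturday and Sunday are included. The function name and test data are hypothetical, and the full decision tree of Fig. 5 is not reproduced.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import median


def weekend_excess(encounter_dates, min_obs=500, ratio=0.25):
    """Return True if Saturday or Sunday volume exceeds ratio * median of the
    day-of-week sums, False if not, None if there are too few observations."""
    if len(encounter_dates) < min_obs:
        return None
    by_day = Counter(d.weekday() for d in encounter_dates)  # 0=Mon ... 6=Sun
    sums = [by_day.get(i, 0) for i in range(7)]
    # Assumption: median over all seven day-of-week sums.
    threshold = ratio * median(sums)
    return sums[5] > threshold or sums[6] > threshold


monday = date(2019, 1, 7)
# Invented data: 700 encounters on weekdays only, then 700 spread over all days.
weekdays_only = [monday + timedelta(days=(i // 5) * 7 + i % 5) for i in range(700)]
every_day = [monday + timedelta(days=i) for i in range(700)]
```

A true result corresponds to the text's conclusion of a shift of at least 7 days.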
Results
We applied our date shifting detection methods—looking at sentinel events (COVID-19 pandemic), seasonal patterns for certain diagnoses, and weekday patterns for elective encounters—to 76 sites in the United States. Twenty-two sites exhibited a conflict, an inconsistency between our preexisting records of the presence of data shifting and comparison of the observed data patterns with our synthetic models. We contacted those 22 institutions in an attempt to reconcile these conflicts. We established a dialog with 17 of 22 sites where we asked the sites to double-check their date shifting status and we reevaluated our interpretation of the model to arrive at a consensus. The remaining five sites did not respond to our inquiries, and we excluded these sites from further analyses.
These conflict resolution attempts allowed us to fine-tune our date shifting detection methodology. Most significantly, we made the decision to remove seasonal URI from consideration as it did not appear as predictive as anticipated. Respiratory diseases do not appear to be as sharply seasonal as we originally hypothesized. When we compared URI with the other seasonal predictors, we found that it was more likely to give false positives, potentially due to differences in onset and duration of URI waves by location within the United States, or due to differences in how the URI diagnoses are captured.
Seasonal predictors also depend on geographic location, most fundamentally on whether the data source is in the northern or southern hemisphere. Our decision to confine this study to U.S. data providers was primarily based on removing geographic considerations from seasonal predictors, although north–south location within the United States is still a factor in the onset and degree of seasonal predictors.
For the 71 HCOs in this study, [Fig. 6] shows the observed presence of date shifting. Thirty-nine organizations, or 55%, displayed no date shifting by our methodology. This conclusion was confirmed by individual data providers. Our methodology concluded that another 28 organizations showed some amount of date shifting, which was also confirmed by the providers. For four organizations our methodology's conclusions differed from the provider's description and we were not able to resolve the conflict to a mutual satisfaction.
Fig. 6 Observed presence of date shifting by study methodology, which was in agreement with data provider's description of their dataset. Datasets where there was disagreement between the study's methodology and the provider on the presence of date shifting are shown as “conflict.”
[Fig. 7] shows the distribution of the number of days for which dates were shifted among the 28 HCOs in the study for which the observed date shifting was confirmed by the data provider. A quarter of the shifting institutions do so by 7 days, and a fifth by 365, with the rest in the middle.
Fig. 7 Distribution of the magnitude of the date shift (in days) for the 28 health care organizations with confirmed observed date shifting.
Discussion
Prediction of Date Shifting Presence
The most reliable encounter observation for predicting date shifting is the routine medical checkup (ICD-10-CM Z00) when its distribution over day-of-week is extracted ([Table 1]). In the United States these encounters should rarely if ever occur over the weekend (Saturday or Sunday). This conclusion is applicable only to countries like the United States where these exams are normally scheduled for Mondays through Fridays exclusively. Other medical encounters, such as for sunburn, are not reliably tracked in datasets. Encounters for upper respiratory infections or influenza, while having seasonal increases, are not sufficiently distinct during certain months.
Table 1 Correlation between observed encounter and date shift detection
Measure | Yearly/day-of-week | Correlation | Number of HCOs
Sunburn | Yearly | 0.71 | 30
Influenza | Yearly | 0.69 | 70
URI | Yearly | 0.20 | 70
Checkup | Yearly | 0.25 | 61
Dermatitis | Yearly | 0.53 | 70
Checkup | Day-of-week | 0.92 | 61
Dermatitis | Day-of-week | 0.79 | 70
Abbreviations: HCO, health care organization; URI, upper respiratory infection.
Note: The strength of the correlation demonstrates the measure's ability to correctly predict date shifting.
[Fig. 8] shows the distribution by day of week of routine medical checkup encounters for the HCOs studied. The left plot shows the expected pattern with almost no encounters occurring on Sunday or Saturday. The data provider confirmed that no date shifting was applied to this dataset. The right plot shows an almost uniform distribution of these encounters over every day of the week. The degree of date shifting for this dataset, confirmed by the data provider, was ± 7. Thus, date shifting of any magnitude from a week to a year will obliterate the expected weekday-only occurrence of routine medical checkups.
Fig. 8 Distribution by day of week of routine medical checkup encounters for the health care organizations studied.
Another limitation of using day of week occurrence of routine encounters was observed for the single data provider who chose to shift their data by a multiple of 7 days, thus preserving the day of week of the encounter. This form of date shifting can be exposed by the unexpected volume of routine encounters on holidays such as Christmas day, New Year's Day, Memorial Day, Independence Day, Labor Day, and Thanksgiving.
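A holiday check of this kind might be sketched as follows. The holiday rules used here (last Monday of May, first Monday of September, fourth Thursday of November) are the standard U.S. definitions; the function names are hypothetical, no cutoff is proposed because the text gives none, and observed-day adjustments for holidays falling on weekends are ignored.

```python
from datetime import date, timedelta


def nth_weekday(year, month, weekday, n):
    """Date of the n-th given weekday (0=Monday) in a month."""
    first = date(year, month, 1)
    return first + timedelta(days=(weekday - first.weekday()) % 7 + 7 * (n - 1))


def last_weekday(year, month, weekday):
    """Date of the last given weekday (0=Monday) in a month other than December."""
    last = date(year, month + 1, 1) - timedelta(days=1)
    return last - timedelta(days=(last.weekday() - weekday) % 7)


def us_holidays(year):
    """The six holidays named in the text, for one calendar year."""
    return {
        date(year, 1, 1),             # New Year's Day
        last_weekday(year, 5, 0),     # Memorial Day
        date(year, 7, 4),             # Independence Day
        nth_weekday(year, 9, 0, 1),   # Labor Day
        nth_weekday(year, 11, 3, 4),  # Thanksgiving
        date(year, 12, 25),           # Christmas Day
    }


def holiday_share(encounter_dates):
    """Fraction of encounters falling on one of the listed holidays.
    Expects a non-empty list of datetime.date values."""
    years = {d.year for d in encounter_dates}
    holidays = set().union(*(us_holidays(y) for y in years))
    return sum(d in holidays for d in encounter_dates) / len(encounter_dates)
```

For routine encounters in unshifted data this share would be expected to be close to zero, so a noticeably larger share is a hint that dates were moved, even when the day of the week was preserved.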
No Potential for Date Reidentification
Concern that this methodology may be improperly used to restore (reidentify) actual encounter dates is understandable. It should be kept in mind that use of actual encounter dates is permissible if IRB approval has been obtained for use of an LDS. As mentioned in the Background section above, whenever individual patient data are loaded or refreshed in a dataset, a random patient-specific date offset value within the range chosen by the provider is applied to that patient's dates. For example, if the dataset offset range is ± 365, the first patient's data may randomly be assigned an offset value of −34, so all dates in that patient's data have 34 days subtracted. The second patient in the dataset may be randomly assigned a date offset of +182, so all dates in that patient's data have 182 days added. Even after our methodology has determined that the entire dataset is date shifted by up to 365 days, it is not evident how one would know what offset value to apply to the dates in each patient record to undo the shift (in this example, adding 34 days for the first patient while subtracting 182 days for the second patient). Possibly one could make multiple attempts, adjusting each patient's dates by all possible combinations of date offset values up to 365 to find the adjusted dataset with the minimal deviance from nonurgent encounters occurring on weekdays. But there is no guarantee that this minimum-deviance dataset is correct or unique. To reiterate, the methodology we are proposing to detect the presence and maximum amplitude of a date shift is not meant to circumvent deidentification in any way, nor can the authors see how it could be used for this purpose.
Our study shows that shifting by a varying random number of days for each patient's chart provides a signal that the dates have been shifted. This signal is desirable as it indicates the dataset is not suitable for studies in which actual dates are necessary. However, if the intent is to confuse or hide the date shifting from the research data consumer, then this can be achieved by using nonrandom numbers of days for shifting, particularly multiples of 7. We do not recommend this obscuring of the application of date shifting.
Conclusion
The obscuring effect of date shifting diminishes the usefulness of RWD for studies when actual dates are required. Dataset provenance is not always reliable even when available. Knowledge of whether date shifting has occurred is necessary when using real-world datasets. The objective of this study was to develop methods for detecting date shifting of encounter data in datasets but not for date reidentification.
We found a simple test to detect the existence of date shifting in an EMR dataset from U.S. HCOs: checking whether almost all routine medical exams or similar nonurgent encounters were performed on weekdays. When they were not, a date shift of at least 7 days is indicated. Other measures such as the patterns of seasonal illnesses or sentinel temporal events allowed us to detect the magnitude of the shift.
For non-U.S. data sources, or datasets not containing routine physical encounters, a comparable test could be developed using other nonurgent encounters having a customary distribution over certain days of the week.
Slightly over one-half (39 of 71, 55%) of institutions in our U.S. sample do not shift dates. Of those that shift, a quarter do so by ± 7 days and about one-fifth by ± 365 days, with the rest shifting by various amounts in between. The heterogeneity of the shift magnitude is telling; there does not seem to be a widely accepted agreement on the application of date shifting among HCOs.
Clinical Relevance Statement
This study provides methodology for users of RWD to ascertain whether the dates in the data being used have been shifted and, if so, to what degree. This contributes to the validity of results from studies utilizing RWD.
Multiple-Choice Questions
1. What is the most reliable encounter type for detecting date shifting?
a. Emergency surgeries
b. Routine physicals
c. CT scans
d. Phlebotomy
Correct Answer: The correct answer is option b. These encounters are scheduled and do not occur unexpectedly. They are usually scheduled only on certain days of the week. (For example, routine physicals usually occur only on weekdays Monday through Friday in the United States.) Hence, they should display an appropriate day-of-week pattern.
2. Can real-world datasets ever contain actual complete dates and be HIPAA Safe Harbor-compliant?
a. No, never
b. Yes, if dates are at least 5 years ago
c. Yes, if it is an LDS
d. Yes, if leap years are not included
Correct Answer: The correct answer is option c. Under HIPAA, an LDS is health information that excludes certain direct identifiers (such as patient's name) but may include certain dates with all elements. Use of an LDS requires IRB approval and is not considered deidentified data under the privacy rule.