Keywords
electronic health records - information storage and retrieval - health system - clinical informatics research - data quality
Background and Significance
Clinical informatics researchers often depend on the reusability of electronic health record (EHR) data to design many of the new methods and systems that improve clinical practice and research. For example, innovations such as those that streamline research subject selection from patient populations require access to patient data.[1] Other applications that provide precision medicine at the point of patient care must compute solutions using these data.[2] This research encompasses a wide variety of data reuse, including retrospective analyses, experimental system development, and data modeling. However, EHRs often fall short of this need for reusable data, either lacking the information entirely or storing it in a format that requires time-consuming revisions for machine interpretability.[3][4] Clinical informaticians are reporting some of these deficiencies,[5] while making do with extracting and inferring patient information from current EHRs for various purposes.
Given that broader discussions about redesigning the EHR are taking place,[6][7] it is timely to examine the system's current state in order to advance the EHR and address its shortcomings for clinical informatics researchers. Studies show that EHRs continue to miss important patient data[8] or provide other information in a form that is not machine-processable, complicating data analysis.[3] Both shortcomings are critical to overcome for clinical informatics research and suggest that the data content of these systems requires attention. It is imperative to store information according to data storage standards and to capture data types properly. Moreover, updating the information captured by the EHR and revising its storage and retrieval methods will be important for advancing health systems toward a learning health system.[9]
To catalogue EHR shortcomings that limit data reusability, we have conducted a scoping review of the research informatics literature to identify categories of data content that are inadequate or need revision. These categories can serve as a foundation for establishing some of the data requirements for the next generation of EHRs.
Materials and Methods
We conducted a review of the informatics literature to identify discussions regarding the limited reusability of EHR data and to group these discussions into meaningful categories. To accomplish this task, we first performed a broad, preliminary search to locate the journals that most frequently contained articles with this type of content. We selected the journals by using the broad search term “electronic health records” in PubMed without any other limitation and scanning the first 2 years of the results. After narrowing the focus of the search based on this preliminary investigation, we used a standardized search strategy and an iterative expert review process (discussed later) to identify the data content categories through consensus. The iterative process was used to make category creation and article categorization as uniform as possible. This study design was selected to encompass a broad overview of the discussions in the literature and to provide a new perspective on areas of EHR data that need to be addressed for reusability. While the study design primarily follows the strategy set forth in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline,[10] several of the elements in the PRISMA checklist were not relevant to the present study, as shown in the [Supplemental Material], available in the online version.
Search Strategy
We searched the PubMed database for informatics literature published between January 2010 and March 2016 to find recent, relevant English-language articles that addressed current limitations of EHR data. Our broad, preliminary search through PubMed yielded several journals with greater informatics relevance to the topic, and we therefore limited our search to the following journals:
- Journal of the American Medical Informatics Association.
- New England Journal of Medicine.
- Journal of Pathology Informatics.
- Methods of Information in Medicine.
- International Journal of Medical Informatics.
- Journal of Medical Internet Research.
- Proceedings of the AMIA Annual Fall Symposium.
Additionally, the following inclusion criteria were created to limit the scope of the review to address data content in the EHR and the specific limitations currently present in the system:
- Data not stored in the EHR that would be useful for patient care or research.
- Data that would be more efficient to store in the EHR but are typically derived from the information already stored.
- Data stored in an inefficient manner for downstream use (e.g., research or clinical decision support [CDS]).
Articles were excluded from the review if they met one of the following exclusion criteria:
- No discussion of EHR data.
- Only discussion of user interface modifications to the EHR for data already present.
- Only discussion of derived data that, for efficiency reasons, should remain derived rather than stored (e.g., data subject to constant change).
The search pattern began with the broad search term “(electronic health record) OR (electronic medical record) OR (electronic health records) OR (electronic medical records)” to initially capture a wide variety of articles for review. One reviewer (T.I.K.) surveyed the literature from newest to oldest three times to locate articles matching the inclusion criteria. In each iteration, the reviewer evaluated all the articles listed in [Table 1] first by title, then by abstract if the title did not clearly include or exclude the article, and finally by full text if necessary. Following each pass through the literature, all authors served as evaluators, reviewing a subset of the articles selected as meeting the inclusion criteria, the categories of data content created, and the placement of articles into each of the categories. Disagreements were resolved through discussion to reach a consensus.
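The search described above can be illustrated with a minimal sketch that assembles a PubMed query string from the base term, a journal filter, and a publication-date range. The helper name `build_query`, the use of the `[Journal]` and `[dp]` field tags, the journal title abbreviations, and the exact date boundaries are assumptions made for illustration; the article does not state the literal filter syntax that was used, and the code only builds a string without contacting PubMed.

```python
# Hypothetical helper: builds a query string only; it does not call PubMed.
BASE_TERM = (
    "(electronic health record) OR (electronic medical record) OR "
    "(electronic health records) OR (electronic medical records)"
)


def build_query(base_term, journals, start, end):
    """Combine the base term with a journal filter and a date range.

    journals: journal titles or title abbreviations (assumed inputs).
    start, end: dates as YYYY/MM/DD strings (assumed boundaries).
    """
    # One "[Journal]" clause per journal, joined with OR.
    journal_filter = " OR ".join(f'"{name}"[Journal]' for name in journals)
    # Publication-date range written with the [dp] field tag.
    date_filter = f'("{start}"[dp] : "{end}"[dp])'
    return f"({base_term}) AND ({journal_filter}) AND {date_filter}"


# Example with two of the seven journals (abbreviations are assumptions).
example_query = build_query(
    BASE_TERM,
    ["J Am Med Inform Assoc", "Methods Inf Med"],
    "2010/01/01",
    "2016/03/31",
)
print(example_query)
```

Keeping the base term, journal list, and date range as separate inputs makes it easy to rerun the same search when articles are added to the literature between passes.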
Table 1
Results from the iterative review process
| Evaluation step | Articles reviewed | Articles categorized | IRR | Categories postevaluation |
| --- | --- | --- | --- | --- |
| Evaluation 1 | 655 | 71 | 0.4401 | 8 |
| Evaluation 2 | 1,062 | 153 | 0.5567 | 8 |
| Evaluation 3 | 1,215 | 165 | 0.6864 | 8 |
Notes: Each evaluation step indicates the number of articles that were reviewed, the number that were put into categories of data content for expanding the EHR, the inter-rater reliability (IRR) after all evaluators reviewed a sample of 30 articles, and the final categories after each evaluation.
Analysis
We used consensus as the primary mechanism for moving forward after each standardized iteration through the literature (described below). However, we also calculated an estimated inter-rater reliability (IRR) after each one. As each evaluator was allowed to classify an article into multiple categories, Kraemer's modified kappa coefficient was used.[11] Briefly, for each iteration, we chose a random subset of 30 articles as a representative yet efficient sample from the total articles reviewed per iteration ([Table 1]). Each evaluator placed the articles into the categories created for the current iteration or chose to create additional ones. The categories were consolidated based on similarity and Kraemer's modified kappa coefficient was calculated by first computing Fleiss's kappa for multiple categories and raters[12] and adding Kraemer's correction for each rater potentially choosing multiple categories.[11]
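As a rough illustration of the first step of this calculation, the sketch below computes Fleiss's kappa from a table of rating counts. It assumes each rater assigns exactly one category per article; Kraemer's correction for raters choosing multiple categories is not implemented here. The function name `fleiss_kappa` and the toy ratings are hypothetical and are not the study's data.

```python
def fleiss_kappa(counts):
    """Fleiss's kappa for a table of rating counts.

    counts[i][j] is the number of raters who placed article i in category j.
    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each article needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement for each article, then averaged over articles.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Agreement expected by chance, from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    # Undefined (division by zero) if every rating falls in one category.
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example (made up): 4 articles, 3 raters, 3 categories.
toy_counts = [
    [3, 0, 0],
    [0, 3, 0],
    [1, 2, 0],
    [0, 1, 2],
]
print(round(fleiss_kappa(toy_counts), 4))
```

Working from a count table rather than per-rater labels keeps the computation independent of which evaluator made which rating, which is all that Fleiss's kappa requires.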
Article Classification
Our search strategy's goal was to characterize potential areas for expanding the current data content of the EHR. Our three passes through the literature included several article classification steps, followed by evaluation steps to achieve consensus on the categories of data content and the classification of articles into those categories (see [Fig. 1]). We used Zotero 4.0 to manage categorization and access to the articles throughout each pass.
Fig. 1 Literature search and article classification strategy for identifying potential areas of expansion for electronic health record data content. Articles returned by the search term “(electronic health record) OR (electronic medical record) OR (electronic health records) OR (electronic medical records)” were filtered by journal and the remaining articles matching the inclusion criteria were placed into general categories ①. A sample of 30 articles and their respective categories from this first pass through the literature were then evaluated for the appropriateness of the articles, the appropriateness of the categories, and the relevance of the articles to the categories (①). A second version of categories was then created and the process was repeated two more times (② and ③) using the previous set of categories from each pass, yielding a final version of categories.
During the first pass through the literature, the reviewer (T.I.K.) created general categories for the articles that met the inclusion criteria. This pass produced the first version of categories to be used for the remaining literature searches. Following review of a set number of articles, we selected a subset of 30 articles for all evaluators to review and independently classify into the list of categories created. During this evaluation, the evaluators reviewed the appropriateness of the articles based on the inclusion criteria, the appropriateness of the categories created and their respective definitions, and the classification of each article. After the evaluation, the evaluators discussed the decisions made regarding each of these topics and made changes based on consensus.
The second pass through the literature included all articles in the chosen time frame. The reviewer classified the remaining articles according to the second version of the categories (see [Fig. 1]). We then chose a second sample of 30 articles for all evaluators to review and classify using the revised set of categories. Again, we resolved disagreements regarding categories, category definitions, and article classification through discussion and created a third version of categories.
The reviewer then performed the third pass through all articles and classified them according to the third version of the categories. We performed an evaluation of 30 articles and discussed the previously mentioned topics to resolve disagreements by consensus.
Results
Overview
Our preliminary search for articles published between January 2010 and March 2016 retrieved 26,031 citations. Using the journals identified in the preliminary search, the standardized search returned 1,215 articles that were then reviewed based on the inclusion and exclusion criteria. The method of consensus used to select articles for each iteration then resulted in a different number of articles selected from the 1,215 articles per iteration as discussed below. Additionally, each iteration, except the final, reviewed only a subset of the 1,215 articles to maintain consensus on category creation and article selection (see [Table 1]).
For each iteration of the categorization process (see [Table 1]), we reviewed a set number of articles for categorization. The first evaluation did not cover the entire time span from January 2010 to March 2016, resulting in fewer articles. Additionally, several more articles in the selected journals were added to the literature after evaluation 2. The IRR increased with each iteration, but the number of categories remained constant (see [Table 1]). While the number of categories did not change, we made modifications to the names of the categories and their definitions. For example, after the second round of classification, the authors agreed to change the category “Drug Monitoring” to “Medication List Data Capture” and also changed the definition from:
- Articles describing a need for or the creation of an algorithm or data model to monitor patient use of drugs whether through abuse, adherence (medication compliance), or incidental use.
to:
- More robust medication data storage (e.g., medications prescribed by other hospitals, medication and/or illicit drug abuse information), including additional drug metadata (e.g., adherence to medication schedule) that would allow clinicians to easily determine a patient's medication status along with storage of patient medication information in medication list rather than free text.
The categories in [Table 2] represent the final consensus categories and definitions chosen by the evaluators after iteratively classifying the articles. We believe that these categories represent some of the major concerns raised in the literature regarding shortcomings of the EHR to provide data for reusability.
Table 2
Number of articles per category after the final iterative review through the literature
| Category | Description of a need for or the creation of an algorithm to detect or a data model for… | Number of articles (citations) |
| --- | --- | --- |
| Adverse events | The potential for or the occurrence of unexpected and/or undesirable medical events such as drug allergies, drug side effects, falls, unexpected diseases, or other treatment-related injury | 22 [13]–[18], [82]–[97] |
| Clinician cognitive processes | The clinician's reasons for decisions made in the EHR regarding patient care, including alert overrides and handoffs | 12 [19]–[23], [98]–[104] |
| Data standards creation and data communication | Storage of data for medical fields or aspects of medical fields in a standard medical format (e.g., HL7, C-CDA, or an author-specific format) or mapping of data models of commonly used resources (e.g., Web sites or apps) to standard medical data formats for the purpose of EHR interoperability among other EHRs and external applications | 29 [24]–[34], [105]–[122] |
| Genomics | Patient genetic information including WGS, whole-exome, SNP, and other genetic data from other tests not listed. The genetic data can be utilized for the purpose of diagnosis, prescription (pharmacogenomic information), or other medically relevant purposes | 12 [35]–[41], [123]–[127] |
| Medication list data capture | More robust medication data storage (e.g., medications prescribed by other hospitals, medication and/or illicit drug abuse information), including additional drug metadata (e.g., adherence to medication schedule) that would allow clinicians to easily determine patients' medication status along with storage of patient medication information in medication list rather than free text | 14 [42]–[54], [128] |
| Patient preferences | Storage of patient's desires for treatment, therapy, or lack thereof for health events such as end-of-life care, diseases, or AEs | 2 [57], [129] |
| Patient-reported data | An outcome of a health event (e.g., disease, risk factor) or therapy (medication schedule, treatment plan) that is reported by and directly relatable to the patient. Quantification is sometimes done through the abstract score of quality of life. Patient-reported data may not be correlated with medically defined outcomes (increased FEV1 in COPD patients does not always result in improved QOL for a patient) | 13 [58]–[64], [130]–[135] |
| Phenotyping | Identifying a specific, medically relevant, physical characteristic (e.g., disease state, current treatment, or physical trait) by utilizing the presence of clinical data in the medical record (e.g., laboratory test results, clinical notes analyzed through NLP, or physical exam findings) | 61 [13]–[18], [20], [24], [25], [31], [32], [35], [41]–[43], [45], [46], [60], [64], [66], [68]–[187] |
Note: The categories and their definition of data content described in the literature that could be used to expand the EHR.
[Table 3] presents an overview of the classification process and shows an example of how an article was grouped into each category. The information is presented for the final iteration.
Table 3
Example articles from each category and the description of their categorization process
| Category | Title of article in category | Description of categorization |
| --- | --- | --- |
| Adverse events | A long-term follow-up evaluation of EHR prescribing safety | Article discusses the analysis of prescription error rates when transitioning between EHRs. Information retrieval for this study required data derivation from a chart review |
| Clinician cognitive processes | A novel use of the discrete templated notes within an EHR software to monitor resident supervision | This article discusses the specific documentation of resident procedures outside formal procedures allowing monitoring of resident training and potentially cognitive reasoning behind procedures |
| Data standards creation and data communication | A methodology for a minimum dataset for rare diseases to support national centers of excellence for health care and research | Specifically discusses standard data elements that could be used for rare diseases for epidemiology studies |
| Genomics | Creating a scalable clinical pharmacogenomics service with automated interpretation and medical record result integration—experience from a pediatric tertiary care facility | Discusses a pharmacogenomics platform that focuses on harvesting this type of data for CDS and reporting. The article specifically deals with genomic data for CDS |
| Medication list data capture | An EHR-driven algorithm to identify incident antidepressant medication users | Discusses the design of an algorithm for derivation of antidepressant users from the EHR data. Indicates missing information on patient medication lists |
| Patient preferences | An information model for automated assessment of concordance between advance care preferences and care delivered near the end of life | Discusses storage of advance care preference (a patient preference) information in the EHR in an easier-to-retrieve format |
| Patient-reported data | Assessing older adults' perceptions of sensor data and designing visual displays for ambient environments | Studies the perceptions of elderly patients toward the use of in-home sensors for the collection of medical data (considered patient-reported because the sensors collect data directly from the patient). Addresses the collection of this information |
| Phenotyping | A collaborative approach to developing an EHR-phenotyping algorithm for drug-induced liver injury | Discusses the creation of a phenotyping algorithm designed to identify patients in the EHR with drug-induced liver injury |
Abbreviations: CDS, clinical decision support; EHR, electronic health record.
Note: Each article's content is described in relation to the reason that it fits in the category (i.e., why it was placed in the category listed).
Categories for EHR Data Content Expansion
Adverse Events
An adverse event (AE) in medicine is any undesirable event that occurs during or as the result of treatment, including falls, adverse drug events (ADEs), and food allergies. While these events may be recorded in the EHR, this information is typically not stored in a structured location and detection of these events post occurrence typically requires searching both structured and unstructured data. Although some studies employ techniques such as rule-based detection[13] and pattern matching in free text[14] as in phenotyping, many more studies utilize a combination of natural language processing (NLP) and machine learning (ML)[15][16][17][18] to search the free text. One study specifically pointed out that while pattern-matching methods could easily extract common side effects of medications, ML (decision trees in this specific study) was more useful for extracting more complex symptomatologies as ADEs.[14] Once extracted, the information has multiple downstream applications including research, reporting, quality improvement, and prevention.
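To make the pattern-matching idea concrete, the following is a minimal sketch of how a regular expression might flag a side effect linked to a drug in a free-text note. The symptom list, trigger phrases, and function name `find_ade_mentions` are illustrative assumptions and do not reproduce the rules used in any of the cited studies.

```python
import re

# Illustrative pattern only: a short symptom list followed, within the same
# sentence, by a causal trigger phrase and then a single-word drug name.
ADE_PATTERN = re.compile(
    r"\b(?P<effect>rash|nausea|dizziness|hives)\b"
    r"[^.]{0,60}?"
    r"\b(?:after|following|secondary to|due to)\b\s+"
    r"(?:starting\s+|taking\s+)?"
    r"(?P<drug>[A-Za-z]+)",
    re.IGNORECASE,
)


def find_ade_mentions(note):
    """Return (drug, effect) pairs found in a clinical note (lowercased)."""
    return [
        (match.group("drug").lower(), match.group("effect").lower())
        for match in ADE_PATTERN.finditer(note)
    ]


# Made-up note text for demonstration.
note = "Patient developed a rash after starting amoxicillin. No fever."
print(find_ade_mentions(note))
```

A pattern like this handles only simple, common phrasings and ignores negation (e.g., "denies rash after ..."), which is consistent with the observation above that more complex symptom descriptions tend to call for ML-based approaches.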
Clinician Cognitive Processes
The clinician's cognitive process during patient care is the reasoning behind decisions made in diagnosing and treating patients. While the category is broad, studies focus on two primary aspects: alert overrides and patient handoffs. Customizing the list of reasons for overriding an EHR alert so that it is more relevant to the clinician during patient care has been shown to improve the appropriateness of the reason chosen.[19] Other studies have repeatedly shown that structured data elements reflecting clinical reasoning are important during the handoff process, suggesting that clinical reasoning is vital during communication.[20][21] Storing cognitive maps that diagram the thought process during patient care is one effective mechanism of storing this reasoning for use during handoffs.[22] Additionally, one study attempted to manage the large amount of information to be analyzed during transfer by studying the effectiveness of a handoff tool that automatically imported relevant information.[23]
Data Standards Creation and Data Communication
This category focuses on the ability to unify the data content across multiple EHRs. Many studies discuss two similar methodologies of unifying the data: (1) standardizing the storage of the data in a component of the EHR itself[24][25][26] or (2) extracting the data from the EHR in a standardized format agnostic of the underlying data structure.[27][28][29][30] While a handful of these studies, specifically those focused on data communication, attempt to solve the issue of general data unification, other studies attempt to solve only a smaller area, such as oncology,[31] rare diseases,[25] or family health history.[32] Multiple standards have been employed, including HL7, HL7's Fast Healthcare Interoperability Resources (FHIR), the Consolidated Clinical Document Architecture (C-CDA), and the Web Ontology Language (OWL), with varying success. However, the flexibility provided by many of these standards prevents complete interoperability between systems.
Other studies have focused on some of the difficulties in the data modeling process for purposes of standardization. One study evaluated several of the tools used for modeling clinic workflow and suggested that the tools are not mature enough to appropriately handle all modeling requirements.[33] Another study described an application that provided clinicians with feedback regarding data quality and similarity of reporting across multiple hospitals in the Netherlands.[34] While this tool improved data quality, the provision of relevant diagnostic codes still varied greatly, from 30 to 100%.
Genomics
Genomic information can include anything from a basic genetic test designed to identify a single-nucleotide polymorphism (SNP) to whole-genome sequencing (WGS). Overall, several studies have highlighted a need to improve the infrastructure used to store genetic information and the need to implement standard ontologies and semantics regardless of a genetic test's originating laboratory.[35][36][37] The results of a genetic test can be used for CDS, provided that metadata regarding actionability of the test results are available.[38] Pharmacogenetic data (drug-related genetic data) have many similarities to their parent category, genetic data, including the need for standardization[39] and metadata for CDS.[40] However, the relationship to medications and prescriptions opens the potential for creating an association between a prescribed medication and a genetic test, thereby annotating the reasoning behind the medication choice or dosage.[41]
Medication List Data Capture
Medication lists are a common component of any EHR today and maintain an active record of a patient's medications, both current and previous. Several studies focused on improving the accuracy of this list utilize both NLP and ML to retrieve additional information from the free text. One study showed the effectiveness of these techniques in monitoring opioid use in patients who did not have opioid use frequency recorded in their chart.[42] Another used NLP to detect antidepressant medication use in patients.[43] Other studies have focused on more general omissions in a patient's medication list, with one predicting missing medications using an ML algorithm[44] and others highlighting missing medications at discharge.[45][46]
A related area of study is the retrieval of both the reason for a medication's prescription and the duration of taking a medication. In the 2009 i2b2 challenge, methods for identifying the reasons for and duration of prescriptions employed pattern matching (such as regex) combined with NLP, heuristics,[47][48] and ML.[49] All studies concluded that this type of information was the most difficult to extract.[47][48][49][50][51][52][53][54]
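The pattern-matching component mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not a reconstruction of any challenge entry: the two regular expressions, the function name `extract_duration_and_reason`, and the example sentence are all hypothetical, and the patterns cover only the simplest "for N days" and "for <condition>" phrasings.

```python
import re

# Duration written as "for <number> day(s)/week(s)/month(s)".
DURATION = re.compile(
    r"\bfor\s+(?P<count>\d+)\s+(?P<unit>day|week|month)s?\b",
    re.IGNORECASE,
)
# Reason written as "for <words>" ending at punctuation or end of text.
# The first character must be a letter, so "for 10 days" is not a reason.
REASON = re.compile(
    r"\bfor\s+(?P<reason>[a-z][a-z ]*?)(?=[.,;]|$)",
    re.IGNORECASE,
)


def extract_duration_and_reason(sentence):
    """Return the first duration and reason found, or None for each."""
    duration = DURATION.search(sentence)
    reason = REASON.search(sentence)
    return {
        "duration": (
            (int(duration.group("count")), duration.group("unit").lower())
            if duration
            else None
        ),
        "reason": reason.group("reason").strip().lower() if reason else None,
    }


# Made-up sentence for demonstration.
print(extract_duration_and_reason(
    "Amoxicillin 500 mg three times daily for 10 days for sinusitis."
))
```

Even in this toy form, the reason pattern is far more fragile than the duration pattern, since reasons can be phrased in many ways and are often stated in a different sentence from the drug, which may help explain why the studies found this information the hardest to extract.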
Patient Preferences
While patient preferences for end-of-life care can be found in many EHRs, locating these preferences in the system can be challenging.[55] The difficulty in locating this information adds to the lack of a mechanism for determining if the patient's preferences have been met.[56] One study investigated the use of 15 end-of-life data elements that could easily determine the status of end-of-life care with regard to patient preferences.[56]
Patient preferences also extend to other contexts, including the ability to send a reason for a nurse's call. One study explored the ability to transfer contextual information regarding patients' desires along with the nurse's call that allowed the nurses to more appropriately respond to the patient's request.[57]
Patient-Reported Data
Patient-reported data (PRD) fall into two large subcategories: (1) data that are directly reported by a patient to a clinician as something that is relevant to his or her health and (2) sensor data that are passively collected from the patient through devices such as smartphones or home-based devices. Regarding directly reported data, the literature currently suggests that PRD be collected and stored in a standardized manner. Many studies provide possible discrete data elements to be used, including elements of a personal profile, goals for overall health and clinic visits, and quality of life.[58] Other studies focus on incorporating social and behavioral determinants of health as facilitators for patient care.[59][60]
Several studies have also shown the feasibility of collecting PRD from sensor data. By tying a continuous glucose monitor to a smartphone and eventually to the EHR, one group showed that glucose levels could be passively captured and stored in a patient's chart.[61] Other studies have experimented with using smart home sensors for monitoring elderly patients' health.[62][63][64]
Phenotyping
Phenotyping is the process of identifying cohorts of patients with desired characteristics (typically disease states). This output may then be used for downstream purposes in research. The most common approaches to phenotyping have been rule based.[65] These are the simplest methods of this type, potentially utilizing only the diagnosis codes and problem lists in the EHR. However, because diagnostic codes may be missing from a patient's record or diagnoses may be absent from the problem list, rule-based algorithms often include other information, such as medication lists, laboratory values, and chart reviews, to increase the accuracy of phenotyping patients.[66][67] Once constructed, the rule-based algorithms are validated and then may be submitted to one of several public repositories for general use such as PheKB eMERGE (Phenotype Knowledge Base: Electronic Medical Records and Genomics), OMOP (Observational Medical Outcomes Partnership), and others.
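A rule of the kind described above might look like the following minimal sketch, which combines a diagnosis code with medication and laboratory evidence. The record structure, the function name `has_t2d_phenotype`, and the specific code, drug, and threshold are hypothetical choices for illustration only; this is not a validated algorithm and is not taken from PheKB, OMOP, or any cited study.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Simplified, hypothetical view of the structured data for one patient."""
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)
    medications: set = field(default_factory=set)
    labs: dict = field(default_factory=dict)  # lab name -> most recent value


# Illustrative rule inputs for a type 2 diabetes phenotype (assumptions).
T2D_CODES = {"E11.9"}        # example ICD-10-CM diagnosis code
T2D_MEDS = {"metformin"}     # example medication
HBA1C_THRESHOLD = 6.5        # percent; example laboratory cutoff


def has_t2d_phenotype(record):
    """Apply the rule: a diagnosis code, or medication plus lab evidence."""
    has_code = bool(record.diagnosis_codes & T2D_CODES)
    on_medication = bool({m.lower() for m in record.medications} & T2D_MEDS)
    high_lab = record.labs.get("hba1c", 0.0) >= HBA1C_THRESHOLD
    # Medication and lab evidence together compensate for a missing code.
    return has_code or (on_medication and high_lab)


# Made-up patient whose record lacks the diagnosis code.
patient = PatientRecord(
    patient_id="example-1",
    medications={"Metformin"},
    labs={"hba1c": 7.1},
)
print(has_t2d_phenotype(patient))
```

The second branch of the rule shows why additional data sources matter: a patient with no recorded diagnosis code can still be identified when medication and laboratory evidence agree.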
Rule-based techniques have been frequently used in the past. However, the methods needed to construct, validate, and implement them are time consuming, especially if the rule requires the interpretation of data found in the free text of patients' charts.[68][69] As a result, alternative methodologies have been employed to speed up the phenotyping process, including the use of NLP and ML. NLP, which uses linguistic knowledge to allow computers to gather knowledge from language (speech, text, etc.), has been used with rule-based algorithms to assist with mining free text[70][71] and in conjunction with ML, a probabilistic modeling technique, to identify patients using large feature sets.[72][73]
Because all that is needed is labeled data, both techniques decrease the time needed to validate and implement the algorithm. Unfortunately, the initial construction has a large upfront time cost due to the expense of labeling data to train an ML algorithm. However, there have been recent attempts to decrease this time cost by automating the labeling process.[72][74]
Of final note is the accuracy of phenotyping algorithms. All of the aforementioned techniques utilize inferential techniques to phenotype patient cohorts due to the imperfect capturing of patient phenotypes in the EHR. As a result, those wishing to use the phenotyping results for a downstream application must take into account these inaccuracies when deciding on the phenotyping algorithm to use.[75]
Discussion
Clinical informatics research focuses on creating new methods and systems that depend on diverse, high-quality EHR data, which are not always present. The purpose of our study was to identify categories of data content that would promote the reusability of data in the EHR for clinical informatics research. It was not our intent to create an exhaustive list of all data that should be considered for addition as EHRs undergo natural evolution, but rather to focus on those data for which reuse applications in informatics are ready and waiting.
One proposed goal for advancement of medical systems is to create the learning health system.[76][77] This system would incorporate data from patients, clinicians, laboratories, and many other information sources to translate information to knowledge. Part of the translation process will require the appropriate data to be available in a reusable format. The eight categories we found should be considered a starting point for revising the next generation of EHRs, built with the ability to allow data reuse for clinical informatics research and the advancement of the learning health system.
The categories are not intended to be a classification system for published articles; so, we were not concerned about the occasional differences in how an individual article was classified. Although the IRR in classification increased as we revised our categories with each iteration in the evaluation, it is more important to note that all evaluators ultimately agreed on the definitions of these categories and that they were sufficient to classify all articles that met our inclusion criteria.
The presence of each of these categories in the literature unifies them as pieces of the larger problem of EHR data reuse for clinical informatics research. However, while each of these categories makes up a component of the discussion, they also vary in the scope of data content that they cover. For example, Medication List Data Capture focuses specifically on the medications and associated metadata captured by the EHR, whereas phenotyping covers a broader aspect of data content. This discrepancy is expected due to the varying levels at which EHR data can be reused. Medications cover a focused area of data content, yet represent a major aspect of patient treatment. In contrast, phenotyping, which focuses on identifying disease states, employs a large portion of the EHR's data and is naturally larger than other categories.
Additionally, some categories represent novel information that is not captured by the EHR, such as explicit expression of clinicians' cognitive processes. If this information is present in the EHR, it is typically found only in free-text narrative documents such as clinical notes and hand-off documentation. Other categories focus on information captured in a form that needs revision. For example, phenotyping information is abundant, yet effective use of this information requires complex predictive algorithms that can never be 100% accurate.
It is important to note that the categories proposed currently have varying degrees of actionability in the clinic. For example, categories such as Adverse Events, Medication List Data Capture, and Patient Preferences can typically be acted on immediately with current medical knowledge and standards of practice. Other categories, such as PRD and Genomics, may still require more research before the data can be used effectively in the clinic. That need for further research, however, highlights the importance of making these types of data, where they are already in the EHR, reusable.
While there are many ways of addressing the capture or storage structure of the data content in each category, it is our general observation that most of the projects in each of the categories described in our literature set would benefit from data storage that uses a standardized terminology. However, not all projects were so bold as to request this explicitly. Questions remain on how to acquire these data and how to store them appropriately. The current predominant method for accomplishing this is to require clinicians (typically nurses and physicians) to take on new structured data entry tasks (diagnoses, problem lists, medications, allergies, etc.), often replicating efforts they had already expended in writing their notes and reports. Enthusiasm for such additional responsibilities is often lacking.
Several solutions for each category might exist (see [Table 4]). Solutions that might alleviate the need for excessive data entry will require advances in clinical informatics research and possibly policy changes. For example, most of the categories would benefit from a unified medical record, whether the data are centralized at a single institution or distributed at the patient level.[78] This system would unify and relate medical information across all patients, regardless of the institution providing care. Such a system would require the use of the standardized, research-controlled terminology mentioned earlier and would also necessitate the standardization of a data model for storage. Both requirements, however, would also enforce interoperability. Additionally, using the categories presented in this study as a starting point might guide the creation of an information model that would allow data reusability for researchers with permissions for access. This system would also prevent duplication of data when patients visit multiple health care institutions. Coupled with continued development of data communication standards, such as HL7, a unified medical record would allow data transfer rather than requiring data entry.
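As an illustration of how a shared terminology could prevent duplication across institutions, the following is a minimal sketch only. The class `MedicationEntry`, the function `unify`, the institution names, and the codes `MED-001` and `MED-002` are all hypothetical placeholders introduced for this example; they do not come from any real terminology or from the studies reviewed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MedicationEntry:
    patient_id: str
    code: str         # hypothetical controlled-terminology code (placeholder)
    display: str      # the free-text label as entered locally
    institution: str  # where the entry was recorded


def unify(entries):
    """Merge entries that share (patient_id, code) into a single record.

    Each unified record keeps the first display label seen and the set of
    institutions that reported the same coded medication.
    """
    unified = {}
    for e in entries:
        key = (e.patient_id, e.code)
        record = unified.setdefault(
            key, {"display": e.display, "institutions": set()}
        )
        record["institutions"].add(e.institution)
    return unified


# Two institutions record the same medication with different local wording.
entries = [
    MedicationEntry("p1", "MED-001", "metformin 500 mg tablet", "Hospital A"),
    MedicationEntry("p1", "MED-001", "Metformin 500mg tab", "Clinic B"),
    MedicationEntry("p1", "MED-002", "lisinopril 10 mg tablet", "Clinic B"),
]
unified = unify(entries)
```

Because the two metformin entries share one code, they collapse into a single record that lists both institutions. Without a shared code, the differing display strings would have to be reconciled by inference.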
Table 4
Potential solutions to each of the categories of data content being discussed in the literature as missing from the EHR or needing revision
| Category | Solutions |
| --- | --- |
| Adverse events | Unified medical record; Voice recognition; NLP; Patient portals |
| Clinician cognitive processes | Voice recognition; NLP |
| Data standards creation and data communication | Unified medical record; HL7 development and standardization; Voice recognition; NLP |
| Genomics | Unified medical record; Automated laboratory data transfer |
| Medication list data capture | Unified medical record; Voice recognition; NLP |
| Patient preferences | Unified medical record; Voice recognition; NLP; Patient portals |
| Patient-reported data | Unified medical record; Voice recognition; NLP; Patient portals |
| Phenotyping | Unified medical record; Voice recognition; NLP |
Abbreviation: NLP, natural language processing.
Note: Some of the proposed solutions are currently implemented in some EHRs but are not developed enough to solve the issue. Others will require significant research and further development of the technologies. Still others might require policy changes.
Additionally, other solutions, such as voice recognition coupled with NLP, could lower some of the barriers perceived by those charged with the data entry. While these technologies would benefit multiple categories, the transfer and storage of clinician cognitive processes would benefit especially. For some categories, such as AEs, PRD, and patient preferences, information could be captured directly from the patient through patient portals, which are currently in use in limited circumstances, or monitoring devices such as wearables. However, for all data entry methods, barriers could be reduced by making clear the immediate and long-term advantages of new data capture. For clinicians, this might come in the form of intelligent decision support that automates workflow processes. For patients, it might be clear indications of the medical benefits and progress tracking over time that new data capture provides.
NLP and voice recognition form a solution for most of the categories that have been mentioned. The exception, genomics, can be addressed through automated laboratory data transfer. As previously discussed, these technologies might alleviate the burden of data entry. Additionally, they might provide an initial method of implementing a standard from free text rather than requiring rigid data entry. The key point is that both technologies would ease the strain of data entry, allowing data to be submitted more flexibly while potentially maintaining storage in a standardized way. These benefits would add to those already mentioned for the unified medical record, which would also benefit most of the categories through standardization.
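To make the idea of deriving standardized storage from free text concrete, here is a deliberately simple sketch. It uses plain dictionary lookup rather than a real NLP system, and the lexicon entries, the `SYM-` codes, and the function `extract_codes` are hypothetical placeholders introduced for this example.

```python
import re

# Hypothetical lexicon mapping surface phrases to placeholder codes.
# A real system would draw on a controlled terminology instead.
LEXICON = {
    "shortness of breath": "SYM-001",
    "chest pain": "SYM-002",
    "nausea": "SYM-003",
}


def extract_codes(note):
    """Return the sorted placeholder codes whose phrases appear in the note.

    The note is lowercased and runs of whitespace are collapsed so that
    minor formatting differences do not prevent a match.
    """
    text = re.sub(r"\s+", " ", note.lower())
    found = []
    for phrase, code in LEXICON.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            found.append(code)
    return sorted(found)


note = "Patient reports chest  pain and mild nausea since yesterday."
codes = extract_codes(note)
```

This sketch has no negation handling, so a note stating that the patient "denies chest pain" would still yield the chest pain code. That limitation is one reason such methods remain inferential and introduce error beyond that of data entry itself.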
Moving forward, it is important to recognize that modifications to the EHR to address the categories presented here will have a cost associated with them. These costs include developers' time, physicians' time, licensing, and others. The cost might depend on the current status of a category's development in the EHR. For example, genetic information might have a higher cost than other categories because its addition to the EHR is still in the beginning stages. Therefore, it will be important to justify costs carefully and to consider that continued development in one area (such as phenotyping) might assist in some way with others (most genetic information has phenotypic implications).
There have been other studies highlighting the shortcomings of the EHR and suggesting changes. For example, Liaw et al focus on the accuracy of the data present in the EHR, including the completeness of the data as one aspect of that metric.[79] Dixon et al suggest a general framework for addressing data quality in the EHR without discussing specific areas of the data content needing attention.[80] Finally, Cusack et al focus on general recommendations highlighting data collection with respect to patient care as needing the main focus.[81] Each of these studies highlights a different aspect of the shortcomings in the EHR. Our study focuses on specific data content areas that need attention for the clinical informatics researcher in the next-generation EHR. Specifically, we address data reusability in our study. We, therefore, believe that we have highlighted another important area for improvement moving forward.
There are a few limitations of our study; however, we do not believe these have affected our conclusions. Our literature search was limited to articles published in 2010 or later, because we sought categories describing the current state of the EHR and its shortcomings regarding data use for clinical informatics research. Surveying older literature might have identified categories that have already been resolved. Our broad, preliminary search leading to a focus on prominent informatics journals indexed in PubMed may have excluded some articles that may have introduced additional categories. However, we believe it is unlikely that a popular article would completely evade the mainstream literature. Finally, although there may be many other areas of data content that could be added to the EHR, we believe that the categories that we have identified focus on the major areas of discussion in the literature that surround the use of EHR data for clinical informatics research.
Conclusion
Despite 50 years of development, EHRs still remain inadequate for many intended tasks, including clinical informatics research. As the next generation of informaticians takes on the task of developing the next generation of EHRs, we recommend that their plans incorporate new data types and structures guided in part by the eight desirable categories we have distilled from current clinical informatics literature. Although creative approaches will be needed to accomplish this, many promising applications stand ready to exploit these data to improve the care of individual patients and, through a “learning health system,”[9] the health of humankind.
Clinical Relevance Statement
The reusability of electronic health record data provides clinical informatics researchers the ability to create innovative applications for clinical use. The revisions and additions to the EHR data content that we have discussed will streamline these innovations, providing faster development of these applications for use by clinicians. Additionally, many of the improvements discussed, such as genetic data content, would affect the ability of the clinician to store, retrieve, and interpret a patient's genetic information in a clinical context.
Multiple Choice Questions
1. Of the following, which solution for data storage would offer the most uniform structure for storage and retrieval of the medical information for both research and clinical practice?
A. Unified medical record
B. Universal medical record number
C. Data communication standards
D. Natural language processing
Correct Answer: The correct answer is A, unified medical record. A “Universal medical record number” would allow information to be linked across EHR systems that adhered to the universal medical record number; however, this would not ensure data storage or retrieval in each of these systems would be similar past this relationship.
Data communication standards may provide universal retrieval of data, but will not ensure universal storage. This answer, therefore, is also incorrect.
Finally, natural language processing is an information retrieval methodology that might be able to form a layer between data entry and retrieval to standardize data flow in either direction (storing free text as standardized structured information or retrieving free text as structured information). However, the best implementations of this method are inferential and will always have additional error beyond error in data entry and retrieval themselves.
2. Which of the following categories of information in the electronic health record has the greatest impact on natural language processing in terms of information retrieval?
Correct Answer: The correct answer is B, “Phenotyping.” Natural language processing (NLP) currently plays a major role in most techniques used to attribute phenotypes to patients because it allows free text to be searched in addition to structured data.
Patient preferences are typically stored in structured data in the electronic health record to allow easy retrieval. Additionally, there is not much research in this area to use NLP for retrieving this information from free text.
Data standards creation and data communication is, somewhat by definition, not intended to use NLP, and is therefore incorrect. While it might be possible to store and retrieve information in this manner for data communication purposes, the use of standards should remove the need for the use of NLP.
Finally, genomic data are currently not often stored in the EHR as either structured data or free text. Additionally, methods other than NLP are typically used for retrieval of structured information from the genetic data.