Keywords
Data mining - prediction models - clinical decision support - data reuse - practice management
1. Background and Significance
1.1 Scope of Problem
To achieve the “Triple Aim” of (a) better health outcomes, (b) better healthcare delivery, and (c) lower costs [[1]], managed care programs seek to improve interactions between informed, activated patients and prepared, proactive providers [[2]], including preventive care [[3]]. Unfortunately, healthcare often fails to provide effective coordination of care across a target population [[4], [5]]. When care coordinators do not know which of their patients are most “at risk” for increased healthcare needs, they typically allocate their time by responding to the patient in front of them at the moment [[6]].
Predictive analytics can be used to rapidly spot opportunities to improve care management [[7]]. Because 5% of patients incur 50% of healthcare expenses [[8]], population health managers need to focus preventive and longitudinal care on those patients who are at highest risk of increased utilization. This approach can facilitate the transition from traditional “reactive” models of medical care [[6]] to one of maintaining health and avoiding preventable conditions. Focusing proactive and preventive care on these high-risk patients directly addresses the Triple Aim by lowering costs and improving health outcomes, and indirectly may also improve healthcare delivery [[9]].
1.2 Limitations of Current Methods
Current models that predict health risks for community-dwelling older adults achieve discrimination of up to about 0.70, as measured by the c-statistic [[10]]. Because regression analysis and other traditional statistical methods are constrained by the limited number of attributes that can be used [[11]], most predictive algorithms have focused on specific conditions such as diabetes [[12]] or hypertension [[13]]. However, population health managers need predictive analytics that identify patients at increased risk for all-cause healthcare utilization.
Higher accuracies have been achieved by more specialized prediction models, such as one for imaging utilization [[14]]. Other investigators [[15]–[17]] have built successful models on the basis of demographic and utilization characteristics using a limited subset of clinical data. However, these strategies may not fully exploit the highly detailed clinical history available in electronic health records (EHR). Other studies [[18]] have used rich clinical data to identify practice patterns without explicitly predicting outcomes. Data mining algorithms permit the development of models that use the rich data available in the EHR [[19]], without limiting predictions to specific chronic conditions or high-level summaries (such as restricted EHR data).
2. Objective
The objective was to demonstrate the usefulness of unrestricted EHR data for predictive analytics in managed healthcare.
3. Methods
3.1 Population
LIGHT2 (Leveraging Information Technology to Guide Hi-Tech and Hi-Touch Care) was a Health Care Innovation Award from the Centers for Medicare and Medicaid Services to examine the use of advanced health information technology and care coordination in a managed population. The LIGHT2 program recruited primary care patients at the University of Missouri Health System who were already enrolled in Medicare or Medicaid. The cohort comprised 9,581 patients who were enrolled in LIGHT2 on or before July 1, 2013.
3.2 Data Source
We retrieved all patient diagnoses, prescriptions, and other clinical attributes from the EHR of the University of Missouri Health System as maintained by clinicians during the fiscal years ending in 2012 and 2013.
3.3 Data Selection
We selected hospital and clinic charges as the outcome of interest for this study because they are easily measured, continuously distributed, and can be compared comprehensibly between diverse patients or populations. We selected the 5% of patients (n=479) with the highest health system charges during FY2013 (the fiscal year ending on March 31, 2013). The FY2013 charges for this top 5% ranged from $94,896 to $3,029,833; and the top 5% accounted for 49.7% of charges incurred by the entire LIGHT2 cohort for that year (►[Figure 1]). The FY2012 charges for the top 5% of patients in that fiscal year ranged from $63,967 to $4,288,603, which we used to define the independent variable of high prior-year cost.
Fig. 1 Logarithmic distribution of FY2013 charges by patient
Mining data to contrast two or more conditions, or contrast mining [[20]], requires comparison groups from comparable populations. Other data reduction techniques such as principal component analysis are less than ideal for several reasons: they do not make explicit use of the known-groups nature of the problem, are not well suited to binary data, and would be computationally impractical with the large number of characteristics considered here. Furthermore, both principal component analysis and factor analysis aim at finding linear combinations of features as opposed to identifying individual features that best discriminate between groups. For this application of contrast mining, we used multiple comparison groups in order to test the flexibility and robustness of the methodology under varying input conditions. We first excluded patients with zero healthcare system charges on the grounds that individuals with no recent hospital or outpatient visits may not have current medical histories in the healthcare system EHR. The comparison groups therefore comprised the patients with the lowest non-zero 5%, 10%, 15%, 20%, 30%, 40%, and 50% of FY2013 charges (►[Table 1]).
Table 1
Comparison groups from patients with lowest non-zero charges in FY2013
| Size of comparison group | 5% | 10% | 15% | 20% | 30% | 40% | 50% |
|---|---|---|---|---|---|---|---|
| Lowest charge in range | $27 | $27 | $27 | $27 | $27 | $27 | $27 |
| Highest charge in range | $470 | $853 | $1,221 | $1,621 | $2,646 | $4,300 | $6,963 |
| Percentage of all charges | <0.1% | 0.2% | 0.5% | 0.8% | 1.8% | 3.5% | 6.1% |
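The group selection described above can be sketched as follows. This is a minimal illustration, not the study's code: the function name `split_cohort`, the patient identifiers, and the choice to size each comparison group as a fraction of the whole cohort (which is consistent with the 479 + 479 = 958 patients mentioned for the 5% contrast) are assumptions.

```python
def split_cohort(charges, top_frac=0.05,
                 low_fracs=(0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50)):
    """Return (high_cost_ids, {fraction: low_cost_ids}).

    charges maps a hypothetical patient id to total charges for the year.
    Group sizes are taken as fractions of the whole cohort (an assumption).
    """
    n = len(charges)
    # Highest-charge patients first; the top slice is the high-utilization group.
    by_charge_desc = sorted(charges, key=charges.get, reverse=True)
    high = set(by_charge_desc[:round(n * top_frac)])

    # Zero-charge patients are excluded before forming comparison groups.
    nonzero_asc = sorted(
        (p for p in charges if charges[p] > 0 and p not in high),
        key=charges.get,
    )
    low_groups = {f: set(nonzero_asc[:round(n * f)]) for f in low_fracs}
    return high, low_groups
```

Each comparison group is nested inside the next larger one, so the contrast becomes progressively less extreme as the fraction grows.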
3.4 Data Projection
The EHR records at the time of data collection contained a mixture of diagnosis codes from the International Classification of Diseases, 9th Revision (ICD-9) nomenclature and the Systematized Nomenclature of Medicine (SNOMED). The patient records selected for contrast mining contained 3,998 unique SNOMED codes and 3,615 unique ICD-9 codes. These records also contained 10,725 unique medication prescriptions and nine demographic attributes (i.e., age, gender, race/ethnicity, marital status, English fluency, Medicaid coverage, high prior-year (FY2012) costs, body mass index (BMI), and history of adherence to prescription instructions).
We also categorized the 3,615 ICD-9 codes in the dataset into 612 diagnosis-related groups (DRG), and the 10,725 prescriptions into 55 higher-level therapeutic classes. All 19,014 attributes were collected for the selected patients at the end of FY2012, prior to the FY2013 outcome of interest (►[Figure 2]).
Fig. 2 Data selection, projection, and mining
3.5 Data Mining
In order to process contrast mining algorithms, we built a distributed association-rule mining (ARM) tool suite on Apache Spark in HDFS (Hadoop Distributed File System) [[21]]. Because ARM requires binary values, we transformed all variables (i.e., attributes) to true-or-false flags using a PHP script. Because ARM analyses identify the presence of attributes in each combination, but cannot identify the absence of any attribute or combination of attributes, flags must be coded for all possible categorical values in association rule mining (even when the categories are mutually exclusive), rather than the n-1 categories used in traditional regression. For example, we transformed each categorical variable (i.e., race/ethnicity and marital-status) to a set of binary values: (a) “race/ethnicity=white-non-Hispanic or not, =Hispanic or not, =African-American or not, =Asian or not, =Native-American or not, =other or not, =unknown or not,” and (b) “marital-status=single or not, =married or not, =divorced or not, =widowed or not.” We transformed the two continuous variables (i.e., age and BMI) to binary flags after transformation to standard [[22]] categories: (a) “age=18–24 or not, =25–44 or not, =45–64 or not, =65–84 or not, =85-or-older or not,” and (b) “BMI=less-than-18.5 or not, =18.5–24.9 or not, =25–29.9 or not, =30-or-higher or not.” For each of these sets of binary values created from categorical variables, only one is true for any given patient. For example, if “marital-status=married” is true, then “marital-status=single,” “=divorced,” and “=widowed” are false.
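The flag coding described above can be illustrated with a short sketch. It is written in Python rather than the PHP used in the study, and the dictionary keys, band labels, and function names are hypothetical. Every category gets its own flag (not n-1), and only the true flags enter the patient's transaction.

```python
# Category lists and band boundaries follow the text; names are illustrative.
CATEGORICAL_VALUES = {
    "race/ethnicity": ["white-non-Hispanic", "Hispanic", "African-American",
                       "Asian", "Native-American", "other", "unknown"],
    "marital-status": ["single", "married", "divorced", "widowed"],
}
AGE_BANDS = [("18-24", 18, 24), ("25-44", 25, 44), ("45-64", 45, 64),
             ("65-84", 65, 84), ("85-or-older", 85, 200)]
BMI_BANDS = [("less-than-18.5", 0.0, 18.5), ("18.5-24.9", 18.5, 25.0),
             ("25-29.9", 25.0, 30.0), ("30-or-higher", 30.0, float("inf"))]


def to_flags(patient):
    """Expand one patient record into true-or-false flags, one per category."""
    flags = {}
    for variable, values in CATEGORICAL_VALUES.items():
        for value in values:
            flags[f"{variable}={value}"] = patient[variable] == value
    for label, lo, hi in AGE_BANDS:   # inclusive integer bands
        flags[f"age={label}"] = lo <= patient["age"] <= hi
    for label, lo, hi in BMI_BANDS:   # half-open [lo, hi) bands
        flags[f"BMI={label}"] = lo <= patient["BMI"] < hi
    return flags


def to_transaction(flags):
    """Keep only the attributes that are present, as ARM expects."""
    return {name for name, is_true in flags.items() if is_true}
```

For a married 70-year-old with a BMI of 31.2, exactly one marital-status flag, one age flag, and one BMI flag are true, which matches the mutual-exclusivity property noted above.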
We then discovered frequent attribute combinations using an “Apriori” algorithm [[23]] with a minimum support of 0.2 (i.e., excluding attribute combinations found in fewer than 20% of transactions, which for the 5% comparison group means fewer than 192 of 958 patients). We chose this parameter, which should identify 20% of 5% of the population or 1% overall, in order to strike a balance between the recognition of rare conditions in an intrinsically sparse dataset and the elimination of outliers that could misrepresent typical clinical histories. We limited results to attribute combinations that included the outcome of interest (i.e., FY2013 charges over $94,895 or not).
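To make the support threshold concrete, here is a small single-machine sketch of level-wise Apriori search in plain Python. It is not the distributed Spark tool suite used in the study; the function name and the toy item names in the usage note are assumptions. With 958 transactions, a minimum support of 0.2 corresponds to at least 192 patients (0.2 × 958 = 191.6).

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions containing every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent
```

Restricting results to combinations that include the outcome flag is then a simple filter, for example `{s: v for s, v in apriori(rows, 0.2).items() if "high-cost" in s}`, where `"high-cost"` is a placeholder name for the outcome item.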
3.6 Statistical Confirmation
In the second step, we dissected the attribute combinations found frequently (20% or more) in patients with high utilization and infrequently in patients with low utilization into individual attributes. Because some age categories were found infrequently in some comparison groups but not in others, “Age” was restored to a continuous integer variable; and because all patients were marked as either “female” or “male”, the “male” flag was dropped and all patients were marked as female or not. We then treated these contrasting attributes as hypotheses to be tested with multiple regression, using the entire population as the validation set.
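The dissection step can be sketched as a set operation. This is one possible reading, under the assumption that frequent combinations have been mined separately for the high-utilization group and for a comparison group; the function and argument names are hypothetical.

```python
def contrasting_attributes(high_frequent, low_frequent):
    """Attributes appearing in combinations frequent only in the high group.

    high_frequent and low_frequent are collections of frozensets, each a
    frequent attribute combination in the respective group.
    """
    low = set(low_frequent)
    contrasting = [combo for combo in high_frequent if combo not in low]
    # Flatten the surviving combinations into their individual attributes.
    return set().union(*contrasting)
```

The resulting attribute set is much smaller than the list of combinations, which is what makes the subsequent regression tractable.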
We used forward selection with p < 0.05 as the entry criterion to add attributes to a simplified regression model for each comparison group. Interaction terms were not included. Because the dependent variable was expressed as a binary classifier (high vs. low utilization), we used logistic regression [[24]] to construct the risk prediction model. For each candidate predictor we calculated the Variance Inflation Factor (VIF) resulting from the regression of that variable on the other candidate predictors. None of the VIF values exceeded 3.8, substantially less than the standard rule of thumb that a VIF of 10 or greater signals instability in the regression coefficients [[25]]. In addition, we examined influence plots from the final model to see if individual cases exerted extreme influence on the regression coefficients, identifying no remarkable observations.
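For reference, the standard definition of the VIF for candidate predictor $j$ is given below, where $R_j^2$ is the coefficient of determination from regressing predictor $j$ on the other candidate predictors (the symbol names are ours, not the paper's):

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

Under this definition, the largest observed VIF of 3.8 corresponds to $R_j^2 = 1 - 1/3.8 \approx 0.74$, whereas the rule-of-thumb limit of 10 corresponds to $R_j^2 = 0.90$.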
The discrimination of the resulting prediction was evaluated by testing the predicted outcome against the actual outcome (FY2013 charges over $94,895 or not) for the entire study population of 9,581 patients. Discrimination was defined as the c-statistic, or the area under the receiver operating characteristic curve of sensitivity versus one-minus-specificity [[26]]. Each comparison group (lowest non-zero 5%, 10%, 15%, 20%, 30%, 40%, and 50%) was contrast-mined independently against the 5% of patients with highest FY2013 charges, and the resulting models were tested independently. The attributes common to all these models also were used to derive a combined model using all FY2013 observations, which was also tested independently.
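The c-statistic can be computed directly from its pairwise definition: the proportion of (high-utilization, not-high-utilization) patient pairs in which the high-utilization patient receives the higher predicted score, with ties counted as one half. The sketch below uses that definition on plain Python lists; the function name is an assumption, and the quadratic loop is meant for illustration rather than for a cohort-sized dataset.

```python
def c_statistic(scores, labels):
    """Area under the ROC curve via pairwise concordance.

    scores: predicted risks; labels: truthy for high utilization.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    # Concordant pairs count 1, tied pairs count 0.5, discordant pairs 0.
    concordant = sum(
        (p > n) + 0.5 * (p == n) for p in positives for n in negatives
    )
    return concordant / (len(positives) * len(negatives))
```

A value of 0.5 indicates no discrimination and 1.0 indicates perfect separation of the two groups.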
4. Results
Contrast mining of 19,014 clinical attributes from the first year of EHR data for 479 high-utilization patients and comparison groups with low-utilization patients (ranging from the lowest 5% to the lowest 50%) identified 5,188 attribute combinations frequently found (support of 20% or more) in patients with high utilization in the second year, but infrequently in other patients (►[Table 2]). Not all combinations were infrequent in all comparison groups, but at least 5,178 of the 5,188 were found in all seven contrast mining analyses. These 5,188 contrasting combinations were made up of 67 unique attributes (►[Table 3]). Logistic regression of the 67 attributes found eleven attributes to be significantly (p<0.05) associated with high utilization (►[Table 4]). The eleven attributes comprised four diagnoses (i.e., depressive disorder, essential hypertension, ischemic heart disease, and osteoarthrosis), one demographic attribute (i.e., obesity), and six prescription types (i.e., anti-infectives, benzodiazepines, beta-adrenergic blocking agents, quinolones, respiratory agents, and selective serotonin reuptake inhibitor antidepressants).
Table 2
Ten (out of 5,188) combinations frequently associated with high utilization
| Attribute Combination | Support |
|---|---|
| Narcotic analgesics, Analgesics, Platelet aggregation inhibitors | 0.21 |
| Antihyperlipidemic agents, Analgesics, HMG CoA reductase inhibitors | 0.39 |
| Antidepressants, ICD9=311 (Depressive disorder), Antihistamines | 0.20 |
| Beta-adrenergic blocking agents, Cardioselective beta blockers, Nutritional products | 0.29 |
| Narcotic analgesics, Respiratory agents, Nutritional products | 0.20 |
| Race=White, Salicylates, Antiplatelet agents, Platelet aggregation inhibitors, Age=65to84 | 0.25 |
| Antiplatelet agents, Analgesics, Beta-adrenergic blocking agents, Platelet aggregation inhibitors | 0.33 |
| Vitamins, Gastrointestinal agents, Salicylates, Nutritional products, Antiplatelet agents | 0.20 |
| Narcotic analgesics, Anxiolytics/sedatives/hypnotics | 0.25 |
| Narcotic/analgesic combinations, Gastrointestinal agents, Laxatives | 0.23 |
Table 3
Individual attributes found in combinations associated with high utilization
| Size of low-cost comparison group: | 5% | 10% | 15% | 20% | 30% | 40% | 50% |
|---|---|---|---|---|---|---|---|
| Number of contrasting combinations: | 5178 | 5180 | 5188 | 5179 | 5179 | 5179 | 5179 |
| Age=25to44 | X | X | - | - | - | - | - |
| Age=45to64 | X | X | X | X | X | X | X |
| Age=65to84 | X | X | X | X | X | X | X |
| Race/ethnicity=White/non-Hispanic | X | X | X | X | X | X | X |
| Female | X | X | X | X | X | X | X |
| Male | X | X | X | X | X | X | X |
| Obesity | X | X | X | X | X | X | X |
| Taking Rx as prescribed | X | X | X | X | X | X | X |
| Taking Rx not as prescribed | X | X | X | X | X | X | X |
| Medicaid | X | X | X | X | X | X | X |
| Prior High Cost | X | X | X | X | X | X | X |
| ICD9=250 (Diabetes mellitus) | X | X | X | X | X | X | X |
| ICD9=272.4 (Hyperlipidemia) | X | X | X | X | X | X | X |
| ICD9=311 (Depressive disorder) | X | X | X | X | X | X | X |
| ICD9=401.1 (Benign essential hypertension) | X | X | X | X | X | X | X |
| ICD9=401.9 (Unspecified essential hypertension) | X | X | X | X | X | X | X |
| ICD9=414 (Ischemic heart disease) | X | X | X | X | X | X | X |
| ICD9=715 (Osteoarthrosis) | X | X | X | X | X | X | X |
| Adrenergic bronchodilators | X | X | X | X | X | X | X |
| Alternative medicines | X | X | X | X | X | X | X |
| Analgesics | X | X | X | X | X | X | X |
| Angiotensin converting enzyme inhibitors | X | X | X | X | X | X | X |
| Antiarrhythmic agents | X | X | X | X | X | X | X |
| Anticonvulsants | X | X | X | X | X | X | X |
| Antidepressants | X | X | X | X | X | X | X |
| Antidiabetic agents | X | X | X | X | X | X | X |
| Antiemetic antivertigo agents | X | X | X | X | X | X | X |
| Antihistamines | X | X | X | X | X | X | X |
| Antihyperlipidemic agents | X | X | X | X | X | X | X |
| Anti-infectives | X | X | X | X | X | X | X |
| Antiplatelet agents | X | X | X | X | X | X | X |
| Antipsychotics | X | X | X | X | X | X | X |
| Anxiolytics, sedatives and hypnotics | X | X | X | X | X | X | X |
| Benzodiazepine anticonvulsants | X | X | X | X | X | X | X |
| Benzodiazepines | X | X | X | X | X | X | X |
| Beta-adrenergic blocking agents | X | X | X | X | X | X | X |
| Bronchodilators | X | X | X | X | X | X | X |
| Calcium channel blocking agents | X | X | X | X | X | X | X |
| Cardioselective beta blockers | X | X | X | X | X | X | X |
| Cardiovascular agents | X | X | X | X | X | X | X |
| Dermatological agents | X | X | X | X | X | X | X |
| Diuretics | X | X | X | X | X | X | X |
| Gamma-aminobutyric acid analogs | X | X | X | X | X | X | X |
| Gastrointestinal agents | X | X | X | X | X | X | X |
| HMG CoA reductase inhibitors | X | X | X | X | X | X | X |
| Hormones/hormone modifiers | X | X | X | X | X | X | X |
| Iron products | X | X | X | X | X | X | X |
| Laxatives | X | X | X | X | X | X | X |
| Minerals and electrolytes | X | X | X | X | X | X | X |
| Miscellaneous analgesics | X | X | X | X | X | X | X |
| Miscellaneous anxiolytics, sedatives and hypnotics | X | X | X | X | X | X | X |
| Muscle relaxants | X | X | X | X | X | X | X |
| Narcotic/analgesic combinations | X | X | X | X | X | X | X |
| Narcotic analgesics | X | X | X | X | X | X | X |
| Nonsteroidal anti-inflammatory agents | X | X | X | X | X | X | X |
| Nutraceutical products | X | X | X | X | X | X | X |
| Nutritional products | X | X | X | X | X | X | X |
| Platelet aggregation inhibitors | X | X | X | X | X | X | X |
| Proton pump inhibitors | X | X | X | X | X | X | X |
| Quinolones | X | X | X | X | X | X | X |
| Respiratory agents | X | X | X | X | X | X | X |
| Salicylates | X | X | X | X | X | X | X |
| Skeletal muscle relaxants | X | X | X | X | X | X | X |
| SSRI antidepressants | X | X | X | X | X | X | X |
| Thiazide and thiazide like diuretics | X | X | X | X | X | X | X |
| Vitamin and mineral combinations | X | X | X | X | X | X | X |
| Vitamins | X | X | X | X | X | X | X |
Table 4
Regression model of attributes significantly (p < 0.05) associated with high utilization
| Attribute | Coefficient | p-value | Odds Ratio | 95% Confidence Limit (lower) | 95% Confidence Limit (upper) |
|---|---|---|---|---|---|
| Diagnoses | | | | | |
| ICD9=311 depressive disorder | 0.5568 | <0.0001 | 1.707 | 1.343 | 2.168 |
| ICD9=401.9 unspecified essential hypertension | 0.3967 | 0.0007 | 1.423 | 1.128 | 1.795 |
| ICD9=414 ischemic heart disease | 0.5939 | <0.0001 | 1.828 | 1.386 | 2.411 |
| ICD9=715 osteoarthrosis | 1.0479 | <0.0001 | 2.769 | 2.192 | 3.499 |
| Demographic Attribute | | | | | |
| Obesity (BMI ≥ 30) | 2.3520 | <0.0001 | 9.496 | 7.530 | 11.976 |
| Prescription Types | | | | | |
| Anti-infectives | 0.4136 | 0.0060 | 1.504 | 1.117 | 2.025 |
| Benzodiazepines | 0.2975 | 0.0139 | 1.307 | 1.026 | 1.665 |
| Beta-adrenergic blocking agents | 0.2832 | 0.0148 | 1.314 | 1.047 | 1.649 |
| Quinolones | 0.4916 | 0.0087 | 1.674 | 1.158 | 2.421 |
| Respiratory agents | 0.3030 | 0.0063 | 1.340 | 1.076 | 1.668 |
| Selective serotonin reuptake inhibitor (SSRI) antidepressants | -0.4062 | 0.0019 | 0.655 | 0.506 | 0.847 |
*Intercept = –4.2585 with p < 0.0001
The c-statistic of the resulting model was 0.8436, with a 95% confidence interval of (0.8227, 0.8645). Treating sensitivity and specificity errors as equally important, we calculated an optimal threshold for the model by minimizing the distance to the upper left corner of the receiver operating characteristic graph (►[Figure 3]). This distance was calculated as
$$d = \sqrt{(1 - \text{sensitivity})^2 + (1 - \text{specificity})^2}$$
and tuning the model to this threshold produced a sensitivity of 0.770, a specificity of 0.812, a positive predictive value of 0.202, and a negative predictive value of 0.983. Please refer to the second paragraph of the “Primary Findings” (Section 5.1, below) for an interpretation of these measures.
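The threshold search described above can be sketched as a scan over candidate cut-points, keeping the one whose ROC point lies closest to the upper left corner (sensitivity = 1, specificity = 1). The function name and the toy inputs in the usage note are assumptions, not the study's code.

```python
from math import hypot


def closest_to_corner(scores, labels):
    """Return (threshold, sensitivity, specificity, distance).

    A patient is predicted high-utilization when score >= threshold.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if not y and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        # Euclidean distance from the ROC point to the upper left corner.
        dist = hypot(1 - sens, 1 - spec)
        if best is None or dist < best[3]:
            best = (t, sens, spec, dist)
    return best
```

For example, `closest_to_corner([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])` selects the threshold 0.8, which separates the two toy groups perfectly. Weighting the two error types unequally would simply change the distance function.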
Fig. 3 Receiver operating characteristic (ROC) curve for the final model
5. Discussion
5.1 Primary Findings
A novel, two-step combination of EHR data mining with multiple logistic regression yielded a manageably small number of clinical attributes, which accurately predicted the 5% of patients who incurred nearly 50% of healthcare expenses. The model presented here has the virtue of simplicity and interpretability while still achieving an area under the ROC curve of 0.84, markedly higher than the c-statistic of about 0.7 reported for comparable models [[10]]. Although adding interaction effects and nonlinear effects of continuous variables (e.g., age) to the logistic model might slightly improve this already reasonably high accuracy, it would come at the cost of a more complex model that might impede clinical interpretation. We felt that this model performed adequately without the added complexity, and demonstrated the methodology using unrestricted EHR data.
While the positive predictive value of 20% and negative predictive value of 98% appear low and high, respectively, they are reasonably useful given a population in which only 5% of patients are truly positive for high cost, and 95% of patients are negative. For example, a positive predictive value of 20% would result in five patients receiving the intervention of care management for every patient actually destined to incur high costs without intervention. This over-treatment penalty may be reasonable because care management is both extremely safe and relatively inexpensive, and because the 98% negative predictive value of the model would direct population health managers away from nearly all patients who will not incur the highest 5% of costs without the intervention.
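The "five patients" figure follows directly from the reported positive predictive value, since the number of flagged patients per truly high-cost patient is its reciprocal:

$$\frac{1}{\text{PPV}} = \frac{1}{0.202} \approx 4.95 \approx 5$$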
These examples demonstrate the utility of mining the rich data available in the EHR to predict the small number of patients who will incur the majority of healthcare expenses, which can support population health managers in focusing preventive and longitudinal care more effectively. This could support the Triple Aim [[1]] by improving health outcomes (for example, improving blood sugar control or blood pressure control in high-risk patients), improving healthcare delivery (for example, proactively reaching out to patients with unmet health management needs), and reducing costs (for example, using earlier lower-cost interventions such as frequent outpatient visits to reduce expensive inpatient stays).
All of the four diagnoses found to be associated with high utilization are among the ten most expensive medical conditions in the U.S. in 2013 [[28]]: (a) ischemic heart disease (second most expensive), (b) depression (third), (c) osteoarthrosis (fifth), and (d) hypertension (eighth). Of the prescription types found to be associated with high utilization, beta-adrenergic blocking agents may be indicative of ischemic heart disease (second most expensive); benzodiazepines may be indicative of depression (third), and respiratory agents may be indicative of chronic obstructive pulmonary disease (sixth). The partial congruence of the sample model with the medical conditions known to be most expensive validates the generalizability of these findings, while demonstrating the potential for other, novel discoveries (i.e., a nearly ten-fold increase in the odds of high costs associated with obesity, increased risks associated with anti-infectives in general and quinolones specifically, and risk reduction associated with SSRI antidepressants).
This sample prediction model for high healthcare utilization, or similar models derived using the same methodology, may be more suitable for secondary prevention than primary prevention since many of the associated attributes are chronic conditions or therapeutics. For example, identification of hypertension and obesity as risk factors for high utilization should alert population health managers to monitor blood pressure and body weights more closely in high-risk patients, or review their medications more often. This method would also be applicable to disease-specific models or to other outcomes of interest, such as inpatient, emergency, or outpatient charges considered separately. Multiple models could be created from the same algorithm by limiting the population sample (for example, to patients with diabetes or those with hyperlipidemia) or by excluding some attributes which may not be interesting or may not be actionable (for example, excluding patients with high prior-year costs, or testing demographics and diagnoses but ignoring prescriptions).
The coefficients of the final regression model can be used to calculate a relative score [[29]] for all patients in a population (►[Table 4]). This score gives an approximate relative risk of high utilization in the upcoming year, and patient interventions could be prioritized by ranking these scores. Alternatively, clinical alerts could be triggered for patients with scores exceeding a given threshold. By adjusting the threshold of the scoring system, the sensitivity and specificity of the model could be tuned to identify only as many high-risk patients as can be managed. However, because population health management is a low-risk and relatively low-cost intervention, clinical applications may benefit from greater sensitivity even at the price of lower specificity.
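One simple way to turn the Table 4 coefficients into a per-patient score is to sum the coefficients of the attributes present and add the intercept, which gives the logistic model's linear predictor. The sketch below does this and ranks patients by the result. It is an illustration under assumptions: the cited scoring method [[29]] may differ (for example, by rounding coefficients to integer points), and the attribute labels and function names are placeholders.

```python
from math import exp

# Intercept and coefficients as reported in Table 4.
INTERCEPT = -4.2585
COEFFICIENTS = {
    "ICD9=311 depressive disorder": 0.5568,
    "ICD9=401.9 unspecified essential hypertension": 0.3967,
    "ICD9=414 ischemic heart disease": 0.5939,
    "ICD9=715 osteoarthrosis": 1.0479,
    "Obesity (BMI >= 30)": 2.3520,
    "Anti-infectives": 0.4136,
    "Benzodiazepines": 0.2975,
    "Beta-adrenergic blocking agents": 0.2832,
    "Quinolones": 0.4916,
    "Respiratory agents": 0.3030,
    "SSRI antidepressants": -0.4062,
}


def linear_score(attributes):
    """Intercept plus the coefficients of the attributes that are present."""
    return INTERCEPT + sum(COEFFICIENTS.get(a, 0.0) for a in attributes)


def predicted_probability(attributes):
    """Logistic transform of the linear score."""
    return 1.0 / (1.0 + exp(-linear_score(attributes)))


def rank_patients(patients):
    """patients maps an id to its set of attributes; highest score first."""
    return sorted(patients, key=lambda pid: linear_score(patients[pid]),
                  reverse=True)
```

An alert rule would then compare `linear_score` (or `predicted_probability`) against a chosen threshold, and lowering that threshold trades specificity for sensitivity as discussed above.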
Some common attributes (e.g., gender=female, gender=male, race=white-non-Hispanic, or age=65–84 in this population) were found in attribute combinations associated with high utilization, but they clearly were not independent predictors of high utilization since they also were found in attribute combinations associated with low utilization or not predictive of utilization. This may explain why no demographic attributes other than obesity were identified in the final model. It is surprising that age and high prior-year costs were not significant predictors of high utilization, and these attributes may be found to be predictive in other populations.
Dissecting the associated combinations into separate attributes yields more robust predictors by generalizing the specific combinations of attributes found in a given population, reducing the number of rules (from thousands to tens, in this case), and testing the combined effects of the attributes by traditional statistical methods to identify the significant predictors.
5.2 Limitations
While data mining techniques other than contrast mining can be used to discover associations with continuous outcomes, the focus of this demonstration was on a policy-relevant binary outcome: “high cost” and “not high cost,” based on the well-supported contrast between patients in the higher 5% and lower 95% of costs [[8]]. Multivariate regression is not limited to binary outcomes, however, and linear regression on actual charges could have been used to describe or predict the central portion of the cost distribution.
Because this was a single-system study, the generalizability of these results to other populations is not clear. Predicting high hospital and clinic utilization reflects an important outcome of interest, but may exclude some patients who died in the second year before incurring charges high enough to exceed the measurement threshold. Furthermore, at the time these data were gathered, the University of Missouri EHR was undergoing a transition from ICD-9 to SNOMED coding. Since the same disease may have been recorded with an ICD-9 code in some patient records and a SNOMED code in others, the predictive power of some diseases may have been split between two diagnosis codes that were unrecognized synonyms. Lastly, hospital charges were used as a proxy for healthcare costs, but claims data would be a more accurate source of cost information.
5.3 Future Research
The implementation of these predictors as clinical alerts would allow quantitative and qualitative measurement of their clinical impact, in order to test the hypothesis that this predictive methodology can facilitate more efficient deployment of preventive and longitudinal care. Comparing these results to prior literature would help determine their clinical utility, and future studies might also survey expert clinical opinion as to the utility of these predictors of high utilization in population management. In addition, it would be useful to duplicate this method with other patient populations, with higher and lower support values for “frequent” associations, and with expanded data sources including geospatial and socioeconomic attributes.
Further studies are also needed to incorporate Medicare and Medicaid claims data for the LIGHT2 enrollees during the measurement period and to expand the attribute set with socio-economic status attributes, second-order attributes such as number of co-morbidities and poly-pharmacy, and intervention data such as nursing contacts and disease-management training.
6. Conclusions
A novel, two-step analysis of the electronic health records of 9,581 Medicare and Medicaid patients generated hypotheses with contrast mining and tested them with multiple logistic regression. This method yielded multiple similar models, each comprising a manageably small number of attributes that accurately predicted which patients would be in the 5% of patients with the highest healthcare utilization in the following year. The similarity of the models derived from varying comparison groups illustrates the flexibility and robustness of this approach. Because this method is not hypothesis driven, but draws predictors from the broader set of inputs available in a clinical EHR, it has the potential of discovering novel predictors, which may make it particularly useful in improving predictive discrimination over existing hypothesis-driven models. The method identified both expected and novel predictors including four diagnosis codes (i.e., depressive disorder, essential hypertension, ischemic heart disease, and osteoarthrosis), one demographic attribute (i.e., obesity), and six prescription types (i.e., anti-infectives, benzodiazepines, beta-adrenergic blocking agents, quinolones, respiratory agents, and SSRI antidepressants).
By predicting the small number of patients who will incur the majority of healthcare expenses, this method can support population health managers in focusing preventive and longitudinal care more effectively. This model, and similar models developed by combining contrast mining with logistic regression on readily available EHR data, could be used by population health managers to further the “Triple Aim” of better health outcomes, better healthcare delivery, and lower costs [[1]].
Questions
ANSWER: B. The Chronic Care Model [[2]] proposed improving the effectiveness of interactions between patients and providers as a way of promoting the “Triple Aim” of healthcare: better health, better care, and lower costs [[1]]. By bridging the implementation gaps in the Chronic Care Model, well-designed predictive analytics support the transition from the traditional “reactive” model of medical care [[6]] to one of maintaining health and avoiding preventable conditions [[3]]. Predictive analytics are potentially powerful tools for predicting population health outcomes [[7]].
ANSWER: C. Predictive analytics can be used to rapidly spot opportunities to improve care management [[7]], but regression analysis and other traditional statistical methods are constrained by the limited number of attributes that can be used [[11]]. However, a two-step process of data mining to reduce the number of candidate predictors followed by multiple regression to test the remaining candidates will permit the development of models that surpass the size restrictions of traditional statistical methods.