4 Applying the Proposed Methods
According to Section 3, we selected (first step) the following major journals: Artificial Intelligence in Medicine (AIIM); Applied Clinical Informatics (ACI); Computer Methods and Programs in BioMedicine (CMPB); Computers in Biology and Medicine (CBM); Information Systems (IS); Journal Health Informatics Research (JHIR); IEEE Journal of Biomedical and Health Informatics (JBHI); Journal of Biomedical Informatics (JBI); Journal of Medical Informatics (JMI); Methods of Information in Medicine (MIM); Journal of the American Medical Informatics Association (JAMIA). Reading and analyzing (second step) the issues of the mentioned journals over the last five years required going through 7,652 papers: of these, 201 papers dealt with the topics of CIS, HIS, and EHR or similar.
We then completed (third step) the search by querying the most common search engines: we considered Google Scholar, Medline and its search engine PubMed from the National Library of Medicine, and the computer science bibliography DBLP (Digital Bibliography & Library Project). Particular attention was paid to conferences, annual conferences, and proceedings from the following series: Artificial Intelligence in MEdicine (AIME); ACM Bioinformatics, Computational Biology and Health Informatics (BCB); IEEE International Conference on Healthcare Informatics (ICHI); American Medical Informatics Association (AMIA) Annual Conference.
The papers so far considered and selected were then classified according to a taxonomy (fourth step), discussed in Section 4.1.1. [Figure 2] graphically depicts the proposed taxonomy for CIS, HIS, and EHR systems.
Fig. 2 Taxonomy of the major dimensions of Clinical Information Systems (CIS), Hospital Information Systems (HIS), and Electronic Health Records (EHR) as reconstructed from the considered literature.
The list of papers selected in the fourth step was then further refined (fifth step) to extract those papers that also covered the topics of AI. The resulting papers, roughly 200, were then grouped (sixth step) according to a taxonomy, discussed in Section 4.1.2. [Figure 3] graphically depicts the proposed taxonomy.
Fig. 3 Taxonomy for the use of Artificial Intelligence (AI) techniques on healthcare data as reconstructed from the considered literature.
[Table 1] numerically summarizes the papers we considered.
Table 1
Every cell reports data in an X / Y / Z format, where X counts the papers which fit the four high-level topics of [Figure 3]; Y counts the papers which fit the keywords of "Clinical Information Systems"; and Z counts the grand total of papers published by that journal during that year. Totals are reported per journal (rightmost column) and per year (bottommost row).
Journal name | 2014 | 2015 | 2016 | 2017 | 2018 | total per journal
Artificial Intelligence in Medicine - AIIM | -- / 1 / 55 | 2 / 2 / 59 | 1 / 4 / 53 | 2 / 1 / 64 | -- / 1 / 53 | 5 / 9 / 284
Applied Clinical Informatics - ACI | -- / 2 / 74 | -- / 2 / 59 | -- / 2 / 85 | -- / 5 / 96 | -- / 7 / 85 | 0 / 18 / 399
Computer Methods and Programs in BioMedicine - CMPB | -- / 2 / 216 | -- / 2 / 122 | -- / 4 / 286 | -- / 6 / 235 | 2 / 8 / 299 | 2 / 22 / 1158
Computers in Biology and Medicine - CBM | -- / 3 / 207 | -- / 3 / 320 | -- / 3 / 276 | 1 / 4 / 281 | 1 / 2 / 306 | 2 / 15 / 1390
Information Systems - IS | -- / 0 / 76 | -- / 1 / 110 | -- / 1 / 77 | -- / 1 / 117 | -- / 0 / 56 | 0 / 3 / 436
Journal Health Informatics Research - JHIR | DNA | DNA | DNA | 1 / 1 / 10 | -- / 1 / 21 | 1 / 2 / 31
IEEE Journal of Biomedical and Health Informatics - JBHI | 1 / 3 / 214 | 2 / 4 / 218 | -- / 0 / 170 | 2 / 5 / 180 | 1 / 5 / 176 | 6 / 17 / 958
Journal of Biomedical Informatics - JBI | 2 / 10 / 166 | 1 / 11 / 207 | 3 / 5 / 202 | 5 / 6 / 218 | 3 / 6 / 190 | 14 / 38 / 983
Journal of Medical Informatics (Elsevier) - JMI | -- / 8 / 97 | 2 / 5 / 113 | -- / 4 / 148 | 1 / 7 / 182 | 1 / 4 / 185 | 4 / 28 / 725
Methods of Information in Medicine - MIM | -- / 0 / 71 | -- / 4 / 88 | -- / 1 / 74 | 1 / 5 / 70 | -- / 2 / 34 | 1 / 12 / 337
Journal of the American Medical Informatics Association - JAMIA | 2 / 6 / 174 | 2 / 7 / 188 | 2 / 10 / 196 | 1 / 9 / 190 | 1 / 10 / 203 | 8 / 42 / 951
total per year | 5 / 35 / 1350 | 9 / 41 / 1484 | 6 / 34 / 1567 | 14 / 50 / 1643 | 9 / 46 / 1608 | 43 / 206 / 7652
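The X / Y / Z cell format of Table 1 lends itself to a mechanical consistency check of the per-journal totals. The following is a minimal sketch in Python, not part of the original study: the function names `parse_cell` and `row_total` are hypothetical, and it assumes that "--" stands for a count of zero and that "DNA" marks a year for which no data are available for that journal.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int, int]  # (X, Y, Z) as defined in the caption of Table 1


def parse_cell(text: str) -> Optional[Cell]:
    """Parse one 'X / Y / Z' cell; return None for a 'DNA' cell."""
    text = text.strip()
    if text == "DNA":  # assumption: no data for that journal in that year
        return None
    parts = [part.strip() for part in text.split("/")]
    if len(parts) != 3:
        raise ValueError(f"expected three fields, got {text!r}")
    # Assumption: '--' stands for a count of zero.
    x, y, z = (0 if part == "--" else int(part) for part in parts)
    return (x, y, z)


def row_total(cells: List[str]) -> Tuple[int, ...]:
    """Sum X, Y, and Z column-wise over the yearly cells of one journal."""
    parsed = [cell for cell in map(parse_cell, cells) if cell is not None]
    return tuple(sum(column) for column in zip(*parsed))


# Yearly cells (2014-2018) copied from the AIIM and JHIR rows of Table 1.
aiim = ["-- / 1 / 55", "2 / 2 / 59", "1 / 4 / 53", "2 / 1 / 64", "-- / 1 / 53"]
jhir = ["DNA", "DNA", "DNA", "1 / 1 / 10", "-- / 1 / 21"]

print(row_total(aiim))  # (5, 9, 284), matching the reported 5 / 9 / 284
print(row_total(jhir))  # (1, 2, 31), matching the reported 1 / 2 / 31
```

The same function can be applied column-wise to re-derive the "total per year" row.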
Finally, we considered the most relevant contributions we identified in the literature (seventh step), which are surveyed in detail in Sections 4.2–4.5. In order to reduce the number of papers from some 200 to some 40, we selected those papers which, according to our evaluation, introduced major novelties and innovations.
4.1 Taxonomies
In this paper, we identified two taxonomies: one related to the major dimensions of a CIS, HIS, and EHR, and the other related to the use of Artificial Intelligence techniques over data collected by a CIS, HIS, and EHR.
4.1.1 Taxonomy of Major Dimensions
We describe here a taxonomy including the major dimensions according to which we classified clinical information systems, as depicted in [Figure 2]. The taxonomy we propose here differs from already published taxonomies, such as [7], [8]. In fact, existing taxonomies either focus on health information technology in general, and do not go into depth on clinical information systems, or focus on the success factors for a clinical information system, and do not provide the reader with a genuinely high-level taxonomy for clinical information systems.
According to our approach, we identified the following five major dimensions from the more than 200 papers (step 5 of [Figure 1]) retrieved from the literature. “Target” considers the approach of CIS, HIS, and EHR (patient-oriented; pathology- or problem-oriented; genome-oriented); “Goal” considers the main reason for collecting data by a CIS, HIS, EHR (everyday practice; clinical trial, specific research, experimentation, or validation; research-oriented); “Application domain” considers the environment and the main characteristics of CIS, HIS, and EHR (in home or admitted patient; chronic disease, time-oriented medical record; rural/urban areas); “Technology” considers the architecture of the computer system on top of which the CIS, HIS, or EHR runs (distributed, federated, on the cloud; blockchain; interoperable system and HL7 (Health Level 7), HIE (Health Information Exchange), FHIR (Fast Healthcare Interoperability Resources); mobile or desktop access; open system; privacy, anonymization, and data protection); “Use of data” refers to the aim according to which stored data are then processed (prognostic, predictive; personalized and precision medicine; indicator extraction; data quality and care quality evaluation; demographics; process mining and pathway identification; learning, data analytics, data mining, text mining, lexical indexing, machine learning; clinical decision support system (DSS); pattern identification and clustering; information extraction; natural language processing (NLP); cost estimation and prediction; insurance and claims; data optimization in large scale records).
As an example for the taxonomy of [Figure 2], one instance of a CIS can be described according to the five dimensions above: some dimensions may assume more than one atomic value, i.e., some attributes within one dimension are not mutually exclusive. For example, the CIS can be patient-oriented (according to the “Target” dimension, the CIS collects data for every single patient, who may present different pathologies), for everyday practice (as “Goal”, the CIS supports everyday practice), for follow-up visits (according to the “Application domain” dimension, the CIS collects data in a time-oriented manner), storing data in the cloud and permitting mobile access (according to the “Technology” dimension, the CIS is cloud-based and offers mobile access interfaces - two values for the same dimension), and using data to support the clinician in decision-making (according to the “Use of data” dimension, the CIS exports data to a decision support system).
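The example above can also be written down as a small data structure. The sketch below is only an illustration in Python under stated assumptions: the names `TAXONOMY` and `SystemProfile` are hypothetical, the vocabulary is abridged to a few values per dimension, and each dimension maps to a set so that several non-exclusive values can coexist.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set

# Abridged vocabulary: only a few of the values listed for each dimension.
TAXONOMY: Dict[str, FrozenSet[str]] = {
    "Target": frozenset({"patient-oriented", "pathology- or problem-oriented",
                         "genome-oriented"}),
    "Goal": frozenset({"everyday practice", "clinical trial",
                       "research-oriented"}),
    "Application domain": frozenset({"in home or admitted patient",
                                     "chronic disease",
                                     "time-oriented medical record",
                                     "rural/urban areas"}),
    "Technology": frozenset({"on the cloud", "blockchain",
                             "mobile or desktop access", "open system"}),
    "Use of data": frozenset({"prognostic, predictive",
                              "clinical decision support system (DSS)",
                              "natural language processing (NLP)"}),
}


@dataclass
class SystemProfile:
    """Describes one CIS, HIS, or EHR along the five dimensions."""
    name: str
    values: Dict[str, Set[str]] = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError if a dimension or value is not in TAXONOMY."""
        for dimension, chosen in self.values.items():
            if dimension not in TAXONOMY:
                raise ValueError(f"unknown dimension: {dimension}")
            unknown = chosen - TAXONOMY[dimension]
            if unknown:
                raise ValueError(f"unknown values for {dimension}: {unknown}")


# The CIS instance described in the text; "Technology" holds two values.
example = SystemProfile(
    name="example CIS",
    values={
        "Target": {"patient-oriented"},
        "Goal": {"everyday practice"},
        "Application domain": {"time-oriented medical record"},
        "Technology": {"on the cloud", "mobile or desktop access"},
        "Use of data": {"clinical decision support system (DSS)"},
    },
)
example.validate()  # passes silently: every value belongs to its dimension
```

Using a set per dimension, rather than a single value, is what captures the remark that attributes within one dimension are not mutually exclusive.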
4.1.2 Taxonomy of the Use of AI Techniques
The dimension “Use of data” from the taxonomy of [Figure 2] is then used to derive the taxonomy of the techniques taken from Artificial Intelligence.
After reading, analyzing, and grouping the roughly 200 retrieved papers (step 5 of [Figure 1]), the taxonomy we propose on the use of AI techniques identifies four major high-level topics: the first one is Learning, Data Analytics, Predictive and Personalized Medicine (LDAPPM), which includes learning (i.e., extracting new knowledge), data analytics, data mining, lexical indexing, machine learning, pattern identification, clustering, prognostic, predictive, or readmission prediction; then Decision Support Systems, which refers to the use of clinical information to support decision-making activities in clinical contexts; Natural Language Processing, which includes processing and mining text-based clinical information; and Process Mining and Pathway Identification (PMPI), which includes mining and identification of healthcare/clinical processes and care pathways. [Figure 3] graphically depicts the proposed taxonomy.
4.2 Learning, Data Analytics, Predictive and Personalized Medicine (LDAPPM)
This category is very wide, including many related and sometimes overlapping topics. We summarized the category as LDAPPM, represented by 44 papers out of the grand total of considered papers: the major journals we encountered during our review that focus on the topic of LDAPPM are JAMIA and JBHI. LDAPPM is the most populated topic among those identified by the taxonomy of [Figure 3]. Also, LDAPPM is one of the few high-level topics for which the literature reports many detailed special issues and survey papers: we consider some of the most relevant ones in the following subsection.
4.2.1 Special Issues and Survey Papers
One contribution is a guest editorial from Chiu and Li [9] in the journal Computer Methods and Programs in Biomedicine, and it focuses on improving healthcare management with data science. The guest editorial highlights a relevant use of an Electronic Medical Record system to generate near-real-time estimates from big data analysis: the application domain is that of monitoring flu epidemics, and the interesting aspect is that of predicting admissions to triage and to hospital due to that type of epidemic.
The review article from Mehta and Pandit [10] in the International Journal of Medical Informatics provides an interesting survey on the concurrence of big data analytics and healthcare. The paper starts with a collection of definitions of big data and big data analytics. The paper then provides the reader with a taxonomy of existing systems according to three main criteria: sources of healthcare data (e.g., electronic medical records (EMRs), diagnostics, medical claims, prescription claims, clinical trials, social media, wearables, and sensors); big data analytical techniques (e.g., cluster analysis, data mining, graph analytics, machine learning, neural networks, pattern recognition, and spatial analysis); and big data applications (e.g., genomics, drug discovery, personalized healthcare, precision medicine, elderly care, and many others). The paper also highlights that most of the studies reported in the literature still have a relatively narrow scope and limited practical applications. Moreover, most of these studies come from developed countries, and no deployment to data from developing countries is foreseen in the near future.
Andreu-Perez et al. [11] also propose an overview of recent advancements on big data in the IEEE Journal of Biomedical and Health Informatics. The authors extend their analysis and the concept of big data analytics from medical and health informatics to translational bioinformatics, sensor informatics, and imaging informatics. Thus, big data are not only those stored by traditional EMR and Clinical Information Systems, but include any type of patient data. The authors also raise some critical issues not to be neglected when dealing with big data, including privacy, security, data ownership, data stewardship, and governance. In fact, such big collections of data are on the one side extremely relevant to progress in clinical research, but on the other side they may interfere with one’s private life - and one may not want to share personal details of his/her private life.
Ravi et al. [12] in the IEEE Journal of Biomedical and Health Informatics provide the reader with a comprehensive survey on the deployment of deep learning techniques in health informatics. As in [11], the authors highlight potential usage in health informatics, translational bioinformatics, medical imaging, pervasive sensing, and public health, and they particularly focus on one specific technique for data analytics (deep learning) among the existing ones. The authors also sketch out some limitations and challenges to be faced: convolutional neural networks are deployed in a black box approach, and no modification is applied in case misleading classifications are detected; while several experiments have been performed in the literature, most of them rely on relatively small datasets or focus on rare diseases, and consequently the error on the training set is very small, but results cannot be reliably generalized to new situations which have not already been observed; preprocessing of data still remains a critical step, influencing the overall performance, and the proper dimensioning of the many parameters of a deep neural network is still a blind process which deserves accurate validation; sensitivity to noise (including, as a proof, deliberately introduced noise) is still to be improved, as in any other data analytics approach.
The guest editorial from Yang and Veltri [13] in the journal Artificial Intelligence in Medicine focuses on intelligent healthcare informatics in the big data era. Among the papers presented by the guest editorial, the contribution from Kavuluru, Rios, and Lu [14] describes an empirical evaluation of supervised learning techniques, which read some 71K electronic medical records from in-patients, and assign diagnostic codes. This approach uses a subset of 1,231 ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) codes, out of a full set of 4,723 distinct codes: as expected, better results are achieved for those diseases whose training set includes at least 50 cases, while rarer diseases - or diseases with a smaller training set - are poorly classified.
The methodological review of Parimbelli et al. [15] in the Journal of Biomedical Informatics deals with patient similarity for precision medicine. The authors first report a taxonomy of data types used to detect patients’ similarity such as molecular, clinical, and laboratory data, as well as imaging/bio signals, data integration, and patient-reported outcomes. The authors then report a taxonomy of application domains where patients’ similarity is investigated: cancer, nervous system/mental health, integumentary/exocrine system, respiratory system, digestive/excretory system, musculoskeletal system, cardiovascular/circulatory system - to mention the most relevant ones. Finally, the authors report a taxonomy of approaches used to detect (and, possibly, measure) patients’ similarity: clustering, dimensionality reduction, similarity, supervised clustering. Most importantly, the authors envisage concentrating research efforts on the integration of patient similarity measures with decision support systems, to boost research on predictive medicine.
Shickel et al. [5] in the IEEE Journal of Biomedical and Health Informatics provide the reader with a survey on recent advances in deep learning in analyzing electronic health records. The taxonomy proposed by the authors focuses on machine learning techniques (multilayer perceptron, convolutional neural networks, recurrent neural networks, autoencoders, restricted Boltzmann machine) and deep learning applications (electronic health record information extraction, electronic health record representation learning, outcome prediction, computational phenotyping, clinical data de-identification). The authors also clearly highlight the major limitations of current research on the topic, which refer to model interpretability, data heterogeneity, and lack of universal benchmarks. According to the authors, the latter is the most relevant topic on which research has to focus.
Bisaso et al. [4] in the journal Computers in Biology and Medicine survey machine learning applications on HIV data from medical records. The authors are from Uganda, and their paper is one of the few papers from developing countries, where HIV and HIV-related diseases are highly relevant and epidemic. The considered work focuses on papers and data both from medical care and from research communities. The authors clearly demonstrate how the trend of research moved from considering electronic medical records only, to including genetic, imaging, and lab data. In fact, up to 2002 data extracted from electronic medical records were the only source of information. Starting from 2008, the key role in providing research with useful information has been played by genetic data, while information extracted from traditional electronic medical records has become less and less relevant.
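The threshold of 50 training cases mentioned for [14] suggests a simple preliminary check on any coded dataset: separating well-supported codes from rare ones before training. The following Python sketch is only an illustration and not the procedure of [14]; the function name `split_codes_by_support` and the placeholder code labels are hypothetical, and each record is assumed to carry a list of diagnostic codes.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def split_codes_by_support(
    records: Iterable[List[str]], min_cases: int = 50
) -> Tuple[Set[str], Set[str]]:
    """Return (frequent, rare) code sets given a minimum number of records."""
    # Count each code at most once per record.
    counts = Counter(code for codes in records for code in set(codes))
    frequent = {code for code, n in counts.items() if n >= min_cases}
    rare = set(counts) - frequent
    return frequent, rare


# Toy data with placeholder labels (not real ICD-9-CM codes):
# CODE_A appears in 60 records, CODE_B in only 3 of them.
records = [["CODE_A"]] * 57 + [["CODE_A", "CODE_B"]] * 3
frequent, rare = split_codes_by_support(records)
print(frequent, rare)  # {'CODE_A'} {'CODE_B'}
```

Codes that fall into the rare set are those for which, following the observation above, a supervised classifier should be expected to perform poorly.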
4.2.2 Promising Results in Some Application Domains
The literature includes some papers on specific application domains where the use of LDAPPM seems to lead to promising results. The paper by Lara et al. [16] in the Journal of Biomedical Informatics aims at deploying data mining techniques on data related to events in time series. The paper focuses on EEG (electroencephalography) data, and it uses data from a publicly available data source (EEG recordings), while extension of the approach to data from clinical information systems is straightforward. Deployed data mining techniques mainly include adaptive fuzzy inference neural networks (AFINN). The general framework is based on an event definition language, which improves the overall performance of the approach, as well as its applicability to other medical domains such as electrocardiography (ECG).
Kipnis et al. [17] in the Journal of Biomedical Informatics target the alerting of inpatient deteriorations. The goal of the paper is that of detecting the evidence of physiologic derangements for a given patient reasonably in advance (6 to 24 hours according to the authors) of actually observing the deterioration for that patient. This prediction is based on data collected from the electronic health record of the patient: it is not a predictive medicine system based on genomic or ancestral data analysis. The alerting system, namely Advanced Alert Monitor, has been developed from the analysis of some 650 K hospitalization episodes and some 48 M hourly observations. While the system currently performs well over cases with abundant information, the authors are planning to face the challenge of achieving good results also in the cases where data availability is much poorer.
Monsalve-Torra et al. [18] in the Journal of Biomedical Informatics focus on the application domain of patients who underwent surgery for abdominal aortic aneurysm. This disease features a high rate of mortality and complications with consequent reduced quality of life and higher costs of treatment. Consequently, the estimation of mortality risk is extremely important. The authors deploy machine learning methods based on neural networks and Bayesian networks to build a system which predicts hospital mortality. The considered dataset is made of 57 attributes from 310 cases coming from clinical information systems. The attributes were pre-processed and then fed into the WEKA (Waikato Environment for Knowledge Analysis) system.
Bourne et al. [19] and Margolis et al. [20] in their papers published in the Journal of the American Medical Informatics Association focus on an extremely relevant project by the National Institutes of Health (NIH), the Big Data to Knowledge (BD2K) initiative. The BD2K project aims at identifying solutions to biological problems in the shape of methods, tools, software, and training to be shared within the biomedical research community at large. Thus, BD2K wants to maximize the use of biomedical data to extract value from that data, also developing and disseminating data analysis methods. As highlighted by the authors, scientists have to face many challenges, such as expanding the availability and the use of EHRs spread in different formats in different research and care centers. The use of federated data catalogs, as also described by Brisimi et al. [21], is one step in this challenge.
Lo and Li [22] in their editorial in the journal Computer Methods and Programs in Biomedicine extend the concept of machine learning from alphanumerical data to various image modalities, where images are taken from CIS devoted to patients affected by liver or breast cancer, which are among the most common cancer types for men and women, respectively.
The topic of precision medicine is well described by the paper of Frey, Bernstam, and Denny [23] in the Journal of the American Medical Informatics Association. Precision medicine aims at matching genomics to therapeutics for an individual. Such a challenge requires considering big data and learning systems in order to properly identify the optimal treatment of that individual. The typical disease in which precision medicine is applied is cancer, and the authors report about a database from clinical information systems storing some 160 K patients, 64 K of them being tracked across 134 research cohorts.
Machine learning over data from CIS can also help to predict the evolution of the disease of a patient. As an example, Swain and Kharrazi [24] in their paper in the International Journal of Medical Informatics describe a prediction model to estimate the readmission probability over a 30-day period. The model belongs to the category of Readmission Risk Prediction Models (RRPM), and it considers 297 prediction variables. These variables are extracted from HL7 messages transmitted by Health Information Exchange organizations. The model helps in preventing unplanned hospital readmissions.
Turgeman and May in their paper published in the journal Artificial Intelligence in Medicine [25] describe a mixed-ensemble model for hospital readmission prediction. Their predictive model is based on a C5.0 tree classifier, coupled to a Support Vector Machine (SVM) to increase the performance of the classifier. The model has been applied to data of some 20 K inpatient admissions for some 4.8 K patients suffering from congestive heart failure. The model reaches an overall accuracy of about 85%, which supports its efficacy.
Zhao et al. [26] in their paper published in the Journal of Biomedical Informatics describe how machine learning can benefit from considering heterogeneous temporal data coming from EHRs. Traditional machine learning algorithms work on data organized in tables. The approach of the authors starts from the consideration that clinical events are unevenly distributed over time. Temporal machine learning, i.e., machine learning which leverages the temporality of considered data, can exploit this feature, and substantial improvements can be achieved by better focusing on collected data and the temporal distance among events.
Miotto and Weng [27] in their paper published in the Journal of the American Medical Informatics Association describe how reasoning on data from EHRs about diagnosis, medications, lab results, and clinical notes can be used to identify patients eligible for clinical trials. The approach described, known as cohort selection, starts from data of patients already enrolled in clinical trials, and reasons to profile the ’target patient’. Patients who comply with this ’target patient’ can then be enrolled for the trial. The approach was tested on 262 patients already enrolled, and used to select new patients for the trial from a population of some 30 K patients.
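Any readmission risk model of this kind first needs an outcome label stating whether a stay was followed by a readmission within 30 days. The Python sketch below shows one possible way to derive such a label; it is an assumption-laden illustration and not the procedure of [24]. The function name `readmitted_within` is hypothetical, the window is counted in days from discharge to the next admission, and planned and unplanned readmissions are not distinguished.

```python
from datetime import date
from typing import List, Tuple

Stay = Tuple[date, date]  # (admission date, discharge date) of one patient


def readmitted_within(stays: List[Stay], window_days: int = 30) -> List[bool]:
    """Label each stay (in chronological order) as followed by a readmission."""
    ordered = sorted(stays)
    labels = []
    for index, (_, discharge) in enumerate(ordered):
        if index + 1 < len(ordered):
            next_admission = ordered[index + 1][0]
            gap = (next_admission - discharge).days
            labels.append(0 <= gap <= window_days)
        else:
            labels.append(False)  # no later stay is known for this patient
    return labels


# Hypothetical stays of a single patient.
stays = [
    (date(2018, 1, 1), date(2018, 1, 5)),
    (date(2018, 1, 20), date(2018, 1, 25)),  # 15 days after first discharge
    (date(2018, 4, 1), date(2018, 4, 3)),    # 66 days after second discharge
]
print(readmitted_within(stays))  # [True, False, False]
```

In a real setting the last stay of each patient is censored rather than truly negative, which is one reason why the labeling rule deserves explicit documentation alongside the model.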
Moskovitch et al. [28] in their paper published in the Journal of Biomedical Informatics propose a framework (namely, Maitreya) for predicting medical events in order to prevent disease, to understand disease mechanism, and to increase patient quality of care. The approach moves from data stored in clinical information systems, considering some 4.5 M patients, and focuses on duration and gaps of events, which are sparse in time, to discover frequent time interval related patterns (TIRP): patterns are then used as prognostic markers. The approach has been successfully applied to 28 frequent, clinically relevant procedures.
Shknevsky, Shahar and Moskovitch [29] in their paper in the Journal of Biomedical Informatics deal with frequent interval-based temporal patterns to be discovered in clinical data of patients suffering from chronic diseases (cancer, hepatitis, and diabetes). Detected patterns (TIRP - frequent time interval related patterns) are then used to cluster patient clinical trajectories, thus predicting the evolution of the disease. The authors performed a deep consistency check, to ensure that similar TIRPs are constantly and repeatedly discovered in similar groups of patients.
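Interval-based temporal patterns such as TIRPs are built from pairwise temporal relations between time intervals. As a minimal illustration, the Python sketch below classifies the relation between two intervals using Allen's interval relations; assuming Allen-style relations here is our choice for the example, since the exact relation set used in [28], [29] is not detailed in this section, and the function name `allen_relation` is hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), e.g., in days, with start < end


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the Allen relation holding between interval a and interval b."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must satisfy start < end")
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start and a_end < b_end:
        return "starts"
    if a_start > b_start and a_end == b_end:
        return "finishes"
    if a_start > b_start and a_end < b_end:
        return "during"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    # Every remaining case is the inverse of one of the relations above.
    return "inverse of " + allen_relation(b, a)


print(allen_relation((0, 3), (5, 9)))  # before
print(allen_relation((0, 5), (5, 9)))  # meets
print(allen_relation((0, 5), (3, 9)))  # overlaps
print(allen_relation((2, 4), (1, 8)))  # during
print(allen_relation((1, 8), (2, 4)))  # inverse of during
```

A pattern over several intervals can then be described by the relations holding between each pair of its intervals, and counted across patients to find the frequent ones.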
Zhang et al. [30] in the journal Methods of Information in Medicine use logistic regression, natural language processing, and neural network techniques over clinical data from emergency departments (ED) to predict hospital admissions or transfers. The goal is that of predicting the care pathway of a patient, after some basic data have been collected as the patient presented to the ED and underwent a triage process. Starting from an archive of some 47 K ED visits in 642 hospitals, 48 principal components were extracted and used for prediction. The authors claim a relevant improvement in prediction by the mixed deployment of the three techniques.
4.3 Clinical Decision Support Systems (DSSs)
The category DSS is represented by nine papers out of the grand total of considered papers: major journals we encountered during our review and that focus on the topic of DSS are JAMIA and JBHI.
Bennett and Hardiker [31] in the Journal of the American Medical Informatics Association review the literature on the use of computerized clinical decision support systems (CCDSSs) in EDs. The use of CCDSSs is extremely relevant in EDs, where time to decision must be as short as possible and physicians can really benefit from a CCDSS: the efficacy of the treatment also relies on the time required to start the treatment itself. According to the survey, patients over 70 years are five times more likely to be admitted than patients younger than 30 years, meaning that a CCDSS for an ED must be particularly sensitive to chronic diseases. The survey detects 23 studies from the literature which evaluate the impact of CCDSSs in EDs. Surprisingly, only 13 out of the 23 studies identify a significantly positive impact on clinical care, the authors write.
Ohno-Machado [32] in the Journal of the American Medical Informatics Association in 2014 recalls that the informatics community has addressed the structure of EHRs to be the basis of clinical decision support systems (CDSSs) for many decades. The author highlights that, at the time of writing, there is still a separation between research and clinical health systems (most of them are problem-oriented or pathology-oriented) and translational research involving genomic and clinical data.
Wright et al. [33] in the Journal of the American Medical Informatics Association survey and classify the reasons which lead to CDSS alert malfunctions. The survey detects 68 cases of alert malfunctions (the rules of the CDSS do not fire and the CDSS does not send out alerts to physicians to highlight abnormal situations) in 14 sites across the US. Detected malfunctions are then classified according to a taxonomy the authors propose.
The taxonomy includes four major dimensions for malfunctions: cause of the malfunction (build errors; release of new codes which make the rules of the CDSS obsolete; defect of the EHR; computer environment migration); mode of discovery (mainly user reporting); start of the malfunction (which starts as the CDSS is deployed); and effect on rule firing (wrong rule action or system slows down due to some rules). The major result of the paper is in the effort of identifying malfunctions, so that further releases and/or systems could avoid the pitfalls.
Rahulamathavan et al. [34] in their paper published in the IEEE Journal of Biomedical and Health Informatics consider the problem of preserving the privacy of clinical data. The problem occurs when a physician sends patient’s data to an outsourced DSS via the Internet, to check the answer from the DSS: the outsourced system may be unreliable or not compliant with the policies of the CIS where patient’s data are originally stored. The authors propose a new encryption algorithm which fits such privacy needs, and which can be enriched and extended to cover the needs of transferring patient’s data to cloud computing systems.
Yoon, Davtyan, and van der Schaar [35] in their paper published in the IEEE Journal of Biomedical and Health Informatics consider the problem of predicting the evolution of the disease of a patient to suggest the optimal therapy. In fact, the paper proposes a discovery engine (DE) which starts from the patient’s characteristics and data stored in the clinical information systems to perform a personalized prediction. The engine detects which are the most relevant characteristics to exploit. According to the authors, the main feature of the DE is that performance remains good even with a large number of contexts. As application domain, the DE has been applied to breast cancer and related therapies, providing the physicians with an average improvement of 2.18–4.20% with respect to traditional DSSs.
4.4 Natural Language Processing
Natural language processing (NLP) is needed to manage clinical information, as different kinds of clinical information are acquired in an unstructured or semi-structured form. Accordingly, NLP in medicine is a wide and vital area of research, where one of the underlying goals consists of extracting knowledge from natural language texts coming from different sources, as medical records, reports, or social media. Within this category, we have both survey papers that confirm the high liveliness of this research community, and research papers ranging on different clinical and methodological issues.
In Kreimer et al., [36], the authors follow a sound and systematic approach to consider and discuss existing NLP systems specifically developed for clinical domains. The analyzed systems allow the extraction of structured information from unstructured free texts. Different bibliographic databases were considered for the survey and, at the end of an articulated screening and selection phase, 86 papers were considered in detail. They describe 71 different clinical NLP systems. Such systems range over different clinical and research tasks by adopting different techniques. While some tasks are suitably acknowledged and sound solutions have been proposed, such as, for example, the identification of medication information and the extraction of cancer features from pathology reports, some other challenges remain unsolved, as, for example, the extraction of temporal information or the mapping of concepts expressed through natural language expressions to standard terminologies.
In Ford et al. [37], the focus is on the role of EMRs with respect to health-related research. Accordingly, the authors propose a survey narrower in the scope with respect to the previous one, and analyze papers that deal with incorporating information from texts into algorithms that support the identification of clinical cases. The authors, after a systematic search through literature, identified 67 papers focusing on the extraction of information from free text of EMRs, with the explicit goal of detecting cases having a specific clinical condition. This survey highlights that the considered papers mainly deal with US EMRs. Both rule-based and machine learning methods have been adopted without any clear difference with respect to the accuracy of the proposed approaches. Moreover, it is quite evident that including information from text significantly improves the performance of such algorithms, with respect to only considering coded information. Quite interestingly, the authors underline the need to standardize the result reporting of algorithm performances, as for accuracy metrics. The topic of using NLP to process EHRs has also been dealt with by Wang et al. [38] and applied to congestive heart failure patients.
In Goldstein et al. [39], the authors evaluate the effectiveness of a new methodology for the automated creation of meaningful free-text summaries from longitudinal clinical records. Moreover, they consider the potential benefits to the clinical decision-making process when the proposed method is applied to build draft letters that can be manually improved by clinicians. The general knowledge-based system, named CliniText, has been applied to the automated free-text summarization of longitudinal Intensive Care Unit (ICU) medical records, using an ICU clinical knowledge base created with the involvement of two ICU clinical experts. CliniText generated free-text summary letters for 31 different patients, and these letters were compared with the original discharge letters written by physicians. The comparison was performed according to different measures, for example relative completeness, readability, and “semantic accessibility” (i.e., how quickly and correctly other physicians understood the clinical content of the letter). After presenting quantitative results that confirm the soundness of the proposed methodology, the authors underline that the use of summarization systems would enhance such letters in terms of standard structure, completeness, and time required for their composition. Such enhancements would positively influence the quality of the decisions made by other clinicians who rely on these summaries.
In [40], Agarwal et al. consider an emerging problem in many healthcare institutions worldwide. Indeed, both to preserve patients’ quality of life and to reduce healthcare-related costs, hospital readmission rates have to be monitored by healthcare institutions. In particular, chronic diseases account for many hospital readmissions, which have to be kept under control. In this paper, chronic obstructive pulmonary disease (COPD) has been explicitly considered, as it is highly relevant in many countries and requires continuous, long-lasting monitoring of affected patients. This work focuses on the use of unstructured clinical notes to statistically predict the patients most at risk of readmission. A framework is proposed that uses natural language processing for the analysis of clinical notes. The prediction of readmissions is based on the selection of the most suitable algorithms from the fields of data mining and machine learning.
The paper [41] by Safari and Patrick considers natural language issues with respect to the ability of clinical users to easily perform complex research-oriented queries on EMRs. More specifically, the authors focus on supporting research questions that involve internal time-event dependencies, through cascaded queries. The proposed approach is based on an extension of the recently proposed Clinical Data Analytics Language (CliniDAL). Different aspects of research-oriented queries have been considered, from the elicitation of the subjects to be included in a study, to the time span of the experiment, to the definition of the control group. Three different scenarios have been considered for evaluation. The evaluation confirms that the proposed system can support the expression of complex queries, including temporal aspects, by users who are not familiar with the details of Structured Query Language (SQL).
4.5 Process Mining and Pathway Identification
Recently, there has been increasing attention to representing and reasoning about the complex and coordinated execution of clinical and healthcare activities. Such intertwined activities compose complex processes that may stem from the application of clinical practice guidelines; from care pathways, which may be viewed as the application of guidelines and internal good practice policies to specific domains and clinical environments; or, more generally, from organizational and clinical plans. According to one of the most common definitions of care pathways (CPs), we may consider them as complex interventions for the mutual decision-making and organization of care processes for a specified group of patients during a given period. Such CPs, often characterized by complex decision-making tasks and data-intensive activities, need to be represented, designed, managed, analyzed, and discovered [42], [43]. In line with the focus of this survey paper, we consider here clinical process and care pathway mining, where AI-based approaches have been widely applied.
In Rojas et al. [44], the authors provide a survey of healthcare process mining, which consists of deriving knowledge from data generated and stored in healthcare/hospital information systems, in order to analyze and discover the executed processes. Seventy-four papers with associated case studies have been considered and discussed. In particular, eleven main features characterize the analysis: process and data types; frequently asked questions; process mining techniques, perspectives, and tools; methodologies; implementation and analysis strategies; geographical analysis; and medical fields. The survey highlights and suggests different techniques and tools to adopt for healthcare process mining. Moreover, it underlines the importance of process mining for supporting process-aware information systems. Adopting process-aware information systems could benefit both the quality of the performed healthcare processes and the optimal use of the related resources.
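To make the idea of discovering executed processes from stored data more concrete, the following minimal sketch counts directly-follows relations in a toy event log, a basic building block of many process discovery techniques. It is only an illustration under stated assumptions: the event log layout (case identifier, timestamp, activity), the function name `directly_follows`, and the sample activities are hypothetical and are not taken from [44].

```python
from collections import Counter, defaultdict


def directly_follows(event_log):
    """Count how often activity a is directly followed by activity b.

    event_log: iterable of (case_id, timestamp, activity) tuples.
    This layout is an assumption made for the sketch.
    """
    # Group events by case (e.g., one patient episode per case).
    traces = defaultdict(list)
    for case_id, timestamp, activity in event_log:
        traces[case_id].append((timestamp, activity))

    dfg = Counter()
    for events in traces.values():
        # Order the events of each case by timestamp.
        events.sort(key=lambda event: event[0])
        activities = [activity for _, activity in events]
        # Count each pair of consecutive activities.
        for a, b in zip(activities, activities[1:]):
            dfg[(a, b)] += 1
    return dfg


# Hypothetical toy log: three cases, events not necessarily in order.
log = [
    ("p1", 1, "admission"), ("p1", 2, "triage"), ("p1", 3, "discharge"),
    ("p2", 1, "admission"), ("p2", 3, "discharge"), ("p2", 2, "triage"),
    ("p3", 1, "admission"), ("p3", 2, "discharge"),
]
dfg = directly_follows(log)
for (a, b), count in sorted(dfg.items()):
    print(f"{a} -> {b}: {count}")
```

The resulting counts can be read as a weighted graph over activities; real process mining tools build on richer models and filtering steps than this sketch shows.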
In Huang et al. [45], the discovery of CP patterns is addressed through clinical event logs, which record various treatment activities. The authors propose a novel approach to CP pattern discovery that represents CPs through an extension of the Latent Dirichlet Allocation family, jointly modeling various treatment activities and their occurrence. In order to evaluate both the applicability and the soundness of the proposed approach, two real-world scenarios have been considered, namely unstable angina and oncology. The obtained results show the feasibility of the proposed CP pattern mining approach.
In Gotz et al. [46], the authors present a methodology for interactive pattern mining and analysis of past clinical patient data. The method supports an ad-hoc visual exploration of the mined patterns. The proposed approach combines visual queries, used for the interactive specification of the clinical episodes to look for, with pattern mining techniques that allow the discovery of significant intermediate events inside a clinical episode. Moreover, interactive visualization techniques are integrated that allow the user to identify event patterns associated with specific outcomes, together with their temporal behavior. A prototype implementation is presented as a proof of concept of the proposed methodology, and its successful application to some real-world clinical domains, namely heart failure, hypothyroidism, and hypertension, is described.
Mining and visualizing CPs from EHR data is also the main topic of [47], where Zang et al. propose a practice-based CP development process and a data-driven methodology for deriving common clinical pathways from EHR data. This patient-centered approach aims at facilitating evidence-based care. Indeed, CPs help translate the best available evidence into clinical practice, suggesting the most suitable treatment sequences for specific therapy-based goals. The authors focus on visit data of chronic kidney disease patients who developed acute kidney injury in some given years. Such data were extracted from the EHR and mapped into one-dimensional sequences using novel constructs designed to capture information related to different visit facets, such as purpose, procedures, medications, and diagnoses. Clustering visit sequences allows the identification of distinct patient subgroups. Markov chains have been used to characterize visit sequences. Significant transitions are extracted and visualized as CPs across subgroups. According to the clustering results, CPs provide insights about the evolution of patients’ conditions and medication prescriptions over time. Pathways associated with typical disease progression have been identified, as well as practices consistent with guidelines. The visualization of pathways depicts the likelihood and direction of disease progression within different contexts.
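As a rough illustration of how Markov chains can characterize visit sequences, the sketch below estimates first-order transition probabilities from a few toy sequences and keeps only the transitions above a probability threshold. It is a simplified sketch under stated assumptions, not the method of [47]: the state labels, the function names `transition_probabilities` and `significant_transitions`, and the threshold-based notion of "significant" are all hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities.

    sequences: list of visit sequences, each a list of state labels.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count every pair of consecutive visits.
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1

    probs = {}
    for state, next_counts in counts.items():
        total = sum(next_counts.values())
        # Normalize the counts of each row into probabilities.
        probs[state] = {nxt: n / total for nxt, n in next_counts.items()}
    return probs


def significant_transitions(probs, threshold=0.5):
    """Keep transitions whose probability is at least `threshold`.

    The threshold is an assumed, illustrative notion of significance.
    """
    return sorted(
        (state, nxt, p)
        for state, row in probs.items()
        for nxt, p in row.items()
        if p >= threshold
    )


# Hypothetical visit sequences for three patients.
visits = [
    ["office", "lab", "office", "hospital"],
    ["office", "lab", "hospital"],
    ["office", "office", "lab"],
]
probs = transition_probabilities(visits)
for state, nxt, p in significant_transitions(probs, threshold=0.5):
    print(f"{state} -> {nxt}: {p:.2f}")
```

In a setting like the one described above, such a computation would be repeated per patient subgroup after clustering, so that the retained transitions can be drawn as one pathway per subgroup.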