Keywords
Artificial Intelligence - machine learning - expert systems - open science - open
source
Introduction
Artificial intelligence (AI) is a broad term that encompasses a range of technologies
(many of which have been under development for several decades) that aim to use human-like
intelligence for solving problems [1]. The earliest AI systems used symbolic logic (such as “if-then” sequences) to create
so-called “expert systems”. In the healthcare domain, these expert systems have been
designed by clinicians (providing the clinical knowledge) working with programmers
(translating their expertise into symbolic logic). For example, healthcare workers
triaging patients could be guided by an expert system through a series of questions
asked to their patients. They would then enter patient responses into the system which
would be used to determine which questions to ask next and, ultimately, the system
could support the management of patients by suggesting a triage decision. These types
of expert systems are now widely used by services such as the NHS 111 telephone triage
service (currently being beta-tested as a mobile app in the UK [2]) and other commercial symptom-checking apps [3].
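The question-by-question flow described above can be sketched as a small rule base. This is only an illustrative sketch: the questions, node names, and triage outcomes below are invented for the example and are not drawn from NHS 111 or any real clinical system.

```python
# Hypothetical rule base for illustration only; not clinical guidance.
# Each node is either a yes/no question or a final triage suggestion.
RULES = {
    "start": {
        "question": "Is the patient struggling to breathe?",
        "yes": "emergency",
        "no": "chest_pain",
    },
    "chest_pain": {
        "question": "Does the patient have chest pain?",
        "yes": "urgent",
        "no": "fever",
    },
    "fever": {
        "question": "Has the patient had a fever for more than three days?",
        "yes": "routine",
        "no": "self_care",
    },
    "emergency": {"decision": "Call emergency services"},
    "urgent": {"decision": "Attend urgent care today"},
    "routine": {"decision": "Book a routine appointment"},
    "self_care": {"decision": "Self-care advice"},
}


def triage(answer_for):
    """Walk the rule base, asking one question at a time.

    `answer_for` is a callable that takes a question string and returns
    True (yes) or False (no), standing in for the healthcare worker
    entering the patient's response.
    """
    node = RULES["start"]
    asked = []
    while "decision" not in node:
        question = node["question"]
        asked.append(question)
        # The answer to this question determines which question comes next.
        next_key = node["yes"] if answer_for(question) else node["no"]
        node = RULES[next_key]
    return node["decision"], asked


if __name__ == "__main__":
    # Scripted responses: no breathing difficulty, but chest pain is present.
    responses = {
        "Is the patient struggling to breathe?": False,
        "Does the patient have chest pain?": True,
    }
    decision, asked = triage(lambda q: responses[q])
    print(decision)
```

Every rule here is written explicitly by a person, which is the defining feature of the expert system approach: the decision path can be read directly from the rule base.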
Another approach to AI is the use of machine learning (ML) techniques including artificial
neural networks (ANNs). Using an ANN approach, computer programmes create decision-making
networks of artificial “neurons” that operate in ways similar to biological nervous
systems [4]. The main difference between an ML AI and an expert system AI is that the former
is not explicitly specified by experts but rather is created by a process of automated
iterative improvements (supervised or unsupervised by human experts). By using “training”
datasets that match patterns of inputs (such as symptoms, medical images, or biomedical
signals) with specified outputs (such as medical diagnoses), the ML programme can
iteratively learn which network will match the most patterns correctly. The programme
will aim to maximise the “area under the curve” (AUC) when the true positive rate
is graphed against the false positive rate [5]. A predictive ML algorithm will have a low false positive rate and a high true positive
rate, resulting in a high AUC on a scale of 0-1.
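As a rough illustration of iterative training followed by AUC evaluation, the sketch below fits a single artificial “neuron” to a tiny invented dataset and then scores it. Several things here are assumptions made for the example: the data and feature meanings are hypothetical, the neuron is trained by gradient descent on log loss (a common stand-in, rather than maximising AUC directly), and the AUC is computed as the proportion of positive-negative pairs ranked in the right order, which is equal to the area under the curve of true positive rate against false positive rate.

```python
import math


def train_neuron(inputs, labels, epochs=200, learning_rate=0.5):
    """Fit one logistic 'neuron' by repeated small weight corrections."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            prediction = 1.0 / (1.0 + math.exp(-z))
            error = prediction - y  # gradient of log loss with respect to z
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def score(weights, bias, x):
    """Predicted probability of the positive class for one input."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical training data: two binary findings per case, one label.
    inputs = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
    labels = [1, 1, 0, 0, 1, 0]
    weights, bias = train_neuron(inputs, labels)
    scores = [score(weights, bias, x) for x in inputs]
    print("AUC on training data:", roc_auc(labels, scores))
```

An AUC measured on the training data, as here, is optimistic; in practice it would be reported on held-out cases that the model has not seen.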
Although ML has been used for decades, it is only recently that the combination of
sufficient processing power and large enough training datasets has enabled the creation
of ML algorithms that can compete with or out-perform symbolic approaches. The performance
of ML has now been established through a number of high-profile achievements such
as Google’s AlphaGo and its successors winning a series of Go championships between 2015 and 2017
[6]. Commercial applications of ML include improving search engines (especially for
images and voice-powered searching), image processing, and robotic control systems
(such as automobile safety systems that use lane-keeping and automatic braking
and acceleration technology). In healthcare, ML-powered clinical decision support
systems (CDSSs) have been proposed for image analysis and interpretation for radiology
[7], dermatology [8] [9], and pathology [10], and for improving the scope and accuracy of biomedical signal interpretation [11] [12] [13].
Against this recent history of rapid technological developments, there are several
concerns as AI begins to be applied to the healthcare context. Healthcare is a
safety-critical industry. In contrast to the low-risk technology demonstrations where systems
such as AlphaGo and IBM Watson have won high-profile game shows or competitions, the
healthcare community will expect a higher level of openness and validation before
AI-based tools are adopted. Through a rapid review of the recent literature and an
IMIA Open Source Working Group discussion, we have examined how the principles of
Open Science [14] ([Figure 1]) could inform the adoption of AI in healthcare.
Fig. 1 Open Science Taxonomy (Attribution: Petr Knoth and Nancy Pontika CC BY 3.0).
Open Data
New ML systems require large datasets to improve their accuracy and predictive capabilities.
Open data is a concept whereby data is freely available under an open license [15]. It offers significant promise within healthcare (as previously reported by the
IMIA Open Source Working Group [16]). There is now a large range of open biomedical datasets available for training
new ML algorithms developed by governments, medical societies, and international research
collaborations ([Table 1]).
Table 1
Examples of Open Data for Biomedical Research.
| Organisation | Open Datasets |
| --- | --- |
| US National Center for Biotechnology Information (NCBI) and National Cancer Institute (NCI) | GEO (Gene Expression Omnibus), GenBank, PubMed, and -omics datasets [17] [18]. |
| European Bioinformatics Institute (EBI) | Genome-Wide Association Study (GWAS), UniProt [19] and EBI-Pfam (Protein, DNA, RNA, and X family), and other catalogues [20]. |
| Research Collaboratory for Structural Bioinformatics: Rutgers and UCSD/SDSC | The Protein Data Bank (RCSB-PDB), a databank of 3D protein structures [21]. |
| Gene Ontology Consortium (GOC) | Gene Ontology, a framework for the computational representation of genes and their biological functions [22]. |
| SIB - Swiss Institute of Bioinformatics; CPR - NNF Center for Protein Research; EMBL - European Molecular Biology Laboratory | STRING database of protein-protein interactions and outputs data [23]. |
| Center for Genomics and Personalized Medicine at Stanford University | RegulomeDB, an open database of SNPs with known regulatory elements, combined with GEO and other open datasets [24]. |
| US National Cancer Institute | LIDC-IDRI, a lung image database of 1,018 cases [25]. |
The major open datasets outlined above are generally for bioinformatics purposes rather
than for clinical informatics. However, new open datasets for ML derived from clinical
records, such as the CheXpert dataset of chest X-rays [26], are now becoming available. When dealing with sensitive healthcare data from electronic
health record (EHR) systems and other clinical systems, privacy is a major issue. Even
after de-identification, sensitive healthcare data often cannot be released as open
data [16] and secure environments need to be used to conduct ML analyses on sensitive data
sources [27].
Open Research
There is a large and growing body of research on AI technologies and techniques with
many studies focusing on the predictive abilities of new algorithms. There is less
research, however, on the mechanisms of action (what happens inside the “black box”)
of AI-based algorithms or on the clinical effectiveness of AI-based clinical decision
support systems in real world applications. The “black box” concept in AI research
usually refers to neural networks that are so complex that it can be very difficult
for even skilled programmers to understand how decisions are made. For some uses in
healthcare, a balance will need to be struck between the advantages that the “black
box” approach offers (in terms of enabling a very large number of factors to be weighed
in a particular decision) and the problems caused by the obscurity of the decision-making
process. For digital health companies, however, the “black box” may offer a commercial
advantage and regulators and policy-makers will need to strike a careful balance between
fostering innovation and insisting on openness that may harm commercial interests.
Implementation research shows that even if new medical technologies work “in the lab”,
they might not be readily transferable to real world clinical practice [27]. AI systems may have a high AUC with strong predictive accuracy but run into problems
when used in clinical practice. Technologies need to be aligned with clinical workflows
and be acceptable to clinicians without introducing new problems [28]. For example, in order to be able to use AI-based CDSSs in a clinical setting, users
may need an accurate mental model of how the algorithms are using the various data
sources to make decisions they can trust [28]. With “black box” systems, this may not be possible and could limit their clinical
effectiveness. If healthcare workers give up on understanding how decisions are made,
it could lead to over-dependence on technology and a degradation of
clinical decision-making skills, leading to worse healthcare outcomes for patients
[29].
As highlighted by regulatory agencies such as the National Institute for Health and
Care Excellence (NICE) [30] [31] and the US Food and Drug Administration (FDA) [32] [33], AI-driven CDSSs will usually require a high standard of clinical evidence before
they can be safely used. However, there are now a large number of digital health start-ups
operating in “stealth mode”, with large investments and research and development programmes
but without published evidence of mechanism of action or clinical effectiveness [34]. This conflict is likely to be resolved as new regulations are implemented, but
vigilance will be required by clinicians, researchers, and policy-makers to ensure
that open research approaches are adopted so that evaluations of AI systems are clinically
relevant and reproducible, and can enable informed decisions about whether or not
they should be adopted.
Open Access
The AI community has benefited immensely from the current trend towards open access
publishing. Pre-print archives such as arXiv.org, open access proceedings from large
AI conferences, and researcher-created GitHub code repositories mean that tools, techniques,
and results of countless experiments are open at no cost to researchers around the
world. To underline the field’s commitment to open access, a recent campaign by the
Journal of Machine Learning Research (JMLR), one of the field’s open-access journals,
advocated for researchers to state when their papers describing ML research were rejected
by closed-access journals and garnered the support of more than 3,000 researchers
[35].
Although major medical publications have not traditionally embraced the open access
publishing movement to the same degree as AI research organisations, there has recently
been a significant shift to the open access publishing model, driven by new open access
requirements from research funders and pressure from academics and clinicians for
greater openness of research.
Open Educational Resources (OER)
One of the first and largest “massive open online courses” (MOOCs) was on the topic
of Machine Learning [36]. The prevalence of open courseware, now numbering hundreds of courses with millions
of students from around the world, is supporting a growing international cohort of
data scientists and also provides healthcare workers from a wide range of backgrounds
the opportunity to become conversant in AI technologies. As AI technologies are applied
in healthcare, it will be critical that healthcare workers and managers develop an
overall understanding of how AI will impact the future of healthcare provision. Developers
of AI systems will likewise need a good understanding of how healthcare differs from
other contexts in order to ensure systems are safe and clinically effective. There
are now a number of open health informatics courses for both healthcare workers and
software developers [37] [38] [39] [40] that cross this interdisciplinary boundary. These courses will need to keep up to
date with new developments in AI as they emerge.
Open Tools
The field of artificial intelligence has a long history of using open source tools
for creating AI systems and managing the large datasets needed to train neural networks.
[Table 2] shows examples of major open source ML software libraries, many of which have been
released by large Internet companies such as Amazon and Google. These organisations
may not be able to generate fees from software licences but can still generate significant
income from the use of their cloud systems based on open source frameworks. This business
model supports the further development of open source frameworks. These tools are
routinely used by healthcare AI research groups with, for example, Google’s TensorFlow
library and open data being used for classifying lung cancer and predicting mutations
[41].
Table 2
Major open-source machine-learning libraries and their management bodies.
| Framework | Organization | License |
| --- | --- | --- |
| TensorFlow [42] | Google | Apache 2.0 |
| Torch [43] | Facebook | 3-Clause BSD license |
| Microsoft Cognitive Toolkit [44] | Microsoft | MIT license |
| Chainer [45] | Preferred Networks Inc. | MIT license |
| Caffe2 [46] | Facebook | 3-Clause BSD license |
| Scikit-learn [47] | Community-based | 3-Clause BSD license |
| DSSTNE [48] | Amazon | Apache 2.0 |
[Table 3] shows statistical software together with other software packages used for ML applications
in healthcare. GNU R is a statistical language and environment whose packages, distributed through CRAN (the Comprehensive R Archive Network), are used for ML [49]. WEKA is an open-source data mining and ML tool [50], and TopHat is a collection of bioinformatics tools used for gene ontology and
mapping [51].
Table 3
Software packages frequently used for AI in healthcare.
| Name | Description | License |
| --- | --- | --- |
| GNU R | Statistical language/environment | GPL3 |
| WEKA | Data mining and machine learning tools | GPL3 |
| TopHat | A spliced read mapper for RNA-Seq | Boost Software License 1.0 |
Conclusions
The field of AI has adopted open science approaches to education, data sharing, research,
and software development. These approaches have also been adopted for AI research
and development in the healthcare domain. However, as AI-based data analysis and CDSSs
begin to be implemented in healthcare systems around the world, further openness of
clinical effectiveness and mechanisms of action will be required. The use of “black-box”
systems or the introduction of systems that have not demonstrated clinical effectiveness
may not be acceptable approaches for the safety-critical healthcare context.