2 Re-use of Health Data in Europe for Public Health
2.1 The Potential of Health Administrative Data
Several EU projects have promoted the re-use of existing medical databases for research purposes, e.g. [[5], [6]]. Health administrative data exhibit major strengths for research, especially when:
- The entire population of a region or a country is covered by the data collection.
- Construction of longitudinal histories of individuals is possible.
- Multiple years of data allow powerful identification of trends over time.
- Data are accessible with low research costs.
Data can therefore be used to increase scientific knowledge in a number of areas: epidemiological surveillance, health determinants, health services research, economic evaluations, risk/benefit ratio of drugs and medical devices in “real” populations, comparative effectiveness research, and prevention. In particular, the re-use of such data allows adequately powered research on rare diseases and rare events.
2.2 European Landscape on the Re-use of Health Administrative Data
The BridgeHealth consortium (http://www.bridge-health.eu) has established a list of European projects that yield indicators on which stakeholders can base public health decisions, such as Euro-Peristat (http://www.europeristat.com/) in perinatal health [[7]]. Other projects like ECHO [[8]] and Euro-HOPE [[9]] focus on healthcare performance comparisons. More focused information systems and registries were created for specific conditions like injuries (http://ec.europa.eu/health/data_collection/databases/idb/) and diabetes (http://www.eubirod.eu/), or for specific types of data, such as Diagnosis Related Groups (http://eurodrg.eu/). Finally, projects such as SHARE, EHLEIS or INEQ-CITIES focus on the social determinants of health by collecting data from several cities [[10]].
The heterogeneity of information across Europe has been a major obstacle to conducting cross-national studies. Projects like EU-ADR [[11]], EUROmediCAT [[12]], PROTECT [[13]], and EMIF [[6]] have developed their own solutions, mainly based on distributed networks and common data models like OMOP [[14]]. Thus, several projects in Europe provide cross-national information. However, there is currently no pan-European solution or comprehensive project addressing access to health administrative data for research.
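The distributed-network approach mentioned above can be illustrated with a minimal sketch. It assumes that each site has already mapped its records to one shared schema (a stand-in for a common data model, not the actual OMOP tables) and that only aggregate counts leave a site. The site extracts, field names, and the functions `local_counts` and `pool` are hypothetical and are not taken from any of the cited projects.

```python
from collections import Counter

# Hypothetical local extracts, already mapped to one shared schema
# (an illustrative stand-in for a common data model).
SITE_A = [
    {"person_id": 1, "drug": "statin", "event": "myopathy"},
    {"person_id": 2, "drug": "statin", "event": None},
    {"person_id": 3, "drug": "nsaid", "event": "gi_bleed"},
]
SITE_B = [
    {"person_id": 10, "drug": "statin", "event": "myopathy"},
    {"person_id": 11, "drug": "nsaid", "event": None},
]


def local_counts(records, drug):
    """Run inside each site: count exposed persons and their events.

    Only these aggregates are returned; individual rows never leave the site.
    """
    exposed = [r for r in records if r["drug"] == drug]
    events = Counter(r["event"] for r in exposed if r["event"] is not None)
    return {"exposed": len(exposed), "events": dict(events)}


def pool(results):
    """Run by the coordinating centre: sum the per-site aggregates."""
    exposed = 0
    events = Counter()
    for res in results:
        exposed += res["exposed"]
        events.update(res["events"])
    return {"exposed": exposed, "events": dict(events)}


# The same query is sent to every site; only the answers are combined.
pooled = pool([local_counts(site, "statin") for site in (SITE_A, SITE_B)])
print(pooled)
```

The design choice illustrated here is that harmonisation happens once, at the level of the schema, so the same analysis code can run unchanged at every site.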
3 Accessing Health Data for Research in Europe
This section provides an overview of the infrastructures put in place by some European countries and the different models implemented to provide access to health administrative data for research purposes. In several countries, the infrastructures are not limited to health administrative data (strictly speaking) but extend to other data generated in routine care (including patient records) or produced for public health (like registries). The publications and projects described in this paper exemplify the potential of such work for research in Public Health, but are not based on a systematic review.
3.1 Nordic Countries
Nordic countries, i.e., Denmark, Sweden, Norway, Finland, and Iceland, have a common history with similar political systems and national-level, taxpayer-financed health insurance coverage. In Sweden, Norway, and Denmark, national health registries are administered by a single authority and contain information covering the entire population regarding diverse aspects of population health: cancer, inpatient and outpatient information, drug prescription, causes of death, and biobank information. Administrative data on migration and socioeconomic status are also collected at a national level, with data available since the 19th century and follow-up of individuals from birth or immigration to death or emigration [[15],[16]]. National health registers are available for public research. Private entities can also access data for research purposes through public research institutions. Denmark is currently ahead in digitized access to data: Danish public researchers can access all health registers linked together via an online tool called ForskerMaskinen (the ResearchEngine). Researchers can thus access the entire set of health registers of the Danish population online, although exports are limited to results and aggregated data. Moreover, the joint Nordic project Tryggve is developing solutions to share data across national borders (https://neic.nordforsk.org/2016/12/13/tryggve-takes-care-of-sensitive-data.html).
An advantage of the Nordic registers in terms of accessibility is that different data sources can be linked via the personal identification number that each individual is assigned at birth or immigration. Datasets for analysis can therefore include very comprehensive data on a patient's trajectory within the health system. As an example, in one study conducted within the Euro-Peristat project, three quarters of the evidence found was derived from Nordic countries' data [[7]]. Researchers in Nordic countries can access data to set up nationwide patient cohort studies, adequately powered in terms of sample size and follow-up, from birth to death, in order to study the role of a large panel of health determinants [[17], [18]].
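Linkage through a shared personal identifier can be sketched as follows. This is a minimal illustration that assumes each register has been extracted as a list of rows carrying the same pseudonymised identifier; the register contents, the field names, and the function `build_trajectories` are hypothetical and do not reflect the structure of any actual Nordic register.

```python
from collections import defaultdict
from datetime import date

# Hypothetical register extracts keyed by a pseudonymised personal identifier.
HOSPITAL = [
    {"pid": "P1", "date": date(2010, 3, 1), "event": "admission: hip fracture"},
    {"pid": "P2", "date": date(2011, 6, 15), "event": "admission: asthma"},
]
PRESCRIPTIONS = [
    {"pid": "P1", "date": date(2009, 11, 20), "event": "dispensed: corticosteroid"},
]
DEATHS = [
    {"pid": "P1", "date": date(2012, 1, 5), "event": "death: pneumonia"},
]


def build_trajectories(*registers):
    """Group rows from all registers by identifier and order them in time."""
    trajectories = defaultdict(list)
    for register in registers:
        for row in register:
            trajectories[row["pid"]].append((row["date"], row["event"]))
    # Sorting the (date, event) tuples yields a chronological history per person.
    return {pid: sorted(events) for pid, events in trajectories.items()}


trajectories = build_trajectories(HOSPITAL, PRESCRIPTIONS, DEATHS)
for when, what in trajectories["P1"]:
    print(when.isoformat(), what)
```

Because the identifier is exact and shared across sources, no probabilistic matching on names or birth dates is needed in this sketch, which is the accessibility advantage described above.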
3.2 The United Kingdom
The care.data programme was set up to create a central database covering the overall English population (https://www.england.nhs.uk/ourwork/tsd/care-data/). However, it was cancelled after substantial concerns expressed by the public and general practitioners [[19], [20]]. The Clinical Practice Research Datalink (CPRD) is an initiative related to the care.data programme. It evolved from the General Practice Research Database (GPRD) to collate data from a large number of general practices in England and link this data to hospital data, around 50 disease registries and clinical audits, UK Biobank with genetic information and the loyalty cards of a large supermarket chain. CPRD currently includes data from over 600 general practices and over 10 million patients. It is very widely used internationally for epidemiological research [[21]], and even for clinical trials. This includes the recruitment of patients through their electronic health records (EHRs) for a pharmacogenetic study of statin-induced myopathy [[22]], the conduct of cluster randomised trials [[23]], and pragmatic point-of-care trials randomising patients between different standard treatments [[24]].
Other initiatives in the UK include (i) the Welsh Secure Anonymous Information Linkage System (SAIL), whose platform operates a remote access system providing approved users with secure data access and data analysis tools [[25]], (ii) the Scottish Health Informatics Programme (SHIP) [[26]], and (iii) the Connected Health Cities (CHC) programme (https://www.connectedhealthcities.org/) in the North of England. SAIL, SHIP and CHC all run substantial public engagement programmes aiming at a better understanding of the public's preferences and concerns related to the sharing of health data. Indeed, several elements are crucial: transparency about information security, dynamic consent with the ability to opt out of some specific uses of data, scientific transparency and reliability, and systematic public engagement [[27]].
3.3 France
There are more than 900 databases available for research in France (https://epidemiologie-france.aviesan.fr/en/epidemiology/home). Among them, CepiDC is a national database that gathers causes of death, and the National Health Insurance Inter-Scheme Information System (SNIIRAM) is a database managed by the French National Health Insurance Fund (CNAMTS) holding information on more than 65 million individuals, including all reimbursement data for medical consultations, drugs and procedures in ambulatory medicine, and hospital discharge summaries. Researchers can access the data after being accredited through a specific process. The number of publications based on these data has increased in recent years, covering topics in epidemiology and pharmacoepidemiology [[28]], health care costs and health care assessment [[29]]. Several publications were issued by the CNAMTS itself, estimating the medical and economic burden of chronic diseases, analysing social determinants of health [[30]], and comparing SNIIRAM to dedicated registries [[31]].
However, several obstacles still limit access to the data. An internal survey conducted by INSERM in 2015 (http://www.europeristat.com/images/Paris_2016_meeting/Collaborations/burgun-5apr16.pdf) showed that an average of 18 months and no fewer than 16 formal steps were needed to access such data, and approvals for record linkage were even more difficult to obtain. In 2016, this system moved to a more inclusive data repository, with claims data, hospital coded data sets, and causes of death, called the National Health Data System (SNDS, for “Système National des Données de Santé”). INSERM is setting up an infrastructure to facilitate access to the SNDS for research while preserving a high level of security and privacy.
3.4 Italy
Each Italian Region routinely collects information on healthcare activities to be shared with the central government. The use of regional data for research purposes is usually granted within specific projects. It is supported by different IT solutions, e.g., the Lombardy Region is studying a solution similar to the Danish Forskermaskine. Data collected through the New National IT System (NSIS) are used to evaluate the quality and appropriateness of healthcare services. Several information flows centred on either the provider or the citizen, covering hospitalizations, ambulatory care, and socio-sanitary activities, are stored in the NSIS. Moreover, the Italian National Institute of Statistics (ISTAT) maintains longitudinal databases for monitoring and evaluating the healthcare system and citizens’ health conditions. While the NSIS is mainly based on health administrative data, other databases are used for research purposes, among which those maintained by Health Search, the research institute of the Italian Association of General Practitioners, with more than 100 publications (https://healthsearch.it/), and Pedianet, a project promoted by an independent network of Italian paediatricians (http://www.pedianet.it/en/), also with a large number of publications, e.g., [[32]]. Administrative data collected at the regional level form the basis of an important national evaluation program, the National Outcome Project (PNE), which includes the evaluation of healthcare processes, interventions, providers, and access to healthcare services. Moreover, those data underpin several research projects and publications [[33]], mostly in pharmaco-epidemiology, health care management, and epidemiology, including studies on diseases with very low incidence [[34]].
3.5 Spain
The Spanish Ministry of Health (MSSSI), in coordination with the 17 Regional Health Departments, is responsible for collecting health data and maintaining health information systems for the Spanish National Health System (SNHS). A few certified data sources allow nation-wide public health monitoring and research.
The Primary Care Dataset [BDCAP] collects clinical data from a random sample of primary care episodes including information on active health problems, interventions, and intermediate health outcomes. BDCAP aims at studying the effectiveness, quality, and cost of primary care. BDCAP is curated and maintained by the MSSSI and it manages 2.79 million electronic health records (EHRs).[1]
The Primary Care Drugs Prescription Dataset [BIFAP] aims at carrying out pharmaco-epidemiological research in real-life conditions. The information is voluntarily collected by physicians working in primary care settings. BIFAP includes clinical and prescription data from 4.8 million patients. BIFAP is fostered by the Spanish Agency of Medical Products and Devices, a public agency of the MSSSI.[2]
The Atlas of Variations in Medical Practice (Atlas VPM) collates all hospital admission data since 2002 in the SNHS, along with demographic, socioeconomic, and supply data. Atlas VPM aims at eliciting unwarranted differences in healthcare performance (i.e., effectiveness, quality and safety, and efficiency) across all SNHS healthcare providers. Atlas VPM, which currently manages 70 million hospital episodes, is curated and maintained by the Institute for Health Sciences in Aragon (IACS), a public institution of the SNHS at regional level. All the research outputs are available in open access at www.atlasvpm.org.[3]
Taking advantage of the widespread coverage of EHRs in Spain[4], several Autonomous Communities (ACs) are developing major initiatives on their reuse for public health research and monitoring, e.g., BIGAN in Aragon or PADRIS in Catalonia. Nevertheless, achievements are uneven across the SNHS, partly because of the concerns raised by pioneer initiatives on the inadequate use of EHRs (which translated into stricter legal and ethical requirements), and partly because of the lack of technical and methodological capacity.
3.6 Germany
Health care research developed later in Germany than in other countries. Nonetheless, it is supported by considerable funding from the BMBF (Federal Ministry of Education and Research), institutions like the German Research Foundation, and the industrial sector. Many centres for public health research were created with BMBF funds, leading to an increased number of high-quality publications, e.g., [[38]], a development that was accelerated by the research use of data routinely documented by statutory health insurances.
Health data providers include health insurance funds, institutes like the German Institute of Medical Documentation and Information (DIMDI), and different research facilities. Procedures for gaining access to data for research purposes vary: access may be granted upon request, through a use contract, or through a project-specific agreement. The GKV-Versorgungsstrukturgesetz established the legal basis for the reuse of routine data collected by the statutory insurances (GKV) at DIMDI. DIMDI’s Information System for Health Care Data started its pilot operation in 2014, and its data have been used in several research projects. Data collected from health insurance companies directly contributed to the German Pharmacoepidemiological Research Database (GePaRD), whereby local health insurance data could be linked to death certificate data [[39]].
The German Network of Health Care Research has compiled best practices for health care research [[40]] in a series of memoranda that had a considerable influence on the research community. However, German public health research still has a lot to gain from initiatives to establish a modern Information Technology (IT) infrastructure for combining resources and enabling research with routine health data.
In summary, although we focused on a limited number of countries, we showed that health administrative data is abundant in Europe but access to this data for researchers is constrained by complexity and heterogeneity.
4 FAIR Principles: Impact for Public Health Informatics
Recently, a set of stakeholders—representing academia, industry, funding agencies, and scholarly publishers—has established a set of principles that we refer to as the FAIR Data Principles: data must be Findable, Accessible, Interoperable, and Reusable [[41]]. The FAIR principles put a specific emphasis on enhancing the ability of computers to automatically find and use the data, not only on supporting its reuse by individuals. Behind FAIR principles is the notion that algorithms are used to search for relevant data sources, to analyse the data sets, and to mine the data for scientific discovery. Therefore, the principles apply not only to the ‘data’, but also to the algorithms, tools, and workflows. The public health community will benefit from the application of the FAIR principles for:
- Data discovery (knowing which data are available and how to access them),
- Access to aggregated data at the European level,
- Access to individual data, based on a secure environment respecting national and European policies on data protection,
- Services for hypothesis-driven studies through shared models and algorithms,
- Services to enable knowledge discovery with big data approaches.
Moreover, to enable data reuse, the public health informatics community must pave the way and provide (i) codes of practice and (ii) semantic frameworks that sources and users should agree upon in the near future.
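The data discovery item in the list above can be illustrated with a small sketch of a machine-readable catalogue. The FAIR principles do not prescribe this particular schema: the entries, the field names (`keywords`, `access`, `level`), and the function `find_datasets` are hypothetical and serve only to show how an algorithm, rather than a person, could locate candidate data sources and their access conditions.

```python
# Hypothetical catalogue entries; none of them describes a real dataset.
CATALOGUE = [
    {
        "id": "demo-claims",
        "country": "FR",
        "keywords": {"claims", "hospital discharge"},
        "access": "accredited researchers",
        "level": "individual",
    },
    {
        "id": "demo-perinatal",
        "country": "DK",
        "keywords": {"births", "perinatal"},
        "access": "open",
        "level": "aggregated",
    },
    {
        "id": "demo-primary-care",
        "country": "ES",
        "keywords": {"primary care", "prescriptions"},
        "access": "accredited researchers",
        "level": "individual",
    },
]


def find_datasets(catalogue, keyword, level=None):
    """Return the ids of entries tagged with `keyword`.

    If `level` is given ("individual" or "aggregated" in this sketch),
    only entries offering that level of data are returned.
    """
    hits = [
        entry
        for entry in catalogue
        if keyword in entry["keywords"]
        and (level is None or entry["level"] == level)
    ]
    return sorted(entry["id"] for entry in hits)


print(find_datasets(CATALOGUE, "claims"))
print(find_datasets(CATALOGUE, "perinatal", level="aggregated"))
```

Such a search only works if sources describe themselves with agreed terms, which is why the shared semantic frameworks called for above matter.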
5 Legal Framework
The EU legal framework consists of Regulations, Directives, Decisions, Recommendations and Opinions. The aim of the EU legal framework for public health is defined in a single statement by the EU Treaty of Lisbon [[42]]: “A high level of human health protection shall be ensured in the definition and implementation of all Union policies and activities.” This rule means that public health aspects should be included in all EU policies. This objective is to be achieved through community support to the member States. However, the regulation of healthcare systems continues to be the responsibility of the member states, and there exists a patchwork of different laws and policies [[43]], and a multitude of Health Acts, Social Security Funding Acts, Hospital Laws, Cancer Acts, Human Tissue Acts, Data Protection Acts, Health and Medicines Acts, Codes of Professional Conduct, and Codes of Medical Ethics. This patchwork of laws has created challenges for data sharing across jurisdictional borders. The third EU Health Programme (2014-2020) aims at supporting and adding value to the policies of Member States to improve the health of EU citizens, reduce health inequalities, and protect citizens from cross-border health threats (Reg. 282/2014). These efforts are supported by the regulation on European statistics on health (Reg. 1338/2008, 2015/359), and the establishment of a European network for the epidemiological surveillance and control of communicable diseases (Dec. 1082/2013, 2014/504, 2002/253).
The European Commission is unifying personal data protection with the General Data Protection Regulation (GDPR) [[44]], which will replace the Data Protection Directive 95/46/EC in 2018, thus imposing new obligations on organizations that process personal data of EU residents, with a major impact on research involving health data [57]. The GDPR is an attempt to reconcile the often competing requirements of privacy protection and innovation in data management: (i) organizations that process personal health data for research purposes have to implement appropriate safeguards; (ii) under certain conditions, they may override a subject’s right to object to the processing of his/her data or to seek the erasure of his/her personal data (Art. 89). Research thus benefits from an exemption to the principle of purpose limitation through Art. 5(1)(b), 89(1), and 9(2)(j).
Public health research is treated as a subset of scientific research (Rec. 159). Therefore, the same exemptions and requirements apply here. However, the GDPR also contains several provisions applicable exclusively to public health research and covering (i) higher protections for the processing of sensitive data for health-related purposes; (ii) transfer of personal data to third countries if transfer is necessary for important reasons of public interest, e.g., contagious diseases; (iii) stronger requirements to consult supervisory authorities about processing activities.
Public health research must pay special attention to legal issues [[45]], e.g., the legislation of a registry data owner’s country may forbid cross-border exchange of specific data. Moreover, the risk of re-identification from individual data has to be assessed. This risk depends on the nature and availability of contextual information, and also on IT capabilities [[46]]. Unlike the United States Health Insurance Portability and Accountability Act (HIPAA), which exempts data from regulation if 18 specific personal identifiers are removed, the GDPR considers data as anonymous only when the individual cannot be identified by any means “reasonably likely to be used (...) either by the controller or by any other person” (Rec. 26). Thus, even if a user is neither able nor willing to re-identify individual data, a data set may still fall under the GDPR if it could be re-identified “with reasonable effort”. Public health research must build and maintain trust, and this can only be done by employing state-of-the-art IT techniques for the protection of sensitive data against misuse.
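One simple way to make the dependence of re-identification risk on contextual information concrete is to count how many records share the same combination of indirectly identifying attributes. The sketch below uses this k-anonymity style heuristic, which is an assumption of this illustration: it is neither the legal test of the GDPR nor a method taken from the cited references. The records, the field names, and the function `smallest_group` are hypothetical.

```python
from collections import Counter

# Hypothetical records with direct identifiers already removed.
RECORDS = [
    {"birth_year": 1950, "sex": "F", "postcode": "750", "diagnosis": "E11"},
    {"birth_year": 1950, "sex": "F", "postcode": "750", "diagnosis": "I10"},
    {"birth_year": 1950, "sex": "F", "postcode": "751", "diagnosis": "C50"},
    {"birth_year": 1982, "sex": "M", "postcode": "750", "diagnosis": "J45"},
    {"birth_year": 1982, "sex": "M", "postcode": "750", "diagnosis": "E11"},
]


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same values
    for the given quasi-identifiers (the k of k-anonymity)."""
    if not records:
        raise ValueError("records must not be empty")
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return min(Counter(keys).values())


# With a postcode, one person is unique on these attributes...
print(smallest_group(RECORDS, ["birth_year", "sex", "postcode"]))
# ...while dropping the postcode leaves every person in a group of at least two.
print(smallest_group(RECORDS, ["birth_year", "sex"]))
```

A group of size one means that anyone who knows those attributes from another source could single out the person, even though no name or identification number is present. This is why removing a fixed list of identifiers does not by itself establish anonymity in the sense described above.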
6 Discussion and Conclusion
Funding and governmental agencies require adequate data management plans for data generated by publicly funded research. Beyond proper collection, annotation, and archival, data stewardship includes the notion of FAIR data, with the goal that the data should be available for re-use in downstream investigations [[41]]. Conditions that simplify discovery, evaluation, and reuse include the ability to implement shared metadata, algorithms, and services. Until now, most European projects in public health have developed their own, mainly short-term, project-specific solutions, so that more far-reaching interoperability and data sharing issues still need to be addressed.
The notion of reference datasets already exists for certain domains, like genomics. Some efforts have been made to deliver such data sets in public health, e.g., the French “Echantillon Généraliste des Bénéficiaires” is an anonymous sample extracted from SNIIRAM. However, in contrast to standardized high-throughput data in genomics, public health data mainly consist of heterogeneous and low-throughput data. The resulting ecosystem for health data might become more diverse, thereby exacerbating the discovery, interoperability, and re-usability issues, both for humans and for automatic processing by computers. In this article, we put an emphasis on health administrative databases, which bring even more difficulties.
Following the example of the Nordic countries, several European countries aim at facilitating the re-use of their health administrative databases for research. This will increase scientific knowledge in a number of areas where the power of large databases is needed. However, only sophisticated IT methods can ensure appropriate data protection in that case.
Most advances in public health occur when high-quality data from population genomics, routine care, biobank annotations, costs, and other sources are combined with more traditional public health data. The convergence of several national initiatives in Europe, the adoption of the FAIR principles, the increasing number of studies reusing routine data, and the experience gained by the medical informatics community in addressing interoperability issues are all incentives to develop a common public health research infrastructure in Europe. An infrastructure able to provide secure and efficient access to health administrative data would undoubtedly boost the research capacities of existing bio-medically-focused research infrastructures.