Keywords
Medical data management - data quality assurance
1. Introduction
Epidemiological research in the context of cohort studies and registries requires
extensive personal, medical and socio-economic data often from multi-site acquisition.
Consequently, research becomes more and more complex regarding information acquisition
and exchange between participating institutions as well as data quality assurance.
According to Neugebauer et al. [[1]], a reliable rating of research data quality is only possible within the context
of a specific research question. As an example, Schrappe et al. [[2]] highlight the importance of monitoring the completeness of research data in the
context of registries. Therefore, relevant indicators must be identified to be able
to capture reliable information about data quality with respect to the research question
at hand [[2]]. Stausberg et al. [[3]], have identified a considerable degree of heterogeneity: in their review they compiled
34 different concepts of data quality indicators.
According to the guideline for data quality [[4]], provided by the Technology, Methods and Infrastructure for Networked Medical Research
e.V. (TMF), medical research needs to consider the (a) quality of structures, the
(b) quality of processes as well as the (c) quality of results. Further to (a) selected
variables, quality assessment needs to consider (b) specific data sets and (c) complex
research data pools.
Following the recommendations of this TMF guideline, the monitoring of data quality
should start during the data capture process, “to detect anomalies in research data
efficiently” (see [[4]], p. 119ff). Hereby, the highlighted data quality indicators are of significant
importance.
Focusing on single data variables, data quality monitoring might include the assessment
of missing values (TMF-1012 – TMF-1016), the examination of disallowed values (TMF-1021
– TMF-1026), the distribution of values (TMF-1006), the validation of representability
(TMF-1048), the observation of completeness (TMF-1046) as well as the investigation
of extreme values (TMF-1018) [[4]].
According to recommendations of Müller et al. [[5]], at least a periodical surveillance of distribution characteristics should be applied.
Especially the accuracy as well as the type and frequency of potential recurring data
errors must be an essential part of the investigation. A frequent quality and integrity
problem of research data is the proportion, and distribution of missing entries, which
can substantially complicate the evaluation of research data and the interpretation
of statistical results. Typically, incidental and systematic missings are distinguished,
whereas only systematic missings need to be further categorized regarding quality
aspects.
However, high data quality is connected to costs and knowledge. Especially in smaller
studies, budget and personnel constraints are limiting parameters. Basic statistics
and descriptive plots enable researchers to identify quality problems through visualization
of main characteristics as well as relationships and, therefore, facilitate monitoring
and reporting quality of research data [[6]].
To be able to assess data quality, a scientist needs to have an understanding of (a)
the data and its meaning, (b) basic statistical methods and approaches, (c) methods
of visualization, for example graphs, and (d) at least one statistical software product.
The more knowledge one has the more detailed and conclusive the data can be interpreted.
However, the training period is one of the typical challenges of assuring data quality
and generating statistics. To facilitate a reliable data quality assurance, scientists
have to be able to apply quick checks on current datasets, e.g. regarding data distributions
or the count of characteristics.
Canuel et al. [[7]] analyzed several existing data quality solutions provided by the scientific community.
Five out of seven were open source assessment solutions based on R [[7], [8]], which suggests a widespread application of R in the scientific community. The
authors point out that the developed in-house visualization tools were designed to
fit project-individual needs and provide several highly detailed functionalities.
Reusability of many solutions is therefore limited. The authors conclude that an “increased
development of customizable and reusable tools and libraries would be a great help
for the field” (refer to [[7]], p. 287).
Hence, the aims of the mosaicQA-library are a basic assessment of data quality and
the generation of multiple different graphs without the need for in-depth experience
with statistical methods or statistical software. Within the MOSAIC-project [[9]] (funded by the German Research Foundation (HO 1937/2–1)), a modular systematic
approach to implement a centralized data management, R was used to address this issue
because it is a free and open source powerful statistical software [[8]]. Due to the large user community, R is well-documented. Since the MOSAIC-target
group consists primarily of researchers with limited or no IT-expertise and -resources
[[10]], generating reports on data quality is a particular challenge. Therefore, the aim
of the developed library is to facilitate the generation of quality reports of collected
data reducing the thresholds for non-experts in statistical software. In its most
generic layout mosaicQA focuses on basic statistics and the occurrence of missings.
2. Objectives
This paper focusses on the technical development and implementation of a free R-library
(package) for basic monitoring and reporting of data quality in epidemiological studies.
This library reduces the effort for implementation and maintenance of monitoring and
reporting tools in cohort studies and registries. It is readily reusable for individual
statistical scenarios. In this case, monitoring is considered as a continuous quality
control of research data regarding selected aspects of data quality during the data
collection period, whereas reporting means the visualization of data quality aspects
based on the complete amount of research data at the time of report provision.
3. Methods
3.1 Developing Functions to Monitor and Report Data Quality without “Knowing the Data”
When the development of the mosaicQA-library was initiated, information about the
research data or the respective research context was very limited. Therefore, a general
and flexible approach was demanded. To generate quality reports for various scenarios,
the mosaicQA-library has to be able to handle data, without further information on
specific characteristics. The question on hand is how to evaluate data quality without
knowing their content and meaning. To address this, all quality related aspects have
to be assessed in their most general form before developing generic mechanisms and
applying them to a specific data set. As a result, mosaicQA is designed to produce
valid quality reports for various scenarios and data sets without the need to edit
the underlying R code. Thus, this library enables epidemiological researchers without
previous knowledge of R to gain insights in their data. The first step was to develop
a set of quality-related questions, targeting quality aspects on a more general level:
-
How are the values distributed?
-
What are common statistical values of the distributions?
-
How high is the proportion of missing values compared to findings or other values?
-
Are there outliers in the data set?
-
Is my data model applicable to answer the scientific question?
At this stage metric and categorical data have to be distinguished. Measurements or
other quantifiable data are considered metric, whereas answer options like “yes”,
“no”, “not specified” or “not asked” are examples for categorical data.
As one example, a selected and anonymized sample dataset originating from the GANI_MED-project
[[11]] was used in a next development step to design an R-script for metric and categorical
data (N = 1000). Metric and categorical data require different tools for visualization.
For metric data, basic statistics are, for example, histogram, boxplots and QQ-plots.
An example for the visualization of metric data using a histogram is shown in ► [Figure 1]. Moreover mean values, standard deviation and other basic statistical items have
been added.
Figure 1 Distribution of a metric variable example (sample size limited to N = 1000; Sigma
defined as standard deviation) from the GANI_MED-project [[11]] – visualized with the mosaicQA-package.
A Box-Whisker-Plot presents quantiles and enables the scientist to identify outliers
easily within the dataset. Additionally, the QQ-Plot (Quantile-Quantile-Plot) shows
the deviation of the observed data in comparison with any given probability distribution,
e.g. the normal distribution. The result of such a QQ-Plot is especially useful to
verify whether the observed data fits the applied data model. The QQ-Plot shows the
deviation of research data in relation to the expected, e.g. normal, distribution.
If both graphs are aligned closely that would indicate that this is the case.
Categorical data require absolute and relative distribution frequencies as well as
cumulative percent- ages. Since categorical data refers to different categories, for
example “female” and “male” for gender, tabular representations (with absolute and
relative frequencies) as well as distribution frequency plots can be utilized to reveal
the relation of values (within an expected value range) and qualified missings (exceeding
the expected value range, cf. ► [Figure 2]).
Figure 2 Example plot for a distribution frequency of categorical data displaying valid values
(yes, no) and qualified missings (not specified, not acquired).
For more flexibility and in order to support individual scenarios a third step included
the generalization of the developed R-scripts. This was necessary since the available
data for statistical analysis depend on individual characteristics and parameters
of a study or registry. Therefore, dependencies were resolved, functions were parameterized
and outsourced, and several supporting functions to comfortably individualize labels,
value ranges, code lists and other plot parameters were added.
A matter of utmost importance is the handling of missings. Absent information should
be categorized in terms of “qualified missings” by application of unique codes explaining
why an expected item is not available. Thus, researchers are enabled to determine
in later analyses whether a missing value relates to information not collected, due
to data collection errors during the study period, or is simply a result of:
-
Missing information (the interviewee did not know the answer)
-
Unwillingness to cooperate (the interviewee did not want to answer)
-
Missing coverage by the survey or interview tool (question/response options were not
applicable)
-
Question was not asked or exam was not carried out (e. g. because the participant
is missing the needed body part)
-
Missing documentation (result of a measurement was not documented)
Code lists for qualified missings can be used in the mosaicQA-library to customize
the labels of graphs and tables. Further information on how to handle and code qualified
missings can be found in the “reference manual to describe a data dictionary” available
at the MOSAIC homepage [[12]].
3.2 Functions of the mosaicQA-library
Altogether the developed library provides a set of reusable functions to enable researchers
to simply evaluate research data based on preformatted quality reports. Therefore,
the implemented functions contain the required functionality to generate the previously
described statistical visualizations of research data in a report format. ► [Table 1] presents an overview of the currently implemented functions within the mosaicQA-library
(version 1.2.0).
Table 1
Overview of implemented functionality within the mosaicQA-library (version 1.2.0).
Functions called by the user (User Functions) invoke calls of internal functions.
|
Function
|
Description
|
Call Type
|
|
mosaic.setGlobalMissingTreshold (threshold)
|
Set Global Threshold for Missings, e.g. 99000
|
User
|
|
mosaic.setGlobalUnit(unit)
|
Set Global Unit Label to be used in graphs, e.g. “(cm)”
|
User
|
|
mosaic.setGlobalDescription (description)
|
Set Global Description for variable data, e.g. “waist circumference”
|
User
|
|
mosaic.loadCsvData(filename)
|
Load CSV Data from file, e.g. “c:\data.csv”
|
User
|
|
mosaic.setGlobalCodelist (codelist)
|
Set and parse a global code list for categorical data, e.g. c(“1 = yes”, “2 = no”,
“99996 = no information”)
|
User
|
|
mosaic.createSimple PdfCategorical(inputfile, output-folder)
|
Create simple PDF-file for categorical CSV data using the functions listed above
|
User
|
|
mosaic.createSimplePdfCategorical Dataframe(dataframe, outputfolder)
|
Create simple PDF-file for categorical data frame with n columns using the functions
listed above
|
User
|
|
mosaic.createSimple PdfMetric(inputfile, outputfolder)
|
Create simple PDF-file for metric CSV data with n columns using the functions listed
above
|
User
|
|
mosaic.createSimplePdfMetric Dataframe(dataframe, outputfolder)
|
Create simple PDF-file for metric data frame with n columns using the functions listed
above
|
User
|
|
mosaic.countValue(searchvalue, datacolumn)
|
Count occurrence of search value in data column
|
Internal
|
|
mosaic.preProcessMetricData(data)
|
Pre-process metric data to allow missing-ratio table
|
Internal
|
|
mosaic.preProcessCategoricalData(data)
|
Identify unique values in data column, get absolute, percentage and cumulative statistics
|
Internal
|
|
mosaic.generateMetricTablePlot (data, num of columns, column index, varname)
|
Generate missing-ratio table for metric data
|
Internal
|
|
mosaic.generateMetricPlots (data snippet, varname)
|
Generate graphs for metric data
|
Internal
|
|
mosaic.beginPlot(varname, outputfolder)
|
Begin plotting, generate PDF-file with given variable name
|
Internal
|
|
mosaic.addFootnote()
|
Add a footnote with timestamp and MOSAIC text
|
Internal
|
|
mosaic.finishPlot()
|
Finish plotting, close PDF-file
|
Internal
|
|
mosaic.generateCategoricalPlot (dataframe, varname)
|
Create plots for categorical data
|
Internal
|
|
mosaic.getTimestamp()
|
Get formatted timestamp, e.g. 2015_09_16_235811
|
Internal
|
4. Results
The aims of mosaicQA were to develop a library to provide a basic assessment of data
quality and to generate a set of informative graphs. Especially, there should be no
demand for the potential researcher to master R or any other statistical software.
Based on simple samples the researcher should be enabled to reach the desired goal
with a readily understandable set of commands.
The developed library enables researchers to generate reports for various kinds of
metric and categorical data. Additionally, general reports for multivariate input
data and, if needed, detailed results for single-variable data can be easily produced.
CSV-files as well as data frames can be used as input format to create a report. The
results are instantly saved in an automatically generated PDF-file. For each study
variable within the data input file a separate PDF-file with standard or, if applicable,
customized plots and tables is produced. These standard reports enable the user to
monitor and report the data integrity and completeness. However, for more specific
reports the knowledge of metadata is necessary, including definition of units, variables,
descriptions, code lists and categories of qualified missings.
► [Figure 3] provides a code snippet of a sample script to depict the generation of PDF-reports.
Only four easy-to-use steps have to be performed from data import to report generation:
The researcher has to (a) load the mosaicQA-library, (b) define the data source, (c)
the output folder and (d) has to decide whether metric or categorical reports should
be created. Necessary additional libraries (e.g. gplots, psych, etc.) are installed
automatically. Even users with no experience with statistical software are encouraged
to customize thresholds for qualified missings, units and the description of variables
by calling three further respective library functions (mosaic.setGlobalMissingTreshold, mosaic.setGlobalUnit, mosaic.setGlobalDescription). The necessary data preparation, the customizing of utilized plots and tables and
the required formatting is performed automatically by simply calling the function
mosaic.create-SimplePdfmetric (in case of metric data). As a result a PDF-file is created; containing essential
visualizations to monitor and report the specified data (cf. ► [Figure 4]).
Figure 3 Code snippet to create a report for a metric dataset sample using the R-package “
mosaicQA”.
Figure 4 Applying the code snippet for a metric data variable (cf. ► [Figure 3]) generates a full set of visualizations to facilitate monitoring and reporting of
data integrity and completeness.
The mosaicQA-package was implemented using R (version 3.3.2) and RStudio. The developed
R-package can easily be installed using the official CRAN repository [[13]]. Additional R libraries will be automatically installed if necessary. Additional
sample scripts as well as sample data are available at the MOSAIC- project website
[[14]]. Thus, the provided mosaicQA-library is a ready-to-use implementation, which enables
even non-experts to quickly visualize their data and create own functions using working
examples as a starting point. To simplify the adaption of the script examples every
customizable method is completely documented including an invitation to “specify”,
“set” or “adjust” the given standard values (cf. ► [Figure 3]).
5. Conclusions
The basic idea of a set of reusable functions to simplify the data quality assurance
process is not new. R is an open source solution and supports the implementation of
automated processes in terms of reusable statistical analyses. R has been “blamed
for being a bit rough” (see [[15]], p. 187). Individual R-packages, RStudio and cheat sheets [[16]] are intended to simplify the usage of R especially for non-statistical experts.
The developed library mosaicQA enables users to more easily visualize collected research
data and reduces the need for specific implementations for each epidemiological study
or registry. Computational or scripting knowledge are not required. For mosaicQA different
metric and categorical datasets were used in order to generate appropriate reports
as well as to evaluate the generic characteristics of the applied R methods. The library
is not designed to deal with complex data types like images or biological samples.
Currently, mosaicQA focusses on the data structure quality and supports the assessment
of several quality indicators recommended by the TMF including frequency, distribution
and plausibility of research variables and the occurrence of missing values (TMF-1013
– TMF-1016) and extreme values (TMF-1018) [[4]]. However, in terms of the categories adopted by the TMF guidelines the provided
library only addresses the examination of category (c) selected variables. The present
implementation focusses on the visualization of absolute values. Further implementations
will also support the formula-based, relative TMF quality indicators.
mosaicQA was released as an official R-package for basic monitoring and reporting
of data quality. As a consequence, the active developer community and users in the
translational field are now encouraged to contribute (see [[15]], p. 288). The “library and sample script approach” supports the user to modify
existing functionalities and add new features with minimal effort.
In summary, this article described how the mosaicQA-library was developed and how
the presented functionality can be used to create essential statistics to review data
integrity and completeness. The implemented R-package provides the necessary flexibility,
portability and reusability for usage in various epidemiological research projects.
The current state of the released R-package is a starting point for more advanced
automated processes. Zalatel et al. (see [[15]], p. 191) recommend a continuous controlling of selected data quality indicators
based on routinely implemented processes as a possible strategy to reduce data quality
problems. Thus, the integration in more complex application scenarios is conceivable,
such as the MOSAIC Toolbox for Research [[17]] or tranSMART [[18]]. Consequently, data quality reports could easily be provided automatically and
web-based to scientific partners on a daily basis.
The mosaicQA-library will be optimized continuously to improve the applicability in
the translational field. At the moment, mosaicQA focusses on CSV-based and data frame-based
imports and visualizations for single data variables. To be a basis for a data quality
framework in order to support data exploration as well as the generation of hypotheses
[[7]], the support for complex plots (2-n data variables) is an essential next step.
In addition, further development of mosaicQA shall involve the support for respective
TMF quality indicators to investigate the number of values below and/or above of specified
ranges (TMF-1019) and to identify the number of disallowed values (TMF-1022 – TMF-1025).
The active field of developing data quality monitoring tools can support better processes
in research. The interpretation and management of necessary consequences, however,
remains the obligation of the responsible scientist [[4]].