DOI: 10.1055/a-1976-2371
Definition of a Practical Taxonomy for Referencing Data Quality Problems in Health Care Databases
Abstract
Introduction Health care information systems can generate and/or record huge volumes of data, some of which may be reused for research, clinical trials, or teaching. However, these databases can be affected by data quality problems; hence, an important step in the data reuse process consists in detecting and rectifying these issues. With a view to facilitating the assessment of data quality, we developed a taxonomy of data quality problems in operational databases.
Material We searched the literature for publications that mentioned “data quality problems,” “data quality taxonomy,” “data quality assessment,” or “dirty data.” The publications were then reviewed, compared, summarized, and structured using a bottom-up approach, to provide an operational taxonomy of data quality problems. Each problem was illustrated with fictitious but realistic examples from clinical databases.
Results Twelve publications were selected, and 286 instances of data quality problems were identified and were classified according to six distinct levels of granularity. We used the classification defined by Oliveira et al to structure our taxonomy. The extracted items were grouped into 53 data quality problems.
Discussion This taxonomy facilitates the systematic assessment of data quality in databases by presenting quality problems according to the data's granularity. The definition of this taxonomy is the first step in the data cleaning process. The subsequent steps include the definition of the associated quality assessment methods and data cleaning methods.
Conclusion Our new taxonomy enabled the classification and illustration of 53 data quality problems found in hospital databases.
Introduction
In health care organizations, software packages and tools routinely generate and/or record huge volumes of data while they help users to perform their work. For example, software tools record the patients' stays in care units (for administrative purposes), laboratory test results (for optimizing diagnosis and treatment), and data from surgical theaters (to monitor the quality of care).[1]
In most hospitals, these operational applications have been implemented for several years now and may provide significant volumes of data of great value.[2] Indeed, several data reuse initiatives have been undertaken,[3] [4] [5] [6] [7] [8] [9] [10] to discover new knowledge,[11] screen patients prospectively for inclusion in clinical trials,[12] [13] provide physicians with teaching support,[14] and facilitate clinical research.[3] [4] [5]
However, the potential reuse of data is not always taken into account when databases are implemented and operated. For example, operational databases contain errors due to user input mistakes, poor documentation, measurement artifacts,[15] [16] [17] and inter-database differences in structure. As a result, the exploitation of these data can give erroneous results.[5] [6] Data cleaning is one way of dealing with data quality problems.[5] [7] [12] Data cleaning typically comprises four main steps. First, the quality of the source data must be assessed; this assessment also provides an opportunity to judge the usefulness and accuracy of the software and the corresponding database. The various data quality problems can be related to the application's use, the database's design, the application's settings, and so on. Second, data cleaning processes are selected to address the detected data quality problems; some problems can be tolerated and will not compromise the data reuse. Third, the selected data cleaning processes are implemented. Lastly, the cleaned data are re-evaluated, to measure the impact of cleaning.
To the best of our knowledge, a comprehensive taxonomy of data quality problems is currently lacking. Ideally, a taxonomy should (1) address all possible types of technical problems (i.e., from a single record to multiple data sources, including instances and structures), (2) support the systematic assessment, management, and improvement of data quality, and (3) facilitate the development of solutions that are quick to implement and share.
Objectives
Here, we focused on the first step of the data cleaning process: the assessment of data quality. More precisely, we sought to classify technical problems and excluded data privacy, access, and security issues. The objectives of the present study were to define an operational taxonomy for data quality problems in operational health care databases, illustrate the taxonomy with concrete examples from clinical databases, and thus facilitate data quality assessments. To this end, we reviewed, summarized, and structured published works in this field.
Methods
For the sake of clarity and consistency, we applied the following names and definitions throughout the present manuscript. A database corresponds to a source of data (a data source or a source system) and is composed of several tables (also referred to as relations). A table stores rows (also referred to as tuples, lines, or records) characterized by various columns (attributes, fields, or variables). A value is stored in a cell at the intersection of a row and a column. Values in a given column can follow a predefined format (a syntax rule, grammar, or standardized format, e.g., “YYYY/MM/DD” for a date). A data quality problem is defined as schema-related when it arises from the data structure, and as instance-related when it is linked to the value itself (independently of the data type).
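These conventions can be made concrete with a small sketch (not part of the original study): the predefined format mentioned above, a “YYYY/MM/DD” date syntax, can be expressed as a regular expression and checked against candidate values.

```python
import re

# Hypothetical syntax rule for a date column, following the
# "YYYY/MM/DD" format cited in the text.
DATE_FORMAT = re.compile(r"^\d{4}/\d{2}/\d{2}$")

for value in ("1975/03/12", "12/03/1975", "1975-03-12"):
    print(value, bool(DATE_FORMAT.match(value)))
# → 1975/03/12 True, 12/03/1975 False, 1975-03-12 False
```

Note that this checks the syntax only; a value such as "1975/13/45" would pass the format rule while still violating a domain constraint.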
We applied a three-step method. The first step consisted in drawing up an inventory of data quality problems and their classifications, according to the literature. The second step consisted in structuring these problems in the most efficient way. Lastly, each data quality problem was illustrated with fictitious but realistic examples.
Review of Published Works
We searched the scientific literature for peer-reviewed English- or French-language publications containing a list or a taxonomy of data quality problems. The other main inclusion criterion was a practical definition and/or illustration for identifying data quality problems, preventing their occurrence, or lessening their impact.
The IEEE Xplore Digital Library, Springer, Science Direct, and MEDLINE via PubMed databases (time period: 1979–2022) were searched with the terms “data quality problems,” “data quality taxonomy,” “data quality assessment,” or “dirty data” (applied to the title, abstract, keywords, and/or full text). The search results (containing each publication's title, author(s), journal, and digital object identifier) were exported to an Excel file (Microsoft Corporation, Redmond, Washington, United States). Two reviewers (P.Q. and A.L.) independently checked the publication titles and abstracts against the selection criteria and then screened the selected full-text articles. Any discrepancies were resolved through discussion with S.D., R.P., J.S., R.M., N.M., and M.F., until a consensus was reached. The search was extended by screening the references of the included documents.
For each publication, we extracted the taxonomy's objective, the data quality problems defined or illustrated, and the classification used to present these quality problems.
Organization and Implementation of the New Taxonomy
First, existing taxonomies were compared; we then chose the most comprehensive, intuitive taxonomy for the assessment of data quality in operational databases. Next, the data quality problems extracted from the reviewed publications were implemented in the chosen taxonomy. Data quality problems of the same type were grouped together. For example, a problem identified for a given field (e.g., missing date of birth data) was grouped with other similar problems (e.g., missing weight data). Likewise, identical problems defined for different types of data were grouped together. We also identified similar data quality problems that occurred at different levels of granularity (e.g., violations concerning one tuple, many tuples, or many tables). When necessary, we harmonized the taxonomy's structure. For example, the same data problem could be defined with a positive or negative sentence (e.g., compliance or noncompliance with an integrity constraint); here, we chose to define the problem as the negative event.
Illustration of the New Taxonomy
The items selected for the new taxonomy were specified according to the classification. For the sake of clarity, each data quality problem was documented by a short title, a definition, and an illustration. Illustrations were created by transposing the defined problems to two realistic (but fictitious) hospital databases: an anesthesia information management system (AIMS) and an administrative software application (ADMIN) that deals with hospital stays (steps, duration, diagnoses, and medical procedures). Simplified models of the data in these databases are shown in [Fig. 1]. Only tables storing facts (e.g., hospital stays and drug administrations) were selected; we omitted tables storing vocabularies (e.g., taxonomies of drugs, diagnoses, and medical acts). The AIMS database comprised five main tables: PATIENT (patient information), INTERVENTION (information on surgery), MEASUREMENT (monitoring), DRUG (drug administration), and EVENT (various events in anesthesia management). Two-dimensional tables were added to illustrate certain data quality problems. The ADMIN database also comprised five main tables: PAT (patient information), HOSPITAL_STAY (from admission to discharge), UNIT_STAY (details of stays in specific units), MEDICAL_PROCEDURE, and DIAGNOSIS. All the individuals portrayed in the illustrations were fictitious.
[Fig. 1: simplified models of the data in the AIMS and ADMIN databases]
Results
The Literature Search
A total of 1,856 publications were identified in the initial literature search (IEEE Xplore Digital Library: n = 410; Springer: n = 784; Science Direct: n = 259; MEDLINE via PubMed: n = 403). We excluded 177 duplicate publications; hence, a total of 1,679 publication titles and abstracts were screened for relevance. After the first round of screening, 221 publications met our inclusion criteria. In the second round, 209 were excluded and so 12 publications were included. The main reasons for exclusion were a lack of data quality problems or publication in a language other than English or French. The selected publications' main characteristics are summarized in [Table 1]. Ultimately, 286 instances of data quality problems were extracted from the 12 publications. Although these instances were classified according to six distinct hierarchies in their original taxonomies, they were often similar, even though their wording was heterogeneous. The disparities in structure and wording were mainly due to differences between the taxonomies' respective purposes (e.g., the evaluation of data quality tools).
Structure of the Taxonomy
We chose a taxonomic structure ([Fig. 2]) based on the definitions given by Oliveira et al[18] and Rahm and Do.[17] This structure organizes data quality problems according to the corresponding database's levels of granularity. This approach makes it easier to review database objects and related problems occurring at the single column level, single row level, or multiple data source level. However, Oliveira et al focused on instance-related problems and excluded schema-related problems. Like Rahm and Do, we completed the structure by adding data quality problems related to the database's schema for each level of granularity.
[Fig. 2: structure of the taxonomy, based on Oliveira et al and Rahm and Do]
The classifications implemented by Kim et al,[19] Li et al,[20] and Barateiro et al[21] referred to the root causes of data quality problems (e.g., integrity through transaction management), the impact of quality problems on data reuse (e.g., documented, not-wrong data that were nevertheless not usable), and ways of avoiding data quality problems (e.g., by enforcing database constraints).[16] [17] [22] Gschwandtner et al presented a taxonomy restricted to time-oriented data, which therefore failed to cover all the identified data quality problems.[23] Weiskopf et al, Wang et al, and Diaz-Garelli et al developed lists of data quality problems but did not rank or organize them.[24] [25] [26] Kahn et al and Henley-Smith et al organized data quality problems according to three dimensions of data quality: conformance, completeness, and plausibility.[27] [28] We did not implement Müller's structure because it does not take into account quality problems related to multiple sources.[29]
Elements of the Practical Taxonomy
After gathering together similar items from the 12 sources, the practical taxonomy comprised 53 items ([Table 3]). For each category in the taxonomy, an example of a data quality problem is fully documented below.
Single Column of a Single Row
- Data quality problem: missing value.
- Definition: the value of a cell is null.
- Example: in a row of the table PATIENT, the column BIRTH_DATE has a null value.

PATIENT (PATIENT_ID = 44908, INPATIENT_IDENTIFIER = "1001982736", FIRST_NAME = "JOSIANE", LAST_NAME = "DEWALLE", MARITAL_NAME = "ROSEY", BIRTH_DATE = NULL, …)
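As an illustration (not part of the original study), a missing-value check of this kind can be sketched against an in-memory SQLite stand-in for the PATIENT table. The column names follow the paper's fictitious example; we assume that a missing value may be stored either as NULL or as an empty string.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PATIENT (PATIENT_ID INTEGER, BIRTH_DATE TEXT)")
con.executemany("INSERT INTO PATIENT VALUES (?, ?)",
                [(44907, "1975/03/12"), (44908, None), (44909, "")])

# Flag rows whose BIRTH_DATE is NULL or blank.
rows = con.execute(
    "SELECT PATIENT_ID FROM PATIENT "
    "WHERE BIRTH_DATE IS NULL OR TRIM(BIRTH_DATE) = '' "
    "ORDER BY PATIENT_ID"
).fetchall()
print([r[0] for r in rows])  # → [44908, 44909]
```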
A Single Column in Multiple Rows
- Data quality problem: unique value violation.
- Definition: a column has the same value in different rows, whereas it is supposed to be unique.
- Example: in the PATIENT table, the following rows have the same inpatient identifier.

PATIENT (PATIENT_ID = 102310, INPATIENT_IDENTIFIER = "1002392301", …)
PATIENT (PATIENT_ID = 104913, INPATIENT_IDENTIFIER = "1002392301", …)
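A unique value violation of this kind is typically detected with a GROUP BY/HAVING aggregation. The sketch below (an illustration, not the authors' method) reuses an in-memory SQLite stand-in for the PATIENT table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PATIENT "
            "(PATIENT_ID INTEGER, INPATIENT_IDENTIFIER TEXT)")
con.executemany("INSERT INTO PATIENT VALUES (?, ?)",
                [(102310, "1002392301"),
                 (104913, "1002392301"),
                 (104914, "1002392302")])

# Identifiers that appear in more than one row violate uniqueness.
dups = con.execute(
    "SELECT INPATIENT_IDENTIFIER, COUNT(*) FROM PATIENT "
    "GROUP BY INPATIENT_IDENTIFIER HAVING COUNT(*) > 1"
).fetchall()
print(dups)  # → [('1002392301', 2)]
```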
Multiple Columns in a Single Row
- Data quality problem: wrong derived-field data.
- Definition: a column calculated from other columns shows an incorrect result.
- Example: in a row of the DRUG table, the TOTAL column does not correspond to the product of the infusion rate, the concentration, and the administration duration (expected value ≈ 5 mg).

DRUG (INTERVENTION_ID = 134454, DRUG_ID = 180, ADMINISTRATION_DATE = "2012/12/14 08:20:05", END_DATE = "2012/12/14 10:45:50", POSOLOGY = 2, POSOLOGY_UNIT = "mL/h", CONCENTRATION = 1, CONCENTRATION_UNIT = "mg/mL", TOTAL = 7, TOTAL_UNIT = "mg")
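A derived-field check recomputes the expected value from its source columns and flags discrepancies beyond a tolerance. The sketch below (an illustration with an assumed tolerance, not the authors' method) uses the values of the fictitious DRUG row: rate (mL/h) × concentration (mg/mL) × duration (h) should equal TOTAL (mg).

```python
from datetime import datetime

start = datetime.strptime("2012/12/14 08:20:05", "%Y/%m/%d %H:%M:%S")
end = datetime.strptime("2012/12/14 10:45:50", "%Y/%m/%d %H:%M:%S")
hours = (end - start).total_seconds() / 3600  # ≈ 2.43 h

posology_ml_per_h = 2      # POSOLOGY, in mL/h
concentration_mg_per_ml = 1  # CONCENTRATION, in mg/mL
stored_total_mg = 7          # TOTAL, as recorded in the row

expected_mg = posology_ml_per_h * concentration_mg_per_ml * hours  # ≈ 4.86

tolerance = 0.5  # assumed acceptable rounding margin
if abs(expected_mg - stored_total_mg) > tolerance:
    print(f"derived-field mismatch: stored {stored_total_mg} mg, "
          f"expected {expected_mg:.2f} mg")
```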
Single Table
- Data quality problem: violation of a business domain constraint.
- Definition: in a table, a row does not comply with a business domain constraint linked to another row.
- Example: in the EVENT table, the date of the “End of surgery” record is prior to the date of the “Start of surgery” record.

EVENT (INTERVENTION_ID = 250931, EVENT_ID = 158, EVENT_DATE = "2014/01/02 13:21:04", …)
EVENT (INTERVENTION_ID = 250931, EVENT_ID = 159, EVENT_DATE = "2014/01/02 12:21:04", …)
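A constraint of this kind can be checked with a self-join on the EVENT table: for each intervention, the end-of-surgery timestamp must not precede the start-of-surgery timestamp. In the sketch below (an illustration, not the authors' method), event codes 158 and 159 are assumed to denote “Start of surgery” and “End of surgery,” and zero-padded "YYYY/MM/DD HH:MM:SS" strings are compared lexicographically.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EVENT "
            "(INTERVENTION_ID INTEGER, EVENT_ID INTEGER, EVENT_DATE TEXT)")
con.executemany("INSERT INTO EVENT VALUES (?, ?, ?)",
                [(250931, 158, "2014/01/02 13:21:04"),   # start of surgery
                 (250931, 159, "2014/01/02 12:21:04")])  # end of surgery

# Interventions whose end event precedes their start event.
bad = con.execute("""
    SELECT s.INTERVENTION_ID
    FROM EVENT s JOIN EVENT e ON s.INTERVENTION_ID = e.INTERVENTION_ID
    WHERE s.EVENT_ID = 158 AND e.EVENT_ID = 159
      AND e.EVENT_DATE < s.EVENT_DATE
""").fetchall()
print([r[0] for r in bad])  # → [250931]
```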
Multiple Tables
- Data quality problem: referential integrity violation.
- Definition: the value of a foreign key does not reference any row in the table of the primary key.
- Example: the column UNIT_ID in the INTERVENTION table references the row characterized by the primary key UNIT_ID = 214 in the UNIT table, whereas this row is missing from the UNIT table.

INTERVENTION (INTERVENTION_ID = 219321, …, UNIT_ID = 214, …)
UNIT (UNIT_ID = 213, …)
UNIT (UNIT_ID = 215, …)
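Orphaned foreign keys are conventionally found with an anti-join (a LEFT JOIN keeping the rows with no match). The sketch below (an illustration, not the authors' method) reproduces the fictitious INTERVENTION/UNIT example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE INTERVENTION "
            "(INTERVENTION_ID INTEGER, UNIT_ID INTEGER)")
con.execute("CREATE TABLE UNIT (UNIT_ID INTEGER)")
con.execute("INSERT INTO INTERVENTION VALUES (219321, 214)")
con.executemany("INSERT INTO UNIT VALUES (?)", [(213,), (215,)])

# Interventions whose UNIT_ID has no matching row in UNIT.
orphans = con.execute("""
    SELECT i.INTERVENTION_ID, i.UNIT_ID
    FROM INTERVENTION i LEFT JOIN UNIT u ON i.UNIT_ID = u.UNIT_ID
    WHERE u.UNIT_ID IS NULL
""").fetchall()
print(orphans)  # → [(219321, 214)]
```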
Multiple Data Sources
- Data quality problem: synonyms in schema objects.
- Definition: different names are used for the same object.
- Example: in databases 1 and 2, the “patient” object is represented by two different table names, respectively PATIENT and PAT.
- Source: Oliveira et al.[18]
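Schema synonyms cannot be resolved automatically, but candidate pairs can be surfaced by comparing the object names of the two sources. In the sketch below (an illustration, assuming both sources are SQLite databases whose table names are read from sqlite_master), tables present in only one source are flagged for manual mapping.

```python
import sqlite3

db1 = sqlite3.connect(":memory:")
db2 = sqlite3.connect(":memory:")
db1.execute("CREATE TABLE PATIENT (PATIENT_ID INTEGER)")
db2.execute("CREATE TABLE PAT (PAT_ID INTEGER)")

def table_names(con):
    """Return the set of table names declared in a SQLite database."""
    return {r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}

# Tables present in one source but not the other are candidate
# synonyms that must be mapped by hand (here, PATIENT vs. PAT).
candidates = sorted(table_names(db1) ^ table_names(db2))
print(candidates)  # → ['PAT', 'PATIENT']
```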
Discussion
The objective of the present study was to define a practical taxonomy of technical data quality problems in operational databases. To this end, we reviewed the literature on published taxonomies. Instances documented in the selected taxonomies were gathered together and organized into a new, practical taxonomy. Each item of the new taxonomy was fully documented and illustrated with typical examples from operational health information systems.
By adopting a bottom-up approach, this taxonomy facilitates the systematic assessment of data quality in databases; it presents quality problems according to the data's granularity. In this way, exploration of the database structure can range from the most elementary structure (the value stored in a single cell) to more complex situations (data recorded by multiple sources). Moreover, we chose to combine schema- and instance-related problems in the same taxonomy. Lastly, each data quality problem was systematically illustrated with a (fictitious) example from a clinical database.
The main limitation of the new taxonomy is its scope; we focused solely on data quality problems in operational databases and did not consider data quality problems in data warehouses or after data cleaning.[30] Similarly, data quality problems related to authorization, accessibility, and security were not considered.[31] However, the taxonomy can be extended accordingly.
The items in our taxonomy are generic templates that must be implemented on the evaluated database, depending on the constituent tables and columns. For example, the missing value SCSR1 template could be instantiated with all columns for which a value is mandatory, as suggested by Diaz-Garelli et al and Wang et al.[24] [25] Each data quality problem defined in the taxonomy could be completed with its incidence when assessing the data, as suggested by Henley-Smith et al.[27] Depending on the incidence and characteristics of the quality problems, one might also be able to give a criticality score for each data quality problem or an overall score for each data quality dimension, as defined by Weiskopf et al.[26]
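The idea of instantiating a generic template on concrete columns can be sketched as follows (an illustration, not the authors' tooling): a single missing-value rule, parameterized by table and column name, is applied to each column assumed to be mandatory, yielding the incidence of the problem per column. The table and column identifiers come from trusted metadata, since SQL identifiers cannot be bound as query parameters.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PATIENT "
            "(PATIENT_ID INTEGER, BIRTH_DATE TEXT, LAST_NAME TEXT)")
con.executemany("INSERT INTO PATIENT VALUES (?, ?, ?)",
                [(1, None, "DEWALLE"), (2, "1980/05/01", None)])

def missing_value_rate(con, table, column):
    """Instantiate the 'missing value' template for one column and
    return the proportion of rows affected (the problem's incidence)."""
    total = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    missing = con.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    return missing / total

# Columns assumed mandatory for this hypothetical assessment.
rates = {col: missing_value_rate(con, "PATIENT", col)
         for col in ("BIRTH_DATE", "LAST_NAME")}
print(rates)  # → {'BIRTH_DATE': 0.5, 'LAST_NAME': 0.5}
```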
Once the quality problem has been detected, the focus should be on its cause and potential measures for preventing its occurrence in the source systems. It is useful to provide the software's developers and users with feedback during this step, to increase the data quality upstream of data storage.[30] The definition of assessment methods was outside the scope of the present study. Several data quality problems can be assessed by published automatic methods, whereas others always require manual analysis.[32] A further step in our research would be to link the data quality problems defined in our taxonomy to the appropriate assessment methods. Furthermore, data quality problems could be matched to the corresponding data cleaning methods. Lastly, we intend to assess our taxonomy with new data sources.
Conclusion
Based on the data quality problems reported in the literature, we defined a new taxonomy and illustrated it with 53 data quality problems from hospital databases. This taxonomy could be used to assess data quality problems during data reuse.
Conflict of Interest
None declared.
Acknowledgment
The authors thank David Fraser (Biotech Communication) for English editing.
References
- 1 Adler-Milstein J, Nong P. Early experiences with patient generated health data: health system and patient perspectives. J Am Med Inform Assoc 2019; 26 (10) 952-959
- 2 Weng C, Kahn MG. Clinical research informatics for big data and precision medicine. Yearb Med Inform 2016; (01) 211-218
- 3 Weiner MG, Embi PJ. Toward reuse of clinical data for research and quality improvement: the end of the beginning?. Ann Intern Med 2009; 151 (05) 359-360
- 4 Nunez CM. Advanced techniques for anesthesia data analysis. Seminars Anesthesia Perioperative Medicine and Pain 2004; 23 (02) 121-124
- 5 Prokosch HU, Ganslandt T. Perspectives for medical informatics. Reusing the electronic medical record for clinical research. Methods Inf Med 2009; 48 (01) 38-44
- 6 Ebidia A, Mulder C, Tripp B, Morgan MW. Getting data out of the electronic patient record: critical steps in building a data warehouse for decision support. Proc AMIA Symp 1999; 745-749
- 7 Safran C, Bloomrosen M, Hammond WE. et al; Expert Panel. Toward a national framework for the secondary use of health data: an American Medical Informatics Association White Paper. J Am Med Inform Assoc 2007; 14 (01) 1-9
- 8 Meystre SM, Lovis C, Bürkle T, Tognola G, Budrionis A, Lehmann CU. Clinical data reuse or secondary use: current status and potential future progress. Yearb Med Inform 2017; 26 (01) 38-52
- 9 McGlynn EA, Lieu TA, Durham ML. et al. Developing a data infrastructure for a learning health system: the PORTAL network. J Am Med Inform Assoc 2014; 21 (04) 596-601
- 10 Chazard E, Ficheur G, Caron A. et al. Secondary use of healthcare structured data: the challenge of domain-knowledge based extraction of features. Stud Health Technol Inform 2018; 255: 15-19
- 11 Wade TD. Refining gold from existing data. Curr Opin Allergy Clin Immunol 2014; 14 (03) 181-185
- 12 Miller JL. The EHR solution to clinical trial recruitment in physician groups. Health Manag Technol 2006; 27 (12) 22-25
- 13 Cai T, Cai F, Dahal KP. et al. Improving the efficiency of clinical trial recruitment using an ensemble machine learning to assist with eligibility screening. ACR Open Rheumatol 2021; 3 (09) 593-600
- 14 Altman M. The clinical data repository: a challenge to medical student education. J Am Med Inform Assoc 2007; 14 (06) 697-699
- 15 Dentler K, ten Teije A, de Keizer N, Cornet R. Barriers to the reuse of routinely recorded clinical data: a field report. Stud Health Technol Inform 2013; 192: 313-317
- 16 Redman TC. The impact of poor data quality on the typical enterprise. Commun ACM 1998; 41 (02) 79-82
- 17 Rahm E, Do H. Data cleaning: problems and current approaches. IEEE Data Eng Bull 2000; 23 (04) 3-13
- 18 Oliveira P, Rodrigues F, Henriques P. A formal definition of data quality problems. In: Proceedings of the International Conference on Information Quality (ICIQ); 2005
- 19 Kim W, Choi BJ, Hong EK, Kim SK, Lee D. A taxonomy of dirty data. Data Min Knowl Discov 2003; 7 (01) 81-99
- 20 Li L, Peng T, Kennedy J. A Rule Based Taxonomy of Dirty Data. In: 2010
- 21 Barateiro J, Galhardas H. A survey of data quality tools. Published online 2005. Accessed June 8, 2022 at: https://www.semanticscholar.org/paper/A-Survey-of-Data-Quality-Tools-Barateiro-Galhardas/1122bf09792b2cd93ef61d9fba24e2cbfd4e8325
- 22 Dasu T, Vesonder GT, Wright J. Data quality through knowledge engineering. In: KDD '03; 2003
- 23 Gschwandtner T, Gärtner J, Aigner W, Miksch S. A taxonomy of dirty time-oriented data. In: CD-ARES; 2012
- 24 Diaz-Garelli F, Long A, Bancks MP, Bertoni AG, Narayanan A, Wells BJ. Developing a data quality standard primer for cardiovascular risk assessment from electronic health record data using the DataGauge process. AMIA Annu Symp Proc AMIA Symp 2021; 2021: 388-397
- 25 Wang Z, Talburt JR, Wu N, Dagtas S, Zozus MN. A rule-based data quality assessment system for electronic health record data. Appl Clin Inform 2020; 11 (04) 622-634
- 26 Weiskopf NG, Bakken S, Hripcsak G, Weng C. A data quality assessment guideline for electronic health record data reuse. EGEMS (Wash DC) 2017; 5 (01) 14
- 27 Henley-Smith S, Boyle D, Gray K. Improving a secondary use health data warehouse: proposing a multi-level data quality framework. EGEMS (Wash DC) 2019; 7 (01) 38
- 28 Kahn MG, Callahan TJ, Barnard J. et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. EGEMS (Wash DC) 2016; 4 (01) 1244
- 29 Müller H, Freytag J. Problems, methods, and challenges in comprehensive data cleansing. Published online 2005. Accessed June 8, 2022 at: https://www.semanticscholar.org/paper/Problems-%2C-Methods-%2C-and-Challenges-in-Data-Mueller-Freytag/0168304c626a5b186bf559bf774a1dca52b04931
- 30 de Almeida WG, de Sousa RD, de Deus FD, Nze GDA, de Mendonça FLL. Taxonomy of data quality problems in multidimensional Data Warehouse models. Paper presented at: 8th Iberian Conference on Information Systems and Technologies; Lisbon, Portugal, June, 19–22, 2013
- 31 Strong D, Lee YW, Wang RY. Data quality in context. Commun ACM 1997; 40 (05) 103-110
- 32 Woodall P, Oberhofer M, Borek A. A classification of data quality assessment and improvement methods. Int J Inf Qual 2014; 3 (04) 298
Publication History
Received: 29 June 2022
Accepted: 02 November 2022
Accepted Manuscript online: 10 November 2022
Article published online: 09 January 2023
© 2023. Thieme. All rights reserved.
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany