DOI: 10.1055/s-0042-1760247
Synthetic Tabular Data Evaluation in the Health Domain Covering Resemblance, Utility, and Privacy Dimensions
Funding This research was partially funded by the Department of Economic Development and Infrastructure of the Basque Government through the Emaitek Plus Action Plan Programme. Ane Alberdi is part of the Intelligent Systems for Industrial Systems research group of Mondragon Unibertsitatea (IT1676-22), supported by the Department of Education, Universities and Research of the Basque Country.
Abstract
Background Synthetic tabular data generation is a promising technology for data augmentation and privacy preservation. Before adoption, however, generated synthetic tabular data must be assessed empirically across the dimensions relevant to the target application to determine its efficacy. The literature lacks a standardized and objective strategy for evaluating and benchmarking synthetic tabular data in the health domain.
Objective The aim of this paper is to identify key dimensions, per-dimension metrics, and methods for evaluating synthetic tabular data generated with different techniques and configurations for health domain application development, and to provide a strategy to orchestrate them.
Methods Based on the literature, the resemblance, utility, and privacy dimensions have been prioritized, and a collection of metrics and methods for their evaluation has been orchestrated into a complete evaluation pipeline. This pipeline enables a guided and comparative assessment of generated synthetic tabular data, classifying its quality into three categories (“Excellent,” “Good,” and “Poor”). Six health care-related datasets and four synthetic tabular data generation approaches have been chosen for an analysis and evaluation that verifies the utility of the proposed evaluation pipeline.
Results The synthetic tabular data generated with the four selected approaches has maintained resemblance, utility, and privacy for most combinations of dataset and synthetic tabular data generation approach. In several datasets, some approaches have outperformed others, while in other datasets, more than one approach has yielded the same performance.
Conclusion The results have shown that the proposed pipeline can effectively be used to evaluate and benchmark the synthetic tabular data generated by various synthetic tabular data generation approaches. Therefore, this pipeline can support the scientific community in selecting the most suitable synthetic tabular data generation approaches for their data and application of interest.
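The categorization step mentioned in the Methods summary can be illustrated with a minimal sketch. The abstract names only the three dimensions and the three categories, so everything else here is an assumption made for illustration: the function names `categorize` and `evaluate`, the idea of scoring a dimension by the share of metric checks it passes, and the two thresholds are hypothetical and are not taken from the paper.

```python
# Hypothetical thresholds: the abstract names the three categories but not
# the rule that assigns them, so these cut-offs are illustrative only.
EXCELLENT_MIN = 0.8
GOOD_MIN = 0.5


def categorize(metric_passed: list[bool]) -> str:
    """Map the share of passed metric checks in one dimension to a category."""
    if not metric_passed:
        raise ValueError("at least one metric result is required")
    share = sum(metric_passed) / len(metric_passed)
    if share >= EXCELLENT_MIN:
        return "Excellent"
    if share >= GOOD_MIN:
        return "Good"
    return "Poor"


def evaluate(results: dict[str, list[bool]]) -> dict[str, str]:
    """Categorize each dimension (resemblance, utility, privacy) separately."""
    return {dimension: categorize(passed) for dimension, passed in results.items()}


# Made-up pass/fail outcomes for one dataset and one generation approach.
report = evaluate({
    "resemblance": [True, True, True, True, False],
    "utility": [True, False],
    "privacy": [False, False, True],
})
print(report)
```

Keeping one category per dimension, rather than a single pooled score, preserves the comparative view the abstract describes: two generation approaches can then be compared dimension by dimension on the same dataset.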
Keywords
synthetic tabular data generation - synthetic tabular data evaluation - resemblance evaluation - utility evaluation - privacy evaluation
Publication History
Received: June 13, 2022
Accepted: October 29, 2022
Article published online: January 9, 2023
© 2023. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany