Abstract
Background Deep generative models (DGMs) present a promising avenue for generating realistic synthetic data to augment existing health care datasets. However, it remains unclear how the completeness of the original dataset affects the quality of the synthetic data generated from it.
Objectives In this paper, we investigate the effect of data completeness on samples generated by the most common DGM paradigms.
Methods We create both cross-sectional and panel datasets with varying missingness and subset rates and train generative adversarial networks (GANs), variational autoencoders (VAEs), and autoregressive models (Transformers) on these datasets. We then compare the distributions of the generated data with those of the original training data to measure their similarity.
Results We find that greater incompleteness in the training data is associated with greater dissimilarity between the original samples and those generated by DGMs.
Conclusions Care must be taken when using DGMs to generate synthetic data, as data completeness issues can degrade the quality of the generated data in both panel and cross-sectional datasets.
Keywords
data quality - data completeness - case completeness - missingness - deep generative models