Evaluating the Impact of Health Care Data Completeness for Deep Generative Models

Benjamin Smith; Senne Van Steelandt; Anahita Khojandi

doi:10.1055/a-2023-9181

Methods Inf Med 2023; 62(01/02): 031-039
DOI: 10.1055/a-2023-9181

Original Article for Focus Theme

Benjamin Smith‡*
¹Bredesen Center, University of Tennessee, Knoxville, Tennessee, United States

Senne Van Steelandt‡*
²Department of Business Analytics and Statistics, University of Tennessee, Knoxville, Tennessee, United States

Anahita Khojandi
³Department of Industrial and Systems Engineering, University of Tennessee, Knoxville, Tennessee, United States

Abstract

Background: Deep generative models (DGMs) present a promising avenue for generating realistic, synthetic data to augment existing health care datasets. However, exactly how the completeness of the original dataset affects the quality of the generated synthetic data is unclear.

Objectives: In this paper, we investigate the effect of data completeness on samples generated by the most common DGM paradigms.

Methods: We create both cross-sectional and panel datasets with varying missingness and subset rates and train generative adversarial networks, variational autoencoders, and autoregressive models (Transformers) on these datasets. We then compare the distributions of generated data with original training data to measure similarity.

Results: We find that increased incompleteness is directly correlated with increased dissimilarity between original and generated samples produced through DGMs.

Conclusions: Care must be taken when using DGMs to generate synthetic data as data completeness issues can affect the quality of generated data in both panel and cross-sectional datasets.
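
As a rough illustration of the experimental setup described in the Methods, the sketch below degrades a toy cross-sectional dataset at several missingness and subset rates. It is not the authors' code: the function names (`apply_missingness`, `take_subset`, `mean_gap`), the missing-completely-at-random masking, the chosen rates, and the Gaussian toy data are all assumptions made for illustration. Model training is omitted entirely, and `mean_gap` is only a crude placeholder for a distributional similarity measure, which the abstract does not specify.

```python
import numpy as np


def apply_missingness(data, rate, rng):
    """Return a copy with roughly a fraction `rate` of entries set to NaN.

    Assumption: entries are masked uniformly at random (MCAR).
    """
    masked = data.astype(float).copy()
    mask = rng.random(data.shape) < rate
    masked[mask] = np.nan
    return masked


def take_subset(data, rate, rng):
    """Keep a random fraction `rate` of the rows, sampled without replacement."""
    n_keep = max(1, int(round(rate * data.shape[0])))
    idx = rng.choice(data.shape[0], size=n_keep, replace=False)
    return data[idx]


def mean_gap(reference, other):
    """Placeholder similarity check: mean absolute difference of column means."""
    gap = np.nanmean(reference, axis=0) - np.nanmean(other, axis=0)
    return float(np.mean(np.abs(gap)))


rng = np.random.default_rng(0)
original = rng.normal(size=(1000, 5))  # hypothetical toy dataset

for miss_rate in (0.0, 0.2, 0.5):      # hypothetical missingness rates
    for subset_rate in (1.0, 0.5):     # hypothetical subset rates
        degraded = take_subset(
            apply_missingness(original, miss_rate, rng), subset_rate, rng
        )
        observed = 1.0 - np.isnan(degraded).mean()
        print(
            f"missing={miss_rate:.1f} subset={subset_rate:.1f} "
            f"rows={degraded.shape[0]} observed={observed:.2f} "
            f"gap={mean_gap(original, degraded):.3f}"
        )
```

In a full experiment, each degraded dataset would be used to train a DGM, and the comparison would be made between samples drawn from that model and the original training data rather than between the original and degraded arrays.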

Keywords

data quality - data completeness - case completeness - missingness - deep generative models

* Contributed equally.

Publication History

Received: 29 June 2022

Accepted: 31 January 2023

Accepted Manuscript online: 31 January 2023

Article published online: 10 March 2023

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

