Methods Inf Med
DOI: 10.1055/a-2540-8284
Letter to the Editor

Why Synthetic Discoveries are Not Only a Problem of Differentially Private Synthetic Data

Heidelinde Dehaene,1 Alexander Decruyenaere,1 Christiaan Polet,1 Johan Decruyenaere,1 Paloma Rabaey,2 Thomas Demeester,2 Stijn Vansteelandt3

1 Ghent University Hospital – SYNDARA Research Group, Ghent, Belgium
2 Ghent University – Imec, Belgium
3 Department of Applied Mathematics, Computer Science, and Statistics, Ghent University, Belgium
Funding This research was funded by a grant received from the Fund for Innovation and Clinical Research of Ghent University Hospital. P.R.'s research is funded by the Research Foundation Flanders (FWO-Vlaanderen) with grant number 1170122N. This research also received funding from the Flemish government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.


We have read with interest the article entitled “Does differentially private synthetic data lead to synthetic discoveries?” by Perez et al,[1] published in Methods of Information in Medicine.

The authors argue that, in addition to preserving privacy, the generated data should have high utility, meaning that inferences obtained from the synthetic data should correspond closely to those obtained from the original data. We refer to this as “inferential utility”: the evaluation of whether the analysis of a synthetic sample can deliver valid estimates of a population parameter and valid tests of hypotheses (see also Decruyenaere et al).[2] We agree with their view that the validation of synthetic data should go beyond the popular utility metrics, which typically quantify how well the synthetic data resemble the real data or preserve their statistical information.
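
To make this notion concrete, the sketch below shows one way inferential utility can be probed by simulation: repeatedly draw an original sample from a known population, synthesize from it, and check how often a naive analysis of the synthetic sample (treating it as if it were the observed data) falsely rejects a true null hypothesis about the population mean. This is a minimal sketch in Python; the Gaussian parametric generator and all function names are our own illustrative choices, not the evaluation protocol of Decruyenaere et al.[2]

    import numpy as np

    def gaussian_synthesizer(x, rng):
        # Toy parametric generator: fit N(mean, sd) to the original sample
        # and resample from the fit. A stand-in for any generator
        # (statistical or deep learning, with or without DP guarantees).
        return rng.normal(x.mean(), x.std(ddof=1), size=x.size)

    def inferential_utility(synthesize, n=200, mu=0.0, sigma=1.0,
                            n_reps=5000, seed=1):
        # Monte Carlo check for the population mean: how often does a
        # naive 95% test on the synthetic sample, treated as if it were
        # observed, falsely reject the true H0: mu = 0?
        rng = np.random.default_rng(seed)
        z = 1.96  # standard normal 97.5% quantile
        means, naive_ses, rejects = [], [], []
        for _ in range(n_reps):
            x = rng.normal(mu, sigma, size=n)  # original sample
            y = synthesize(x, rng)             # synthetic sample
            se = y.std(ddof=1) / np.sqrt(n)    # naive SE ignores synthesis
            means.append(y.mean())
            naive_ses.append(se)
            rejects.append(abs(y.mean() - mu) > z * se)  # false discovery
        return (np.mean(rejects),        # empirical type-I error rate
                np.mean(naive_ses),      # average naive standard error
                np.std(means, ddof=1))   # actual SD of the synthetic mean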

Perez et al[1] concluded that synthetic data generation methods with differential privacy (DP) guarantees are prone to generating data from which false discoveries are likely to be made. In their conclusion, they postulate that false discoveries may be mitigated by ensuring that the original dataset is sufficiently large and by selecting methods that are less prone to adding false signal to the data (i.e., non-DP generators). We would like to critically reflect on these two statements, supported by the findings of Decruyenaere et al.[2]

In particular, Decruyenaere et al[2] investigated the inferential utility of tabular synthetic data across a wide range of generators and estimators. They found that drawing inferences from synthetic data as if they were actually observed leads to an unacceptable increase in false discoveries. This conclusion holds regardless of the method used to generate the synthetic data: both statistical (classical) approaches and deep learning (DL) approaches, including DP generators, suffer from this problem.
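
As a toy illustration of this phenomenon (and emphatically not a reproduction of the experiments in Decruyenaere et al[2]), running the sketch above with its simple Gaussian parametric generator already exhibits the inflation:

    type1, naive_se, true_sd = inferential_utility(gaussian_synthesizer)
    print(f"type-I error: {type1:.3f} (nominal 0.050)")
    print(f"naive SE: {naive_se:.4f} vs. actual SD of mean: {true_sd:.4f}")
    # In this Gaussian toy case the expected type-I error is near 0.17
    # rather than 0.05: the naive SE understates the variability of the
    # synthetic mean by a factor of about sqrt(2), as quantified below.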

To understand where this inflated number of false discoveries originates, simulation studies in Decruyenaere et al[2] confirmed the underestimation of the standard error of various estimators (such as the sample mean, proportions, and regression coefficients) when synthetic data are used. Learning the distribution of the original data in order to generate synthetic data creates excess variability that degrades the quality of the estimates. When data are generated using correct parametric models, this excess variability is easy to predict and can be corrected for, as in the corrected standard errors proposed by Raab et al.[3] However, this is no longer the case when synthetic data are generated using a DL approach, due to regularization bias that is difficult to remove. As demonstrated in Decruyenaere et al,[2] even these corrected standard errors then remain drastically underestimated, leading to an unacceptable increase in false discoveries.
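
For intuition, consider the toy Gaussian case once more. Let n be the original sample size, k the synthetic sample size, m the number of synthetic datasets over which an estimate \bar{q}_m is averaged, and \bar{v}_m the naive variance estimate computed from the synthetic data (the notation here is ours). The variance of the synthetic-data mean then decomposes into the sampling variability of the fitted generator plus the synthesis noise, which is, as we read it, what the correction of Raab et al[3] captures:

\[
\operatorname{Var}(\bar{q}_m) \;\approx\; \frac{\sigma^2}{n} + \frac{\sigma^2}{mk}
\;=\; \bar{v}_m \left( \frac{k}{n} + \frac{1}{m} \right),
\qquad \bar{v}_m \approx \frac{\sigma^2}{k}.
\]

For a single synthetic dataset of the same size as the original (m = 1, k = n), this equals 2\bar{v}_m: the naive standard error is too small by a factor of \sqrt{2}, so a nominal 5% test of a true null rejects with probability 2\Phi(-1.96/\sqrt{2}) \approx 0.17, matching the toy simulation above. For correct parametric generators, inflating the naive variance by the factor (k/n + 1/m) repairs the inference; for DL generators, as argued above, no such simple factor suffices.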

We thus conclude that standard statistical inference based on synthetic data, as currently generated by DL approaches, may be highly misleading. With this letter, we aim to counter a possible misreading of the work by Perez et al,[1] namely that the risk of unacceptably many false-positive findings can be avoided by stepping away from DP methods or by increasing the size of the original data. Instead, this risk remains present, regardless of DP guarantees, the size of the original data, or even previously proposed correction factors, whenever deep learning approaches are used to generate the synthetic data. This calls for the development of more refined synthetic data generation approaches with better utility for accurate statistical inference.



Publication History

Received: 18 September 2024

Accepted: 14 February 2025

Article published online: 15 April 2025

© 2025. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

 
  • References

  • 1 Montoya Perez I, Movahedi P, Nieminen V, Airola A, Pahikkala T. Does differentially private synthetic data lead to synthetic discoveries? Methods Inf Med 2024; 63 (1-02): 35-51
  • 2 Decruyenaere A, Dehaene H, Rabaey P, et al. The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data. Paper presented at: The 40th Conference on Uncertainty in Artificial Intelligence; July 18, 2024; Barcelona, Spain
  • 3 Raab GM, Nowok B, Dibben C. Practical data synthesis for large samples. J Priv Confid 2018; 7 (03): 69-97