Synthetic Data Generation for Validation of Clinical Research Software
DOI:
https://doi.org/10.63282/3050-9416.IJAIBDCMS-V6I4P126
Keywords:
Synthetic Data, Data Generation, Clinical Research Software, Software Validation, Data Simulation, Clinical Data Modeling, Electronic Health Records (EHR) Simulation, Patient Data Privacy, Regulatory Compliance
Abstract
Validation of clinical research software is limited by the lack of realistic, privacy-safe datasets capable of exercising complex protocol and workflow logic. This study proposes a hybrid synthetic-data generation framework that combines deterministic clinical simulation, deep generative modeling, and differential privacy to create statistically faithful, audit-traceable datasets tailored to the validation of decentralized clinical trial (DCT) and electronic clinical outcome assessment (eCOA) platforms. In a Phase-II–scale evaluation, the approach achieved 38% higher defect detection, a Jensen-Shannon (JS) divergence of 0.054, and a membership-inference AUC of approximately 0.52 (0.5 corresponds to random guessing), demonstrating that synthetic data can support scalable, privacy-preserving, and empirically rigorous validation of regulated clinical-research platforms.
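For readers unfamiliar with the two fidelity and privacy metrics quoted in the abstract, the following is a minimal sketch of how they are commonly computed. It is not the authors' evaluation code. The function names `js_divergence` and `membership_auc` are hypothetical, and the choice of base-2 logarithms (which bounds the divergence between 0 and 1) is an assumption, since the abstract does not state the convention used.

```python
import math
from collections import Counter


def js_divergence(real, synthetic):
    """Jensen-Shannon divergence between the empirical distributions of
    one categorical variable in a real and a synthetic sample.
    Assumption: base-2 logarithms, so the result lies in [0, 1]."""
    categories = set(real) | set(synthetic)
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    p = {c: real_counts[c] / len(real) for c in categories}
    q = {c: synth_counts[c] / len(synthetic) for c in categories}
    # Mixture distribution M = (P + Q) / 2
    m = {c: 0.5 * (p[c] + q[c]) for c in categories}

    def kl(a, b):
        # Kullback-Leibler divergence; zero-probability terms contribute 0
        return sum(a[c] * math.log2(a[c] / b[c]) for c in categories if a[c] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def membership_auc(member_scores, non_member_scores):
    """AUC of a membership-inference attack, computed as the probability
    that a training-set member receives a higher attack score than a
    non-member (ties count as one half)."""
    wins = sum(
        (m > n) + 0.5 * (m == n)
        for m in member_scores
        for n in non_member_scores
    )
    return wins / (len(member_scores) * len(non_member_scores))
```

Under these conventions, a divergence near 0 indicates that the synthetic marginal closely matches the real one, and an AUC near 0.5 indicates that the attacker cannot distinguish training records from unseen records better than chance.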