Clean Before Predict: A Governance-First Methodology for High-Stakes AI Systems

Nidhi Singh

doi:10.63282/3050-9416.IJAIBDCMS-V7I2P109

Authors

Nidhi Singh, Senior Data Analyst, State of Alabama, AL, USA.

DOI:

https://doi.org/10.63282/3050-9416.IJAIBDCMS-V7I2P109

Keywords:

Governance-First AI, Clean-Before-Predict (CFP), High-Stakes AI Systems, Healthcare AI, Data Quality, Data Governance, Fairness in AI, Risk-Aware Machine Learning, Trustworthy AI, Clinical Decision Support, Bias Mitigation, MIMIC-III Dataset

Abstract

High-stakes artificial intelligence (AI) systems, particularly in clinical domains, require not only predictive accuracy but also robustness, fairness, and reliability. Conventional machine learning pipelines focus mainly on optimizing predictive performance and often neglect data quality, bias, and risk, which can lead to unsafe or unreliable results. To address this shortcoming, this paper proposes a Governance-First, Clean-Before-Predict (CFP) framework that reorganizes the traditional pipeline by enforcing data cleaning and governance constraints before model training. The proposed methodology comprises four phases: data cleaning and quality checks, governance enforcement based on fairness and compliance indicators, risk-sensitive predictive modeling, and comprehensive evaluation. Experiments on the MIMIC-III clinical dataset with Logistic Regression, Random Forest, and XGBoost show that the CFP framework performs comparably to baseline models in terms of accuracy, F1-score, and ROC-AUC, while improving data reliability and reducing the influence of noisy and biased samples. Notably, XGBoost performs better in the CFP setting. These findings suggest that the proposed approach improves stability and reliability without substantially reducing predictive accuracy, making it suitable for high-stakes AI systems.
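
The ordering described above, with cleaning and a governance check placed ahead of any model training, can be illustrated with a minimal sketch. This is not the paper's implementation: the `Record` fields, the plausibility ranges, the thresholds `min_group_share` and `max_rate_gap`, and the function names are all hypothetical, chosen only to show how a pipeline might refuse to train when the cleaned data fails a governance gate.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Record:
    # All fields are hypothetical stand-ins for clinical features.
    age: Optional[float]
    lactate: Optional[float]
    group: str   # assumed protected attribute
    label: int   # 1 = adverse outcome, 0 = otherwise


def clean(records):
    """Phase 1 (sketch): drop rows with missing or implausible values."""
    kept = []
    for r in records:
        if r.age is None or r.lactate is None:
            continue  # missing value
        if not (0 <= r.age <= 120) or r.lactate < 0:
            continue  # outside an assumed plausible range
        kept.append(r)
    return kept


def governance_gate(records, min_group_share=0.2, max_rate_gap=0.3):
    """Phase 2 (sketch): check group representation and label-rate gap.

    The two thresholds are illustrative assumptions, not values from the paper.
    """
    labels_by_group = {}
    for r in records:
        labels_by_group.setdefault(r.group, []).append(r.label)
    if len(labels_by_group) < 2:
        return False, {"reason": "fewer than two groups present"}
    total = len(records)
    shares = {g: len(v) / total for g, v in labels_by_group.items()}
    rates = {g: sum(v) / len(v) for g, v in labels_by_group.items()}
    gap = max(rates.values()) - min(rates.values())
    passed = min(shares.values()) >= min_group_share and gap <= max_rate_gap
    return passed, {"shares": shares, "rate_gap": gap}


def clean_before_predict(records, train_fn: Callable):
    """Run cleaning and the governance gate, then train only if the gate passes."""
    cleaned = clean(records)
    passed, report = governance_gate(cleaned)
    if not passed:
        raise ValueError(f"governance gate failed: {report}")
    return train_fn(cleaned), report
```

Here `train_fn` stands in for any model-fitting routine, such as one of the three classifiers named in the abstract. Raising an error when the gate fails is one possible design choice; a real system might instead log the report and route the dataset for review.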
