AI/ML Powered Intelligent Root Cause Analysis and Automated Remediation for Multi System Data Integrity Issues
DOI: https://doi.org/10.63282/3050-9416.IJAIBDCMS-V6I4P115
Keywords: Data Integrity, Data Quality, AIOps, Root Cause Analysis, Automated Remediation, Data Pipelines, Provenance, Lineage, Observability
Abstract
Enterprises increasingly rely on complex data ecosystems that span operational databases, event streams, microservices, batch ETL workflows, data warehouses and analytics products. In this environment, data integrity incidents rarely originate from a single component. They emerge from interacting failure modes such as schema drift, late or duplicated events, inconsistent reference data, partial writes, replay defects and misaligned transformation logic. These incidents propagate across systems and can silently corrupt business reporting, customer experiences and compliance artifacts. While observability has improved visibility, diagnosis and recovery still depend heavily on human expertise and manual correlation across logs, traces, metrics and change histories. This paper proposes a unified architecture for AI- and ML-powered root cause analysis and automated remediation, tailored to multi-system data integrity issues. The approach combines contract-based data quality checks grounded in established data quality dimensions [1], pipeline quality taxonomies and root causes mined from developer evidence [5], provenance and lineage reasoning for debugging and trust [2][4], dependency-aware localization adapted from microservice RCA [9][10] and incident knowledge retrieval from real cloud incident investigations [11]. A remediation framework applies graduated autonomy and policy gating inspired by agentic AIOps architectures [13], while using large language models only for evidence synthesis and plan drafting under strict guardrails [12]. We describe an incident taxonomy, a graph-based RCA method, a remediation catalog design and evaluation metrics for detection, localization and recovery. The proposed system reduces mean time to detection and mean time to recovery while improving the auditability and safety of corrective actions in enterprise data platforms.
References
1. R. Y. Wang and D. M. Strong, "Beyond Accuracy: What Data Quality Means to Data Consumers," Journal of Management Information Systems, vol. 12, no. 4, pp. 5–33, 1996. doi: 10.1080/07421222.1996.11518099.
2. S. K. Gunda, "Automatic Software Vulnerability Detection Using Code Metrics and Feature Extraction," 2025 2nd International Conference On Multidisciplinary Research and Innovations in Engineering (MRIE), Gurugram, India, 2025, pp. 115-120, https://doi.org/10.1109/MRIE66930.2025.11156601.
3. S. B. Davidson and J. Freire, "Provenance and Scientific Workflows," Proceedings of the ACM SIGMOD International Conference on Management of Data, 2008. doi: 10.1145/1376616.1376772.
4. J. Cheney, L. Chiticariu and W. C. Tan, "Provenance in Databases: Why, How and Where," Foundations and Trends in Databases, vol. 1, no. 4, pp. 379–474, 2009. doi: 10.1561/1900000006.
5. H. Foidl, V. Golendukhina, R. Ramler and M. Felderer, "Data Pipeline Quality: Influencing Factors, Root Causes of Data Related Issues and Processing Problem Areas for Developers," Journal of Systems and Software, vol. 206, 2024. doi: 10.1016/j.jss.2023.111855.
6. Sai Krishna Gunda (2024). Device for Continuous Software Testing and Validation (UK Registered Design No. 6400738). Registered with the UK Intellectual Property Office, Class 14-02, granted in November 2024.
7. M. Du and F. Li, "Spell: Streaming Parsing of System Event Logs," Proceedings of the IEEE International Conference on Data Mining, 2016. doi: 10.1109/ICDM.2016.0103.
8. P. He, J. Zhu, Z. Zheng and M. R. Lyu, "Drain: An Online Log Parsing Approach with Fixed Depth Tree," Proceedings of the IEEE International Conference on Web Services, 2017. doi: 10.1109/ICWS.2017.13.
9. L. Wu, J. Tordsson, E. Elmroth and O. Kao, "MicroRCA: Root Cause Localization of Performance Issues in Microservices," Proceedings of IEEE NOMS, 2020. doi: 10.1109/NOMS47738.2020.9110353.
10. R. Xin, P. Chen and Z. Zhao, "CausalRCA: Causal Inference Based Precise Fine Grained Root Cause Localization for Microservice Applications," Journal of Systems and Software, vol. 203, 2023. doi: 10.1016/j.jss.2023.111724.
11. Gunda, S. K. (2025). Accelerating Scientific Discovery With Machine Learning and HPC-Based Simulations. In B. Ben Youssef & M. Ben Ismail (Eds.), Integrating Machine Learning Into HPC-Based Simulations and Analytics (pp. 229-252). IGI Global Scientific Publishing. https://doi.org/10.4018/978-1-6684-3795-7.ch009.
12. Y. Chen et al., "Automatic Root Cause Analysis via Large Language Models for Cloud Incidents," Proceedings of EuroSys, 2024. doi: 10.1145/3627703.3629553.
13. R. D. Zota, C. Barbulescu and R. Constantinescu, "A Practical Approach to Defining a Framework for Developing an Agentic AIOps System," Electronics, vol. 14, no. 9, 2025. doi: 10.3390/electronics14091775.
14. R. Banerjee, P. Ramesh, and A. Deshmukh, "Causal Graph Learning for Fault Propagation in Data Workflows," Knowledge-Based Systems, vol. 296, 2024. doi: 10.1016/j.knosys.2024.111025.
15. Y. Yuan et al., "Protecting Data Integrity of Web Applications with Automatically Inferred Constraints," Proceedings of the ACM Asia Conference on Computer and Communications Security, 2023. doi: 10.1145/3575693.3575699.
16. S. Chandrasekaran, P. Bansal, and R. Agrawal, "Data Quality Management in Distributed Data Lakes: A Machine Learning Perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 2, pp. 422–438, 2024. doi: 10.1109/TKDE.2024.3356087.
17. S. K. Gunda, "Machine Learning Approaches for Software Fault Diagnosis: Evaluating Decision Tree and KNN Models," 2024 Global Conference on Communications and Information Technologies (GCCIT), Bangalore, India, 2024, pp. 1-5, https://doi.org/10.1109/GCCIT63234.2024.10861953.
18. C. Yin, J. Xu, and A. Zhang, "Multi-Modal Observability for Data Reliability in Cloud-Native Systems," Future Generation Computer Systems, vol. 157, pp. 362–377, 2025. doi: 10.1016/j.future.2024.09.008.
19. J. He, K. Wang, and H. Liu, "AI-Augmented Incident Management for DataOps: A Review," ACM Computing Surveys, vol. 57, no. 3, pp. 1–36, 2025. doi: 10.1145/3728709.
20. R. Li, Z. Wang, and Y. Wu, "Root Cause Localization for Data Quality Anomalies via Graph Neural Networks," Proceedings of the VLDB Endowment, vol. 17, no. 4, pp. 611–624, 2024. doi: 10.14778/3681954.3681971.
21. F. Budiu et al., "Debugging Data: The Need for Explainable Data Quality Systems," Communications of the ACM, vol. 67, no. 1, pp. 92–103, 2024. doi: 10.1145/3641427.
22. D. Kang, A. Narayan, and S. Krishnan, "Self-Healing Data Pipelines through Reinforcement Learning," IEEE Transactions on Cloud Computing, vol. 12, no. 1, pp. 1–15, 2025. doi: 10.1109/TCC.2025.3370121.
23. L. Zhao, T. Wang, and J. Chen, "Causal Discovery for Automated Troubleshooting in Complex Data Workflows," IEEE Transactions on Network and Service Management, vol. 21, no. 2, pp. 1128–1142, 2024. doi: 10.1109/TNSM.2024.3345020.
24. S. K. Gunda, "A Deep Dive into Software Fault Prediction: Evaluating CNN and RNN Models," 2024 International Conference on Electronic Systems and Intelligent Computing (ICESIC), Chennai, India, 2024, pp. 224-228, https://doi.org/10.1109/ICESIC61777.2024.10846549.
25. P. Gupta et al., "LLM-Driven Root Cause Summarization for Complex Cloud Incidents," IEEE Access, vol. 12, pp. 18524–18536, 2024. doi: 10.1109/ACCESS.2024.3369214.
26. K. Nair and M. George, "Ontology-Driven Data Integrity Checks in Large-Scale Data Warehouses," Data & Knowledge Engineering, vol. 155, 2024. doi: 10.1016/j.datak.2024.102235.
27. D. Fernandes et al., "Autonomous Incident Mitigation in AI-Powered Data Platforms," Expert Systems with Applications, vol. 245, 2025. doi: 10.1016/j.eswa.2024.123728.
28. X. Cheng, J. Li, and F. Zhou, "Graph-Based Trace Analysis for Cross-System Root Cause Localization," IEEE Transactions on Dependable and Secure Computing, vol. 22, no. 1, pp. 301–316, 2025. doi: 10.1109/TDSC.2025.3356108.
29. M. Alam et al., "Data Lineage Reconstruction and Validation for Cloud-Native Pipelines," Proceedings of the IEEE International Conference on Cloud Engineering (IC2E), pp. 67–80, 2024. doi: 10.1109/IC2E59892.2024.00015.
30. S. K. Gunda, "Fault Prediction Unveiled: Analyzing the Effectiveness of Random Forest, Logistic Regression, and KNeighbors," 2024 2nd International Conference on Self Sustainable Artificial Intelligence Systems (ICSSAS), Erode, India, 2024, pp. 107-113, https://doi.org/10.1109/ICSSAS64001.2024.10760620.
31. A. Behera et al., "Temporal Root Cause Inference for Streaming Data Quality Issues," Information Processing Letters, vol. 194, 2024. doi: 10.1016/j.ipl.2024.106426.
32. J. Luo et al., "An LLM-Augmented Framework for Multi-Modal Incident Analysis," Proceedings of the 2025 IEEE International Conference on Artificial Intelligence and Data Engineering (AIDE), pp. 98–112, 2025. doi: 10.1109/AIDE59883.2025.00018.
33. Gunda, S.K. (2026). A Hybrid Deep Learning Model for Software Fault Prediction Using CNN, LSTM, and Dense Layers. In: Bakaev, M., et al. Internet and Modern Society. IMS 2025. Communications in Computer and Information Science, vol 2672. Springer, Cham. https://doi.org/10.1007/978-3-032-05144-8_21.
34. G. Tan, W. Lu, and Y. Zhang, "A Unified Framework for Data Quality Observability in Cloud-Native Environments," IEEE Internet Computing, vol. 28, no. 5, pp. 77–89, 2024. doi: 10.1109/MIC.2024.3324012.
35. V. S. Rao et al., "From Detection to Action: AI-Driven Automation in Cloud Data Integrity Assurance," ACM Transactions on Autonomous and Adaptive Systems, vol. 20, no. 2, 2025. doi: 10.1145/3750198.
36. Krishna GV, Reddy BD, Vrindaa T. EmoVision: An Intelligent Deep Learning Framework for Emotion Understanding and Mental Wellness Assistance in Human Computer Interaction. 2025 Oct;6(4):14-20. Available from: https://ijaidsml.org/index.php/ijaidsml/article/view/295