AI Report - Federated AIOps for Multi-Cluster OpenShift

Authors

  • Siva Kantha Rao Vanama, Cloud Solution Architect, Mphasis Corporation, Tampa, Florida, USA

DOI:

https://doi.org/10.63282/3050-9416.IJAIBDCMS-V6I2P111

Keywords:

Federated AIOps, Multi-Cluster OpenShift, Anomaly Detection, Mean Time to Resolve (MTTR), Root Cause Analysis (RCA)

Abstract

Multi-cluster OpenShift systems produce large volumes of data, as many as 400,000 metrics and 200 GB of logs per cluster each day, which overwhelms operations teams. This study shows that integrating Federated AIOps decreases Mean Time to Detect (MTTD) by 40-60%, reduces Mean Time to Resolve (MTTR) by 25-55%, and cuts the number of alerts by 70%. Building on a federated AIOps model that combines OpenShift RHACM, Prometheus, OpenTelemetry, and a cross-cluster machine learning control plane, the research proposes an inference-based approach to automating anomaly detection and root cause analysis (RCA) across geographically separated clusters. The study used data from six clusters spanning AWS, Azure, and on-premises systems, with 600+ nodes, 12,000+ pods, and 350 microservices. The findings indicate that Federated AIOps achieved 94.3% accuracy in data correlation, reduced false-positive alerts to fewer than 6 occurrences, and added less than 2.5% CPU overhead per node. Federated AIOps can reduce alert duplication and streamline incident management, offering an open-source alternative to proprietary AIOps platforms that is low-cost and scalable in large-scale settings. The study identifies the practical effect of Federated AIOps on operational efficiency, resource utilization, and adherence to data privacy standards, and reports significant cost reductions in multi-cluster OpenShift administration.
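The abstract does not specify how the cross-cluster control plane combines per-cluster signals. As a purely illustrative sketch, and not the author's implementation, the Python below shows one common federated pattern: each cluster shares only a small statistical summary of a metric, a central step merges the summaries, and each cluster then flags its own samples against the merged baseline. All names here (`LocalSummary`, `summarize`, `merge`, `find_anomalies`) and the z-score threshold are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Iterable, List


@dataclass(frozen=True)
class LocalSummary:
    """Per-cluster summary; in this sketch it is the only data that leaves a cluster."""
    count: int
    mean: float
    m2: float  # sum of squared deviations from the mean


def summarize(samples: Iterable[float]) -> LocalSummary:
    """Compute count, mean, and squared-deviation sum for one cluster's metric."""
    xs = list(samples)
    if not xs:
        raise ValueError("summarize() needs at least one sample")
    mean = sum(xs) / len(xs)
    m2 = sum((x - mean) ** 2 for x in xs)
    return LocalSummary(len(xs), mean, m2)


def merge(summaries: List[LocalSummary]) -> LocalSummary:
    """Combine per-cluster summaries into one global summary (pooled mean and variance)."""
    if not summaries:
        raise ValueError("merge() needs at least one summary")
    total = sum(s.count for s in summaries)
    mean = sum(s.count * s.mean for s in summaries) / total
    # Within-cluster spread plus each cluster's offset from the global mean.
    m2 = sum(s.m2 + s.count * (s.mean - mean) ** 2 for s in summaries)
    return LocalSummary(total, mean, m2)


def find_anomalies(samples: Iterable[float], global_summary: LocalSummary,
                   z_threshold: float = 3.0) -> List[float]:
    """Return local samples whose z-score against the global baseline exceeds the threshold."""
    std = sqrt(global_summary.m2 / global_summary.count)
    if std == 0:
        return []
    return [x for x in samples
            if abs(x - global_summary.mean) / std > z_threshold]
```

In this sketch the merged summary is numerically identical to one computed over the pooled raw samples, so the raw metrics never need to be centralized. A real deployment would likely use richer models than a single z-score, which the abstract does not detail.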

Published

2025-05-20

How to Cite

1. Vanama SKR. AI Report - Federated AIOps for Multi-Cluster OpenShift. IJAIBDCMS [Internet]. 2025 May 20 [cited 2026 Mar. 15];6(2):96-108. Available from: https://ijaibdcms.org/index.php/ijaibdcms/article/view/336