From Detection to Action: AI-Driven Anomaly Detection and Root Cause Synthesis for Cloud Infrastructure Operations
DOI: https://doi.org/10.63282/3050-9416.IJAIBDCMS-V7I2P126
Keywords: AI-Driven Anomaly Detection, Cloud Infrastructure Operations, Root Cause Analysis, Root Cause Synthesis, Observability, AIOps (Artificial Intelligence for IT Operations), Incident Management, Time Series Analysis, Event Correlation, Fault Detection and Diagnosis, Machine Learning in Cloud Operations, Automated Remediation, Log and Metric Analytics, Distributed Systems Monitoring, Predictive Maintenance
Abstract
Cloud infrastructure operations teams face a persistent gap between observability and understanding: monitoring platforms surface anomalies but leave root cause investigation to engineers, typically requiring 60–90 minutes of manual dashboard navigation, log searching, and historical trend analysis per incident. We present ARGUS, a closed-loop observability system that bridges this gap by automatically progressing from anomaly detection to actionable root cause guidance without human intervention. ARGUS combines three components: a four-algorithm detection ensemble (Z-Score, Moving Average, IQR, and Trend Analysis) targeting distinct anomaly shapes in Prometheus metrics; an automated log correlation engine that queries Grafana Loki to surface relevant evidence for each detected anomaly; and a Llama-3.3-70B-Instruct-based synthesis module that generates a structured natural-language root cause analysis comprising a root cause with confidence rating, corroborating analysis, a log excerpt from the anomaly window, and recommended immediate and follow-up actions. Deployed on Azure Kubernetes Service and operated in production for five months on a partner-facing insurance quote API platform, ARGUS has processed over 100 anomaly events spanning both infrastructure and business metrics, including partner impression and click-through rates, using the same Prometheus pipeline. Evaluation demonstrates a reduction in time to first actionable insight from approximately 90 minutes to under 5 minutes, a greater than 94% improvement, with LLM-generated root cause analyses rated accurate in 80–85% of assessed incidents. The architecture relies exclusively on open-source observability tooling and an open-weight language model, making it reproducible by any team operating a Prometheus and Loki stack without proprietary cloud dependencies.
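The four-detector ensemble named in the abstract could be sketched as follows. This is an illustrative Python reconstruction, not the authors' implementation: the thresholds (z-score cutoff of 3, trailing-window size of 5, Tukey fence factor of 1.5, slope threshold of 0.1) and the two-vote agreement rule are assumed defaults chosen for the sketch.

```python
import statistics

def zscore_anomaly(series, threshold=3.0):
    """Flag the last point if it lies > threshold standard deviations from the mean."""
    stdev = statistics.pstdev(series)
    if stdev == 0:
        return False
    return abs(series[-1] - statistics.mean(series)) / stdev > threshold

def moving_average_anomaly(series, window=5, tolerance=0.5):
    """Flag the last point if it deviates from the trailing-window mean
    by more than `tolerance` as a fraction of that mean."""
    baseline = statistics.mean(series[-window - 1:-1])
    if baseline == 0:
        return False
    return abs(series[-1] - baseline) / abs(baseline) > tolerance

def iqr_anomaly(series, k=1.5):
    """Flag the last point if it falls outside the Tukey fences
    [Q1 - k*IQR, Q3 + k*IQR] computed over the preceding points."""
    q1, _, q3 = statistics.quantiles(series[:-1], n=4)
    iqr = q3 - q1
    return series[-1] < q1 - k * iqr or series[-1] > q3 + k * iqr

def trend_anomaly(series, slope_threshold=0.1):
    """Flag a sustained drift via the least-squares slope over the window."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = statistics.mean(series)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return abs(num / den) > slope_threshold

def ensemble_detect(series, min_votes=2):
    """Report an anomaly when at least `min_votes` detectors agree,
    returning the verdict and the names of the detectors that fired."""
    detectors = (zscore_anomaly, moving_average_anomaly, iqr_anomaly, trend_anomaly)
    votes = [d.__name__ for d in detectors if d(series)]
    return len(votes) >= min_votes, votes
```

Running the detectors as a voting ensemble reflects the abstract's premise that each algorithm targets a distinct anomaly shape: spikes trip the z-score, moving-average, and IQR checks, while a slow drift registers only on the slope test.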