End-to-End Visibility in Cloud Deployments: Building Real-Time Program Health Systems
DOI:
https://doi.org/10.63282/3050-9416.IJAIBDCMS-V6I4P117
Keywords:
Cloud Observability, Telemetry Correlation, Anomaly Detection, Real-Time Monitoring, Program Health, Cloud-Native Systems
Abstract
Cloud-native applications now run across distributed services, containers, and serverless functions, each emitting its own logs, metrics, traces, and events. While modern observability tools collect these signals effectively, they tend to process them in isolation, leaving engineers to correlate symptoms manually during incidents. This fragmentation slows detection, obscures root-cause analysis, and weakens real-time understanding of program health. This paper introduces the Real-Time Program Health (RTPH) Framework, a multi-layer model that unifies telemetry ingestion, real-time stream processing, machine-learning-based anomaly detection, and health scoring into a single, interpretable view of system behavior. RTPH is evaluated in a hybrid cloud environment running microservice workloads on Kubernetes, with synthetic faults injected under controlled conditions. Its performance is compared against established observability stacks that include metrics, logging, and tracing tools. Experimental results show that RTPH reduces anomaly detection latency by 32–45%, lowers false-positive alerts by 28–40%, and correctly correlates 87–93% of cross-service anomalies, while keeping CPU and memory overhead below 8% and 6% per node, respectively. These findings indicate that unified, real-time health modeling can provide more accurate, actionable visibility into cloud deployments than traditional, signal-specific monitoring approaches.