Availability without Recovery: A Class of Failures Not Captured by SLAs
DOI:
https://doi.org/10.63282/3050-9416.IJAIBDCMS-V3I3P118
Keywords:
Availability, Recovery, SLA Limitations, Resilience Engineering, Fault Tolerance, Incident Response, Service Continuity, Observability, Reliability Metrics, MTTR
Abstract
Availability is usually regarded as the most important reliability goal: if the service stays "up," users can keep working, revenue continues, and trust is preserved. But a system optimized for this single dimension can still neglect another crucial attribute of resilience: recovery. We highlight the gap between what availability actually means and how the term is used in industry. SLAs typically specify uptime or error rates, but they rarely account for a system's ability to recover from faults quickly, safely, and fully. As a result, contractual availability requirements can be met even during prolonged network degradation, recurring problems, or half-working devices that never return to normal. We define this situation as availability without recovery: services are available on paper (passing health checks and meeting SLA criteria) yet persistently fail to restore performance, correctness, or capacity after a disruption. Besides establishing a framework for this phenomenon, we describe its characteristic failure signatures (e.g., repeated retries, accumulating backlogs, reliance on services running in a degraded state, and "green dashboards" that mask loss of functionality) and propose recovery-specific metrics that complement conventional availability metrics. Using an example from a real-world distributed production system, we show that systems affected by dependency cascades, misfiring automation, and silent resource shortages can maintain high availability even as their recovery cycles fail.
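The gap described above can be illustrated with a small sketch. It is not the paper's metric set: the `Sample` fields, the `minutes_to_recover` function, the 1.2 tolerance factor, and the synthetic traces are all assumptions chosen for illustration. It contrasts an uptime-style availability figure with a simple recovery measure on two traces that both pass every health check.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int
    health_check_ok: bool   # what an uptime-style SLA typically observes
    p95_latency_ms: float   # performance actually experienced (hypothetical field)
    backlog: int            # queued work not yet processed (hypothetical field)


def availability(samples):
    """Fraction of samples whose health check passed."""
    return sum(s.health_check_ok for s in samples) / len(samples)


def minutes_to_recover(samples, fault_minute, baseline_latency_ms, tolerance=1.2):
    """Minutes from the fault until latency and backlog are back at baseline
    and stay there through the end of the trace; None if that never happens.
    Assumes samples are sorted by minute."""
    after = [s for s in samples if s.minute >= fault_minute]
    recovered_at = None
    # Walk backwards: the recovery point is the start of the final healthy run.
    for s in reversed(after):
        healthy = (s.p95_latency_ms <= baseline_latency_ms * tolerance
                   and s.backlog == 0)
        if not healthy:
            break
        recovered_at = s.minute
    return None if recovered_at is None else recovered_at - fault_minute


def make_trace(recovers_at=None):
    """Synthetic 60-minute trace with a fault at minute 10.
    Health checks pass throughout; only latency and backlog degrade."""
    trace = []
    for m in range(60):
        degraded = m >= 10 and (recovers_at is None or m < recovers_at)
        trace.append(Sample(m, True,
                            400.0 if degraded else 100.0,
                            50 if degraded else 0))
    return trace


stuck = make_trace()                 # never returns to baseline
healed = make_trace(recovers_at=25)  # returns to baseline at minute 25

print(availability(stuck), minutes_to_recover(stuck, 10, 100.0))
print(availability(healed), minutes_to_recover(healed, 10, 100.0))
```

Both traces report the same availability, so an uptime-only SLA cannot tell them apart. Only the recovery measure separates the service that healed from the one that stayed degraded.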