Reinforcement Learning for Adaptive Resource Management in Cloud Systems
DOI:
https://doi.org/10.63282/3050-9416.ICAIDSCT26-129
Keywords:
Reinforcement Learning, Cloud Resource Management, Adaptive Scheduling, Auto-Scaling, Dynamic Resource Allocation, Cloud Computing, Intelligent Orchestration, Performance Optimization, QoS Management, AI-driven Cloud Systems
Abstract
Cloud software systems operate under unpredictable and constantly changing workloads. Static provisioning and rule-based auto-scaling strategies often respond too late to performance issues or waste resources during low-demand periods. This creates a trade-off between cost efficiency and service reliability that traditional approaches struggle to balance. This paper presents a reinforcement-learning-based framework for adaptive cloud resource management. The system learns how to allocate computing resources by interacting with the cloud environment and observing the long-term outcomes of its decisions. Cloud management is modeled as a sequential decision process in which the learning agent balances performance, cost, and service-level agreement (SLA) compliance. We evaluate the proposed approach on simulated cloud workloads and compare it with threshold-based and reactive scaling strategies. Results show improved resource utilization, fewer SLA violations, and smoother adaptation to workload changes. The findings suggest that reinforcement learning offers a practical foundation for building self-adaptive cloud systems that improve over time without manual rule tuning.
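As an illustration of the sequential decision formulation described in the abstract, the following is a minimal sketch of tabular Q-learning applied to a toy auto-scaling problem. It is not the framework evaluated in this paper: the environment, the state encoding (server count and a demand bucket), the reward weights, and every name and constant below are hypothetical choices made only to show how an agent might trade off resource cost against an SLA-violation penalty.

```python
import random

# Hypothetical toy setup; none of these names or constants come from the paper.
MAX_SERVERS = 10
CAPACITY_PER_SERVER = 10   # requests one server can handle per time step
SERVER_COST = 1.0          # cost of running one server for one step
SLA_PENALTY = 5.0          # penalty per unserved request (SLA-violation proxy)
ACTIONS = (0, 1, -1)       # hold, add a server, remove a server (hold wins ties)


def reward(servers, demand):
    """Negative total cost: server cost plus a penalty for unmet demand."""
    unmet = max(0, demand - servers * CAPACITY_PER_SERVER)
    return -(SERVER_COST * servers + SLA_PENALTY * unmet)


def step(servers, action):
    """Apply a scaling action, keeping the server count in [1, MAX_SERVERS]."""
    return min(MAX_SERVERS, max(1, servers + action))


def demand_level(demand):
    """Discretize demand into the number of servers needed to serve it."""
    return min(MAX_SERVERS, -(-demand // CAPACITY_PER_SERVER))  # ceiling division


def greedy_action(q, servers, demand):
    """Return the highest-valued action for a state (hold if the state is unseen)."""
    values = q.get((servers, demand_level(demand)), [0.0] * len(ACTIONS))
    return ACTIONS[max(range(len(ACTIONS)), key=values.__getitem__)]


def train(episodes=300, horizon=50, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn a Q-table on a synthetic random-walk workload."""
    rng = random.Random(seed)
    q = {}  # maps (servers, demand_level) to a list of action values
    for _ in range(episodes):
        servers, demand = 1, rng.randint(0, 100)
        for _ in range(horizon):
            values = q.setdefault((servers, demand_level(demand)),
                                  [0.0] * len(ACTIONS))
            # Epsilon-greedy exploration over the three scaling actions.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=values.__getitem__)
            servers = step(servers, ACTIONS[a])
            # Synthetic workload: bounded random walk in demand.
            demand = min(100, max(0, demand + rng.randint(-10, 10)))
            r = reward(servers, demand)
            next_values = q.setdefault((servers, demand_level(demand)),
                                       [0.0] * len(ACTIONS))
            # Standard one-step Q-learning update.
            values[a] += alpha * (r + gamma * max(next_values) - values[a])
    return q


if __name__ == "__main__":
    q_table = train()
    print("Suggested action at 2 servers, demand 60:", greedy_action(q_table, 2, 60))
```

In this sketch the ratio of `SLA_PENALTY` to `SERVER_COST` controls how aggressively the learned policy over-provisions; a threshold-based baseline would instead hard-code that trade-off as fixed utilization rules.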