Scalable Data Architectures for Real-Time Big Data Analytics: A Comparative Study of Hadoop, Spark, and Kafka
DOI:
https://doi.org/10.63282/3050-9416.IJAIBDCMS-V1I4P102
Keywords:
Hadoop, Spark, Kafka, big data, batch processing, real-time processing, scalability, performance, data pipelines, machine learning
Abstract
Businesses and organizations increasingly need to process and analyze large volumes of data in real time to gain actionable insights. This paper presents a comparative study of three prominent big data frameworks: Hadoop, Spark, and Kafka. Each framework is evaluated on its scalability, performance, ease of use, and suitability for real-time data processing. The study analyzes the architectural components, algorithms, and use cases of each framework, and compares the three through benchmark tests and real-world scenarios to highlight their respective strengths and weaknesses. The findings are intended to help data engineers and architects select the most appropriate framework for their specific big data processing needs.