Scalable Data Architectures for Real-Time Big Data Analytics: A Comparative Study of Hadoop, Spark, and Kafka

Authors

  • Dr. Johan Muller, Technical University of Munich, AI & Big Data Lab, Germany
  • Dr. Linda Fischer, University of Stuttgart, AI Research Hub, Germany

DOI:

https://doi.org/10.63282/3050-9416.IJAIBDCMS-V1I4P102

Keywords:

Hadoop, Spark, Kafka, big data, batch processing, real-time processing, scalability, performance, data pipelines, machine learning

Abstract

In the era of big data, the ability to process and analyze vast amounts of data in real time is crucial for businesses and organizations seeking actionable insights. This paper presents a comparative study of three prominent big data processing frameworks: Hadoop, Spark, and Kafka. Each framework is evaluated on its scalability, performance, ease of use, and suitability for real-time data processing. The study analyzes the architectural components, algorithms, and use cases of each framework, and compares them through benchmark tests and real-world scenarios to highlight the strengths and weaknesses of each technology. The findings aim to assist data engineers and architects in selecting the most appropriate framework for their specific big data processing needs.

Published

2020-12-05

Section

Articles

How to Cite

1. Muller J, Fischer L. Scalable Data Architectures for Real-Time Big Data Analytics: A Comparative Study of Hadoop, Spark, and Kafka. IJAIBDCMS [Internet]. 2020 Dec. 5 [cited 2025 Oct. 26];1(4):8-18. Available from: https://ijaibdcms.org/index.php/ijaibdcms/article/view/24