Review of Streaming ETL Pipelines for Data Warehousing: Tools, Techniques, and Best Practices

Authors

  • Vaibhav Maniar, Oklahoma City University, MBA / Product Management
  • Vetrivelan Tamilmani, Principal Consultant (SAP), Infosys Ltd.
  • Rami Reddy Kothamaram, California University of Management and Science, MS in Computer Information Systems
  • Dinesh Rajendran, Coimbatore Institute of Technology, MSc Software Engineering
  • Venkata Deepak Namburi, University of Central Missouri, Department of Computer Science
  • Aniruddha Arjun Singh Singh, ADP, Sr. Implementation Project Manager

DOI:

https://doi.org/10.63282/3050-9416.IJAIBDCMS-V2I3P109

Keywords:

Streaming ETL, Data Warehousing, Real-Time Analytics, Data Integration, Big Data, Scalability, Fault Tolerance

Abstract

The rapid growth of data-driven applications has increased the need to integrate data into warehousing systems efficiently and in real time. Streaming Extract, Transform, Load (ETL) pipelines have emerged as a key solution: they ingest, transform, and deliver data continuously, which supports timely analytics and decision-making. This review examines the principles, techniques, tools, and best practices behind streaming architectures, with particular attention to how they provide scalability, elasticity, and fault tolerance in dynamic data ecosystems. Stream processing models, data coordination schemes, and mechanisms for assuring data consistency and quality in near-real-time processes are identified as the most influential factors. The paper also discusses applications of streaming ETL across different fields, where it has been shown to reduce processing latency, improve operational efficiency, and increase the reliability of analytical results. By incorporating recent developments and applications, the survey gives a detailed account of how real-time data integration practices are changing. The findings offer guidance for building adaptive, intelligent, and sustainable data warehousing systems that can meet the growing demands of modern businesses and support next-generation analytics programs.

Published

2021-10-30

Section

Articles

How to Cite

Maniar V, Tamilmani V, Kothamaram RR, Rajendran D, Namburi VD, Singh Singh AA. Review of Streaming ETL Pipelines for Data Warehousing: Tools, Techniques, and Best Practices. IJAIBDCMS [Internet]. 2021 Oct. 30 [cited 2025 Nov. 13];2(3):74-81. Available from: https://ijaibdcms.org/index.php/ijaibdcms/article/view/284