An Empirical Evaluation of the Medallion Architecture on Databricks and Apache Spark with Snowflake: Throughput, Latency, and Cost for Batch and Real-Time Ingestion Patterns

Laxmi Madhu Kumar Brahmandam

doi:10.63282/3050-9416.IJAIBDCMS-V5I3P122

Authors

Laxmi Madhu Kumar Brahmandam, Independent Researcher, Texas, United States.

Keywords:

Medallion Architecture, Apache Spark Structured Streaming, Delta Lake, Snowflake Integration, Change Data Capture, Lakehouse Evaluation

Abstract

Enterprise analytical platforms increasingly combine batch and real-time ingestion within a single architecture, and the Bronze/Silver/Gold medallion pattern has become a widely cited reference design for organizing such pipelines on a lakehouse. Despite this broad adoption, peer-reviewed evidence quantifying throughput, end-to-end latency, and unit cost across heterogeneous ingestion paths remains limited. This paper presents an empirical evaluation of the medallion architecture implemented on Databricks and Apache Spark, with Snowflake serving as the gold-layer warehouse for downstream consumption. The study synthesizes observations from a set of production deployments that ingest file landings via Auto Loader, event streams via Apache Kafka, and database change events via Debezium-based change data capture. The methodology fixes the cluster configuration, defines a steady-state measurement window, and records three metrics across three scale tiers: throughput, p99 end-to-end latency measured from source event time to gold-layer availability, and cost per terabyte ingested. Across the studied configurations, micro-batch Structured Streaming on Delta Lake sustains between 18 and 62 MB/s per active task, with p99 latencies between 9 and 47 seconds and observed unit costs between 1.8 and 6.4 USD per terabyte ingested, depending on the source and tier. The paper also reports schema-evolution and data-quality outcomes that bear on operational trust. The findings inform reference design choices for organizations building hybrid lakehouse-warehouse platforms and are relevant to data-intensive systems engineering more broadly.
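The three metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's measurement harness: the `WindowSample` record, its field names, the nearest-rank percentile definition, and the decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes) are all assumptions made here for illustration, and the abstract does not state how active tasks or attributed cost were actually counted.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class WindowSample:
    """Hypothetical summary of one steady-state measurement window."""
    bytes_ingested: int       # total bytes ingested during the window
    window_seconds: float     # duration of the window in seconds
    active_tasks: int         # active Spark tasks (assumed constant in the window)
    compute_cost_usd: float   # spend attributed to the window (assumed known)


def p99_latency_seconds(event_times: Sequence[float],
                        gold_times: Sequence[float]) -> float:
    """Nearest-rank p99 of (gold-layer availability time - source event time)."""
    if not event_times or len(event_times) != len(gold_times):
        raise ValueError("need two non-empty sequences of equal length")
    latencies = sorted(g - e for e, g in zip(event_times, gold_times))
    # 1-based nearest rank: ceil(0.99 * n), computed with integers to avoid
    # floating-point rounding surprises.
    rank = (99 * len(latencies) + 99) // 100
    return latencies[rank - 1]


def throughput_mb_per_s_per_task(sample: WindowSample) -> float:
    """Sustained throughput in MB/s per active task (decimal megabytes)."""
    megabytes = sample.bytes_ingested / 1e6
    return megabytes / sample.window_seconds / sample.active_tasks


def cost_usd_per_tb(sample: WindowSample) -> float:
    """Unit cost in USD per terabyte ingested (decimal terabytes)."""
    terabytes = sample.bytes_ingested / 1e12
    return sample.compute_cost_usd / terabytes
```

Under these assumptions, each ingestion path (Auto Loader, Kafka, Debezium-based CDC) and scale tier would yield one `WindowSample` plus paired timestamp lists, so the three figures stay comparable across sources. Nearest-rank is only one of several percentile definitions; an interpolating estimator would give slightly different p99 values on small samples.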
