Real Time Data Synchronization and Historical Tracking using AWS Data Migration Service and Databricks
DOI:
https://doi.org/10.63282/3050-9416.IJAIBDCMS-V7I2P128
Keywords:
Real-Time Data Synchronization, Historical Data Tracking, AWS Data Migration Service (AWS DMS), Databricks, Cloud Data Integration, Change Data Capture (CDC), Data Replication, Big Data Analytics, ETL Automation, Data Lakehouse Architecture, Stream Processing, Data Engineering, Incremental Data Loading, Enterprise Data Migration, Scalable Data Pipeline, Data Governance
Abstract
Demand is growing for data synchronization between databases to support workload separation, disaster recovery, historical tracking, and other use cases. This article proposes a solution that uses AWS Database Migration Service (AWS DMS) to migrate data from source databases into Amazon S3 buckets, where it is stored as Parquet files. A Databricks Spark ingestion pipeline then applies schema conversion and Slowly Changing Dimension (SCD) Type 2 transformations to the data in these files and appends the result to a Unity Catalog (UC) table. Finally, the UC table data is loaded into various types of target databases. Compared with a solution built on Amazon EMR, which provides the computing infrastructure, together with Sqoop for data extraction and Apache Spark, the proposed solution provides better schema validation and higher availability, and requires much less configuration and maintenance. It makes effective use of AWS DMS features such as broad database support, serverless compute, elastic scaling, and continuous replication, along with the cost-effective, self-healing, historical tracking, and parallelization features of Databricks ingestion pipelines.
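The SCD Type 2 step described in the abstract can be illustrated with a minimal sketch in plain Python rather than Spark. It assumes each change record carries an operation flag `Op` with values `I`, `U`, or `D` (the convention AWS DMS uses for change files written to S3), plus hypothetical `key`, `ts`, and `attrs` fields. The names `HistoryRow` and `apply_cdc` are illustrative only and are not taken from the article's pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoryRow:
    """One version of a source row in the history table (hypothetical layout)."""
    key: int
    attrs: dict
    valid_from: str                  # ISO-8601 timestamp when this version began
    valid_to: Optional[str] = None   # None means the version is still open
    is_current: bool = True


def apply_cdc(history, changes):
    """Apply change records to a history list using SCD Type 2 rules.

    Every change closes the current version of its key (if one exists).
    Inserts and updates then append a new current version; deletes do not.
    """
    current = {row.key: row for row in history if row.is_current}
    # ISO-8601 strings sort chronologically, so a plain sort orders the changes.
    for change in sorted(changes, key=lambda c: c["ts"]):
        previous = current.pop(change["key"], None)
        if previous is not None:
            previous.valid_to = change["ts"]
            previous.is_current = False
        if change["Op"] in ("I", "U"):
            row = HistoryRow(change["key"], change["attrs"], change["ts"])
            history.append(row)
            current[row.key] = row
    return history


changes = [
    {"Op": "I", "key": 1, "ts": "2024-01-01T00:00:00", "attrs": {"city": "Pune"}},
    {"Op": "U", "key": 1, "ts": "2024-02-01T00:00:00", "attrs": {"city": "Delhi"}},
    {"Op": "D", "key": 1, "ts": "2024-03-01T00:00:00", "attrs": {}},
]
history = apply_cdc([], changes)
```

In this sketch the insert and the update each produce one history row, and the delete only closes the last open version, so earlier values remain queryable by their validity interval. A Spark pipeline would typically express the same close-and-append logic as a merge against the target table instead of mutating rows in memory.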