Improving Data Quality in Big Data Systems: Best Practices and Automation Strategies
DOI:
https://doi.org/10.63282/3050-9416.ICAIDSCT26-141
Keywords:
Data Quality, Big Data, Automation, Data Observability, Governance, Policy-As-Code, Anomaly Detection, Data Validation, Lineage, Monitoring
Abstract
Data quality is a foundational requirement for trustworthy analytics, reliable machine learning, and compliant data operations in large-scale environments characterized by high volume, variety, and velocity. Unlike traditional BI stacks, where errors can often be traced and fixed manually, big data systems amplify quality defects through distributed processing, schema drift, and multi-team transformation layers. This article synthesizes best practices for improving data quality across ingestion, processing, storage, and consumption, with an emphasis on four automation strategies: shift-left validation, continuous monitoring, policy-as-code governance, and observable quality service-level objectives (SLOs). It further connects operational observability (logs, metrics, and traces) to measurable data quality outcomes and proposes a pragmatic automation blueprint built on modern orchestration and monitoring patterns. The discussion is grounded in established big data quality research and industry guidance, and it is aligned with modern governance automation architectures and data pipeline observability techniques [1]–[4].
References
1. I. Taleb, M. A. Serhani, C. Bouhaddioui, et al., "Big data quality framework: a holistic approach to continuous quality management," Journal of Big Data, vol. 8, art. 76, 2021. https://doi.org/10.1186/s40537-021-00468-0
2. G. Lawton, "Data quality for big data: Why it's a must and how to improve it," TechTarget, Apr. 27, 2021.
3. W. Harris, "How to improve data quality: 8 steps and best practices," Metaplane, Feb. 19, 2025.
4. U. Nayak, "Automated Data Governance and Compliance Monitoring using AI & Big Data," IJIRMPS, vol. 13, no. 4, pp. 1–3, 2025.
5. U. Nayak, "Best Practices for Logging and Monitoring Big Data Pipelines with Grafana," IJMRGE, vol. 6, no. 3, pp. 835–836, 2025.
6. T. Nguyen, H.-T. Nguyen, and T.-A. Nguyen-Hoang, "Data quality management in big data: Strategies, tools, and educational implications," Journal of Parallel and Distributed Computing, vol. 200, 2025.
7. Tikean, "Data Quality in Big Data: Strategies for Consistency and Scalability," Tikean Blog, Sep. 16, 2024.
8. Datafold, "How to improve data quality: Practical strategies," Datafold Guide, May 28, 2024.
9. Celigo, "Data quality best practices for successful integration and automation," Celigo Blog, Feb. 5, 2025.
10. IBM, "6 Pillars of Data Quality and How to Improve Your Data," IBM Tutorials, 2024.