Data Versioning for Iterative Refinement: Adapting ML Experiment Tracking Tools for Data-Centric AI Pipelines

Authors

  • Rajani Kumari Vaddepalli, Frisco, Texas, USA

DOI:

https://doi.org/10.63282/3050-9416.IJAIBDCMS-V5I3P113

Keywords:

Data-centric AI, dataset versioning, ML experiment tracking, reproducibility, data provenance, iterative refinement, MLflow, Weights & Biases, data debugging, pipeline automation

Abstract

The increasing emphasis on data-centric AI has highlighted the need for systematic approaches to managing evolving datasets in machine learning (ML) pipelines. While ML experiment tracking tools such as MLflow and Weights & Biases (W&B) excel at versioning models and hyperparameters, they lack robust mechanisms for tracking dataset iterations, such as corrections, augmentations, and subset selections, that are critical in data-centric workflows. This paper bridges that gap by proposing a framework that extends existing ML experiment tracking paradigms to support data versioning, enabling reproducibility, auditability, and iterative refinement in data-centric AI. We draw inspiration from two key works: (1) "Dataset Versioning for Machine Learning: A Survey" (2023), which formalizes the challenges of tracking dataset evolution, and (2) "DataFed: Towards Reproducible Deep Learning via Reliable Data Management" (2022), which introduces a federated data versioning system for large-scale ML. Our framework adapts these principles to integrate seamlessly with popular ML tracking tools, introducing data diffs (fine-grained change logs), provenance graphs (to track transformations), and conditional triggering (to automate pipeline stages based on data updates).
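To make the data-diff idea concrete, the minimal Python sketch below computes a row-level change log between two dataset revisions and attaches it to an MLflow run. It uses only standard pandas and MLflow calls (mlflow.start_run, mlflow.log_param, mlflow.log_dict); the helpers row_hashes and data_diff and the hashing scheme are illustrative assumptions, not the paper's released implementation.

import hashlib

import mlflow
import pandas as pd

def row_hashes(df: pd.DataFrame) -> set:
    # Hash each row so two versions can be compared without storing full copies.
    # (Illustrative scheme; the paper's actual diff format may differ.)
    return {
        hashlib.sha256(row.to_json().encode()).hexdigest()
        for _, row in df.iterrows()
    }

def data_diff(old: pd.DataFrame, new: pd.DataFrame) -> dict:
    # Fine-grained change log: rows added to / removed from the dataset.
    old_h, new_h = row_hashes(old), row_hashes(new)
    return {
        "rows_added": len(new_h - old_h),
        "rows_removed": len(old_h - new_h),
        "rows_unchanged": len(old_h & new_h),
    }

# Example: record a dataset revision as a first-class artifact of an MLflow run.
v1 = pd.DataFrame({"amount": [10.0, 25.5], "label": [0, 1]})
v2 = pd.DataFrame({"amount": [10.0, 25.5, 99.9], "label": [0, 0, 1]})  # label fix + new row

with mlflow.start_run(run_name="dataset-revision"):
    mlflow.log_param("dataset_version", "v2")
    mlflow.log_dict(data_diff(v1, v2), "data_diff.json")  # change log travels with the run

Because the diff is logged as a run artifact, it participates in the same comparison and lineage views the tracker already provides for models and hyperparameters.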

We evaluate our approach on three real-world case studies: (a) a financial fraud detection system where transaction datasets are frequently revised, (b) a medical imaging pipeline with iterative label corrections, and (c) a recommendation engine with dynamic user feedback integration. Results show that our method reduces dataset reproducibility errors by 62% compared to ad-hoc versioning (e.g., manual CSV backups) while adding minimal overhead (<5% runtime penalty) to existing ML workflows. Additionally, we demonstrate how our framework enables data debugging by tracing model performance regressions to specific dataset changes, a capability absent in current model-centric tools. This work contributes: (1) a methodology for adapting ML experiment trackers to handle dataset versioning, (2) an open-source implementation compatible with MLflow and W&B, and (3) empirical validation of its benefits across diverse domains. Our findings advocate for treating data as a first-class artifact in ML pipelines, aligning with the broader shift toward data-centric AI.
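As an illustration of the data-debugging workflow, the sketch below bisects an ordered list of dataset versions to locate the first one at which a model metric regressed. It assumes the regression, once introduced, persists in every later version; the evaluate callable (e.g., a train_and_score function that retrains on a given version) is hypothetical and stands in for whatever evaluation the pipeline provides.

from typing import Callable, Sequence

def first_bad_version(versions: Sequence[str],
                      evaluate: Callable[[str], float],
                      baseline: float,
                      tolerance: float = 0.01) -> str:
    # Binary-search chronologically ordered dataset versions for the first one
    # whose metric falls below the baseline. Sketch only: assumes at least one
    # version regresses and that the regression persists once introduced.
    lo, hi = 0, len(versions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if evaluate(versions[mid]) >= baseline - tolerance:
            lo = mid + 1  # this version still meets the baseline; look later
        else:
            hi = mid      # regression already present here; look earlier
    return versions[lo]

# Hypothetical usage, where train_and_score retrains and returns accuracy:
# culprit = first_bad_version(["v1", "v2", "v3", "v4"], train_and_score, baseline=0.93)

Pairing this search with the logged data diffs narrows a regression from "some version broke the model" to the specific rows added, removed, or relabeled in the culprit revision.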


Published

2024-10-30

Section

Articles

How to Cite

Vaddepalli RK. Data Versioning for Iterative Refinement: Adapting ML Experiment Tracking Tools for Data-Centric AI Pipelines. IJAIBDCMS [Internet]. 2024 Oct. 30 [cited 2025 Sep. 13];5(3):129-37. Available from: https://ijaibdcms.org/index.php/ijaibdcms/article/view/215