Unified Data Lake for Multi-modal Healthcare data using Hadoop and MongoDB

Authors

  • Appala Nooka Kumar Doodala, Manager, Quality Assurance at Cognizant, USA.

DOI:

https://doi.org/10.63282/3050-9416.IJAIBDCMS-V5I1P119

Keywords:

Healthcare Data, Data Lake, Hadoop, MongoDB, Multimodal Data Integration, Big Data Analytics, Unified Architecture, Medical Imaging, EHR, IoT

Abstract

Healthcare systems generate large volumes of data in diverse forms, including electronic health records, medical images, gene sequences, and sensor data. These data are usually stored in separate silos, which makes integrated analysis and decision-making difficult. This paper presents the architecture of a unified data lake that uses Hadoop and MongoDB to store, manage, and efficiently process multi-modal healthcare data. Hadoop's distributed storage and parallel computing handle large volumes of both structured and unstructured data, while MongoDB offers a flexible schema design and fast querying for semi-structured clinical and diagnostic data. Integrating the two technologies forms a stable data ecosystem in which different categories of healthcare data can be accessed in real time, queried efficiently, and made interoperable. By consolidating different sources into a single platform, the proposed system improves healthcare organizations' ability to perform comprehensive analytics, extract insights, and support precision medicine programs. The experimental results indicate that the unified data lake outperforms conventional relational approaches in scalability, fault tolerance, and query response time. Besides simplifying data management and analytics, the unified data lake is also a step towards integrating new data sources such as the outputs of wearable devices and AI-generated diagnostic insights. Future work will involve implementing advanced data governance, privacy-preserving analytics, and machine learning pipelines to unlock clinical intelligence and improve patient outcomes through unified healthcare data.
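The division of labour described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that large binary modalities (imaging, genomics) are placed in Hadoop storage and semi-structured modalities (EHR, sensor readings) in MongoDB; the abstract does not state the actual routing rule. The names `STORE_BY_MODALITY`, `Record`, `route`, and `build_catalog` are hypothetical, and the sketch only decides where a record would go without connecting to either system.

```python
from dataclasses import dataclass

# Hypothetical modality-to-store mapping; the abstract does not specify one.
STORE_BY_MODALITY = {
    "ehr": "mongodb",      # semi-structured clinical documents
    "sensor": "mongodb",   # wearable / IoT readings
    "imaging": "hdfs",     # large binary image files
    "genomics": "hdfs",    # large sequence files
}


@dataclass(frozen=True)
class Record:
    patient_id: str
    modality: str
    payload_ref: str  # a document id or a file path; illustrative only


def route(record: Record) -> str:
    """Return the name of the store assumed to hold this record."""
    try:
        return STORE_BY_MODALITY[record.modality]
    except KeyError:
        raise ValueError(f"unknown modality: {record.modality!r}")


def build_catalog(records):
    """Group record locations by patient so one lookup spans both stores."""
    catalog = {}
    for rec in records:
        catalog.setdefault(rec.patient_id, []).append((route(rec), rec.payload_ref))
    return catalog
```

A per-patient catalog of this kind is one simple way to present data held in two different stores as a single platform; a real deployment would also need the governance and privacy controls that the abstract lists as future work.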

Published

2024-03-30

Section

Articles

How to Cite

1. Kumar Doodala AN. Unified Data Lake for Multi-modal Healthcare data using Hadoop and MongoDB. IJAIBDCMS [Internet]. 2024 Mar. 30 [cited 2026 Apr. 16];5(1):189-97. Available from: https://ijaibdcms.org/index.php/ijaibdcms/article/view/516