A Review of Data Placement and Replication Strategies Based on Machine Learning

Bibliographic Details
Published in	Proceedings - International Conference on Parallel and Distributed Systems, pp. 278-285
Main Authors	Najjar, Amir; Mokadem, Riad; Pierson, Jean-Marc
Format	Conference Proceeding
Language	English
Published	IEEE 10.10.2024
Subjects	Costs; Data Placement; Data Replication; Distributed databases; Distributed Systems; Fault tolerance; Fault tolerant systems; Machine Learning; Quality of service; Reinforcement learning; Taxonomy; Time factors; Tuning; Unsupervised learning
ISSN	2690-5965
DOI	10.1109/ICPADS63350.2024.00044

Summary:	The global increase in data volumes has created a need for scalable distributed systems that can provide satisfactory quality of service. Data placement and replication are well-known techniques that improve performance, fault tolerance and availability. These techniques often rely on threshold-based activation mechanisms whose appropriate settings vary with the nature of the workload and the underlying system architecture. Hence, setting and adjusting those thresholds usually requires human intervention. In this context, machine learning offers a promising way to define such thresholds automatically and adapt them to different workloads and architectures. In this paper, we study the data placement and replication strategies proposed in the literature that employ machine learning. We classify such strategies by the machine learning method, the platform on which they are deployed, their dynamicity and the objectives they achieve. We describe the approach applied by each strategy as well as its possible limitations. In addition, we provide insights into the metrics used to evaluate the strategies. We highlight the need to design data placement and replication strategies that respond better to the modern needs of distributed systems. We also motivate the use of machine learning to achieve autonomy in distributed systems.
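
The threshold-based activation mentioned in the summary can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the access-count metric and the mean-plus-deviation rule are hypothetical, and the statistical rule is only a simple stand-in for the learned thresholds that the surveyed strategies would produce.

```python
from statistics import mean, pstdev


def should_replicate(access_count: int, threshold: float) -> bool:
    """Static rule: replicate once accesses in a window exceed the threshold."""
    return access_count > threshold


def adaptive_threshold(history: list[int], k: float = 1.0) -> float:
    """Hypothetical data-driven threshold: mean + k * std of recent access counts.

    A stand-in for a learned threshold; a real strategy might use a
    trained model instead of this fixed statistical rule.
    """
    if not history:
        raise ValueError("history must not be empty")
    return mean(history) + k * pstdev(history)


# Assumed example data: access counts per time window for one data item.
history = [10, 12, 9, 11, 13]
threshold = adaptive_threshold(history)

print(should_replicate(40, threshold))  # a burst well above the usual load
print(should_replicate(11, threshold))  # a typical window
```

With a hand-set constant in place of `adaptive_threshold`, an operator would have to retune the value whenever the workload shifts, which is the manual intervention the summary describes.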