CDFRS: A scalable sampling approach for efficient big data analysis


Bibliographic Details
Published in: Information Processing & Management, Vol. 61, No. 4, p. 103746
Main Authors: Cai, Yongda; Wu, Dingming; Sun, Xudong; Wu, Siyue; Xu, Jingsheng; Huang, Joshua Zhexue
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.07.2024
ISSN: 0306-4573, 1873-5371
DOI: 10.1016/j.ipm.2024.103746

Summary: The sampling-based approximation method has demonstrated its potential in various domains such as machine learning, query processing, and data analysis. Most preceding sampling algorithms generate samples at the record level, making it impractical to apply them to very large datasets on a single machine. Even distributed solutions encounter efficiency issues when dealing with terabyte-scale datasets. In this paper, we introduce a scalable sampling approach named CDFRS, which can generate samples with a distribution-preserving guarantee from extensive datasets. CDFRS is significantly faster than existing sampling algorithms on terabyte-scale datasets. We provide theoretical guarantees and empirical justifications demonstrating that samples generated by CDFRS maintain the distribution characteristics of the original dataset. Additionally, we propose a sample size determination algorithm, denoted as A2. Experimental results indicate that the running time of CDFRS improves on other distributed sampling methods by at least an order of magnitude. Notably, sampling a 10TB dataset with CDFRS takes only hundreds of seconds, while the compared method requires more than ten thousand seconds. In big data analysis tasks such as classification and clustering, models trained with samples generated by CDFRS closely match those trained with the entire training set. Furthermore, the proposed A2 algorithm determines an appropriate sample size more efficiently than traditional methods.

Highlights:
• Propose the CDFRS method for efficiently sampling terabyte-scale datasets.
• Propose the A2 algorithm, which efficiently determines the required sample size.
• Theoretical guarantees confirm the quality of samples generated by CDFRS.
• CDFRS can complete sampling on a 10TB dataset in just hundreds of seconds.
• Models trained with samples closely match those trained with the entire dataset.
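The record does not spell out the CDFRS algorithm itself, but the distribution-preserving property it claims can be checked in a standard way. The sketch below is only illustrative: it draws a plain simple random sample (not CDFRS) from a synthetic dataset and measures how closely the sample's empirical CDF tracks the dataset's using a two-sample Kolmogorov–Smirnov statistic; all names and data here are hypothetical stand-ins.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:  # the max gap occurs at an observed data point
        fa = bisect.bisect_right(a, x) / len(a)  # empirical CDF of a at x
        fb = bisect.bisect_right(b, x) / len(b)  # empirical CDF of b at x
        gap = max(gap, abs(fa - fb))
    return gap

random.seed(0)
population = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # stand-in dataset
sample = random.sample(population, 1_000)                       # simple random sample

# A small KS distance indicates the sample preserves the dataset's distribution.
print(f"KS distance: {ks_statistic(population, sample):.3f}")
```

A distribution-preserving sampler such as CDFRS would be expected to keep this statistic small while avoiding the record-level scan that makes simple random sampling impractical at terabyte scale.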