CDFRS: A scalable sampling approach for efficient big data analysis

Bibliographic Details
Published in Information processing & management Vol. 61; no. 4; p. 103746
Main Authors Cai, Yongda, Wu, Dingming, Sun, Xudong, Wu, Siyue, Xu, Jingsheng, Huang, Joshua Zhexue
Format Journal Article
Language English
Published Elsevier Ltd 01.07.2024
ISSN 0306-4573
1873-5371
DOI 10.1016/j.ipm.2024.103746

Abstract The sampling-based approximation method has demonstrated its potential in various domains such as machine learning, query processing, and data analysis. Most preceding sampling algorithms generate samples at the record level, making it impractical to apply them to very large datasets using a single machine. Even distributed solutions encounter efficiency issues when dealing with terabyte-scale datasets. In this paper, we introduce a scalable sampling approach named CDFRS, which can generate samples with a distribution-preserving guarantee from extensive datasets. CDFRS exhibits significantly improved speed compared to existing sampling algorithms when dealing with terabyte-scale datasets. We provide theoretical guarantees and empirical justifications, demonstrating that samples generated by the CDFRS approach maintain the distribution characteristics of the original dataset. Additionally, we propose a sample size determination algorithm, denoted as A2. Experiment results indicate that the running time of CDFRS shows at least an order of magnitude improvement over other distributed sampling methods. Notably, sampling a 10TB dataset using CDFRS only takes hundreds of seconds, while the compared method requires more than ten thousand seconds. In the context of big data analysis, including tasks such as classification and clustering, models trained with samples generated by CDFRS closely match those trained with the entire training set. Furthermore, the proposed A2 algorithm efficiently determines an appropriate sample size compared with traditional methods.
• Propose the CDFRS method for efficiently sampling terabyte-scale datasets.
• Propose the A2 algorithm, which efficiently determines the required sample size.
• Theoretical guarantees confirm the quality of samples generated by CDFRS.
• CDFRS can complete sampling on a 10TB dataset in just hundreds of seconds.
• Models trained with samples closely match those trained with the entire dataset.
ArticleNumber 103746
Author Cai, Yongda
Wu, Dingming
Xu, Jingsheng
Sun, Xudong
Wu, Siyue
Huang, Joshua Zhexue
Author_xml – sequence: 1
  givenname: Yongda
  orcidid: 0000-0002-3321-879X
  surname: Cai
  fullname: Cai, Yongda
  email: caiyongda2021@email.szu.edu.cn
  organization: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China
– sequence: 2
  givenname: Dingming
  orcidid: 0000-0002-7901-9876
  surname: Wu
  fullname: Wu, Dingming
  email: dingming@szu.edu.cn
  organization: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China
– sequence: 3
  givenname: Xudong
  orcidid: 0009-0005-2171-0081
  surname: Sun
  fullname: Sun, Xudong
  email: sunxudong2016@email.szu.edu.cn
  organization: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China
– sequence: 4
  givenname: Siyue
  surname: Wu
  fullname: Wu, Siyue
  email: 2252271005@email.szu.edu.cn
  organization: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China
– sequence: 5
  givenname: Jingsheng
  surname: Xu
  fullname: Xu, Jingsheng
  email: 2210273049@email.szu.edu.cn
  organization: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China
– sequence: 6
  givenname: Joshua Zhexue
  surname: Huang
  fullname: Huang, Joshua Zhexue
  email: zx.huang@szu.edu.cn
  organization: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China
CitedBy_id crossref_primary_10_1016_j_ins_2024_121314
crossref_primary_10_1016_j_knosys_2025_113161
Cites_doi 10.1145/3147.3165
10.14778/3476249.3476262
10.1007/s00158-016-1584-1
10.1016/j.ins.2022.11.108
10.1080/01621459.1962.10480667
10.1016/j.engappai.2024.107934
10.1016/j.eswa.2023.121696
10.1111/rssb.12050
10.1016/S0167-7152(97)00020-5
10.1145/1007568.1007602
10.1145/276305.276343
10.1613/jair.953
10.1609/aaai.v33i01.33013862
10.1016/j.ipm.2021.102758
10.1016/j.jag.2020.102235
10.1145/312129.312188
10.1016/j.ipm.2021.102762
10.1016/j.ipm.2023.103577
10.1016/j.ipm.2023.103326
10.1214/aos/1031689018
10.1016/j.jss.2018.11.007
10.1016/j.engappai.2023.107648
10.1145/3534678.3539377
10.1145/3299869.3300077
10.1016/j.jpdc.2018.04.001
10.1371/journal.pone.0229345
10.1016/j.patcog.2022.109144
10.1016/j.ipm.2023.103271
10.1109/ACCESS.2020.2988120
10.1016/j.ipm.2020.102263
10.14778/3368289.3368302
10.1016/j.ipm.2021.102742
10.1109/TII.2019.2912723
10.1145/3477314.3507311
10.1038/ncomms5308
10.1080/01621459.1963.10500830
10.1023/A:1010933404324
10.26599/BDMA.2019.9020015
10.1145/1327452.1327492
ContentType Journal Article
Copyright 2024 The Author(s)
Copyright_xml – notice: 2024 The Author(s)
DBID 6I.
AAFTH
AAYXX
CITATION
ADTOC
UNPAY
DOI 10.1016/j.ipm.2024.103746
DatabaseName ScienceDirect Open Access Titles
Elsevier:ScienceDirect:Open Access
CrossRef
Unpaywall for CDI: Periodical Content
Unpaywall
DatabaseTitle CrossRef
DatabaseTitleList
Database_xml – sequence: 1
  dbid: UNPAY
  name: Unpaywall
  url: https://proxy.k.utb.cz/login?url=https://unpaywall.org/
  sourceTypes: Open Access Repository
DeliveryMethod fulltext_linktorsrc
Discipline Library & Information Science
ExternalDocumentID 10.1016/j.ipm.2024.103746
10_1016_j_ipm_2024_103746
S0306457324001067
GrantInformation_xml – fundername: Natural Science Foundation of Guangdong Province of China
  grantid: 2023A1515011619
  funderid: http://dx.doi.org/10.13039/501100003453
– fundername: Key Basic Research Foundation of Shenzhen
  grantid: JCYJ20220818100205012
IEDL.DBID UNPAY
ISSN 0306-4573
1873-5371
IngestDate Tue Aug 19 16:26:53 EDT 2025
Thu Apr 24 23:10:43 EDT 2025
Wed Oct 01 01:14:19 EDT 2025
Sat Oct 19 15:54:43 EDT 2024
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 4
Keywords Scalable sampling
Random sample partition
Big data analysis
Block-level sampling
Language English
License This is an open access article under the CC BY-NC-ND license.
cc-by-nc-nd
LinkModel DirectLink
ORCID 0000-0002-3321-879X
0009-0005-2171-0081
0000-0002-7901-9876
OpenAccessLink https://proxy.k.utb.cz/login?url=https://doi.org/10.1016/j.ipm.2024.103746
ParticipantIDs unpaywall_primary_10_1016_j_ipm_2024_103746
crossref_primary_10_1016_j_ipm_2024_103746
crossref_citationtrail_10_1016_j_ipm_2024_103746
elsevier_sciencedirect_doi_10_1016_j_ipm_2024_103746
ProviderPackageCode CITATION
AAYXX
PublicationCentury 2000
PublicationDate July 2024
2024-07-00
PublicationDateYYYYMMDD 2024-07-01
PublicationDate_xml – month: 07
  year: 2024
  text: July 2024
PublicationDecade 2020
PublicationTitle Information processing & management
PublicationYear 2024
Publisher Elsevier Ltd
Publisher_xml – name: Elsevier Ltd
References Pang, Wang, Xia (b38) 2022; 59
Sun, Zhao, Chen, Cai, Wu, Huang (b49) 2024; 129
Nguyen, Shih, Parvathaneni, Xu, Srivastava, Tirthapura (b36) 2020
Jain, Boyapati, Venkatesh, Prakash (b24) 2022; 59
(pp. 3862–3869).
Zogaj, Cambronero, Rinard, Cito (b60) 2021; 14
Singh, Herten, Deschrijver, Couckuyt, Dhaene (b46) 2017; 55
Breiman (b7) 2001; 45
Wang, Zhou, Luo, Li, Cai (b54) 2024; 237
Salloum, Huang, He (b41) 2019; 15
(pp. 1135–1152).
Israel (b23) 1992
Siegel, Wagner (b44) 2022
Fazul, R. W. A., & Barcelos, P. P. (2022). An event-driven strategy for reactive replica balancing on apache hadoop distributed file system. In
Shvachko, Kuang, Radia, Chansler (b43) 2010
(pp. 255–263).
(pp. 367–370).
Chawla, Bowyer, Hall, Kegelmeyer (b10) 2002; 16
Al-Kateb, Lee (b1) 2010
Ledoit, Wolf (b30) 2002; 30
Efron, Tibshirani (b15) 1994
Haas, P. J., Naughton, J. F., Seshadri, S., & Stokes, L. (1995). Sampling-based estimation of the number of distinct values of an attribute. In
(pp. 157–167).
Bagirov, Aliguliyev, Sultanova (b4) 2023; 135
(pp. 287–298).
Wei, Salloum, Emara, Zhang, Huang, He (b55) 2018
(pp. 311–322).
Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In
Liu, Zhang (b32) 2020; 8
Kalavri, Brundza, Vlassov (b28) 2013
Fan, Muller, Rezucha (b17) 1962; 57
Zhang, Ren, Li, Baharin, Alghamdi, Alghamdi (b57) 2023; 60
Singh (b45) 2003
AWvd (b3) 1998
Yang, Jin, Yu, Hashim (b56) 2023; 60
Blatchford, Mannaerts, Zeng (b6) 2021; 94
Pan, Pedrycz, Yang, Wang (b37) 2024; 132
Chen, X., Zhang, F., & Wang, S. (2022). Efficient Approximate Algorithms for Empirical Variance with Hashed Block Sampling. In
He, Huang, Long, Wang, Wei (b20) 2017
Singh, Masuku (b47) 2014; 2
Cunha, Canuto, Viegas, Salles, Gomes, Mangaravite (b12) 2020; 57
Kleiner, Talwalkar, Sarkar, Jordan (b29) 2014
Dean, Ghemawat (b14) 2008; 51
Park, Y., Qing, J., Shen, X., & Mozafari, B. (2019). Blinkml: Efficient maximum likelihood estimation with probabilistic guarantees. In
Chaudhuri, Motwani, Narasayya (b9) 1998; 27
John, G. H., & Langley, P. (1996). Static Versus Dynamic Sampling for Data Mining. In
Huang, S., Wang, C., Ding, B., & Chaudhuri, S. (2019). Efficient identification of approximate best configuration of training in large datasets. In
Chaudhuri, S., Das, G., & Srivastava, U. (2004). Effective use of block-level sampling in statistics estimation. In
Satyanarayana, Davidson (b42) 2005
Li, Wang, Liu, Chen, Chen (b31) 2023; 621
Apache (b2) 2023
Sunter (b50) 1977; 26
Jenkins, Quintana-Ascencio (b25) 2020; 15
Vitter (b52) 1985; 11
Walenz, Sintos, Roy, Yang (b53) 2019; 13
Hoeffding, Wassily (b21) 1963; 58
De Lange, Aljundi, Masana, Parisot, Jia, Leonardis (b13) 2021; 44
(pp. 23–32).
Sun, Ngueilbaye, Luo, Cai, Wu, Huang (b48) 2024; 61
Loosli, Canu, Bottou (b33) 2007
Mahmud, Huang, Salloum, Emara, Sadatdiynov (b34) 2020; 3
Zhang, Zang, Zhu, Uddin, Amin (b58) 2022; 59
Zhou (b59) 2012
Emara, Huang (b16) 2019; 148
Meng (b35) 2013
Justel, Peña, Zamar (b27) 1997; 35
Baldi, Sadowski, Whiteson (b5) 2014; 5
Veiga, Expósito, Taboada, Tourino (b51) 2018; 120
He (10.1016/j.ipm.2024.103746_b20) 2017
Yang (10.1016/j.ipm.2024.103746_b56) 2023; 60
Al-Kateb (10.1016/j.ipm.2024.103746_b1) 2010
Kalavri (10.1016/j.ipm.2024.103746_b28) 2013
Jenkins (10.1016/j.ipm.2024.103746_b25) 2020; 15
Fan (10.1016/j.ipm.2024.103746_b17) 1962; 57
AWvd (10.1016/j.ipm.2024.103746_b3) 1998
10.1016/j.ipm.2024.103746_b40
Salloum (10.1016/j.ipm.2024.103746_b41) 2019; 15
Singh (10.1016/j.ipm.2024.103746_b47) 2014; 2
Singh (10.1016/j.ipm.2024.103746_b46) 2017; 55
Breiman (10.1016/j.ipm.2024.103746_b7) 2001; 45
Israel (10.1016/j.ipm.2024.103746_b23) 1992
Liu (10.1016/j.ipm.2024.103746_b32) 2020; 8
Zhang (10.1016/j.ipm.2024.103746_b57) 2023; 60
Meng (10.1016/j.ipm.2024.103746_b35) 2013
Veiga (10.1016/j.ipm.2024.103746_b51) 2018; 120
Walenz (10.1016/j.ipm.2024.103746_b53) 2019; 13
Chawla (10.1016/j.ipm.2024.103746_b10) 2002; 16
Sun (10.1016/j.ipm.2024.103746_b48) 2024; 61
Sunter (10.1016/j.ipm.2024.103746_b50) 1977; 26
10.1016/j.ipm.2024.103746_b8
Baldi (10.1016/j.ipm.2024.103746_b5) 2014; 5
Blatchford (10.1016/j.ipm.2024.103746_b6) 2021; 94
10.1016/j.ipm.2024.103746_b39
Zogaj (10.1016/j.ipm.2024.103746_b60) 2021; 14
Zhang (10.1016/j.ipm.2024.103746_b58) 2022; 59
De Lange (10.1016/j.ipm.2024.103746_b13) 2021; 44
Pan (10.1016/j.ipm.2024.103746_b37) 2024; 132
Cunha (10.1016/j.ipm.2024.103746_b12) 2020; 57
Loosli (10.1016/j.ipm.2024.103746_b33) 2007
10.1016/j.ipm.2024.103746_b26
Chaudhuri (10.1016/j.ipm.2024.103746_b9) 1998; 27
Ledoit (10.1016/j.ipm.2024.103746_b30) 2002; 30
Satyanarayana (10.1016/j.ipm.2024.103746_b42) 2005
Li (10.1016/j.ipm.2024.103746_b31) 2023; 621
Siegel (10.1016/j.ipm.2024.103746_b44) 2022
Efron (10.1016/j.ipm.2024.103746_b15) 1994
Vitter (10.1016/j.ipm.2024.103746_b52) 1985; 11
10.1016/j.ipm.2024.103746_b22
Jain (10.1016/j.ipm.2024.103746_b24) 2022; 59
Justel (10.1016/j.ipm.2024.103746_b27) 1997; 35
Nguyen (10.1016/j.ipm.2024.103746_b36) 2020
Apache (10.1016/j.ipm.2024.103746_b2) 2023
Sun (10.1016/j.ipm.2024.103746_b49) 2024; 129
Singh (10.1016/j.ipm.2024.103746_b45) 2003
Dean (10.1016/j.ipm.2024.103746_b14) 2008; 51
Mahmud (10.1016/j.ipm.2024.103746_b34) 2020; 3
Shvachko (10.1016/j.ipm.2024.103746_b43) 2010
Zhou (10.1016/j.ipm.2024.103746_b59) 2012
Bagirov (10.1016/j.ipm.2024.103746_b4) 2023; 135
10.1016/j.ipm.2024.103746_b19
Kleiner (10.1016/j.ipm.2024.103746_b29) 2014
Hoeffding (10.1016/j.ipm.2024.103746_b21) 1963; 58
Wang (10.1016/j.ipm.2024.103746_b54) 2024; 237
10.1016/j.ipm.2024.103746_b18
10.1016/j.ipm.2024.103746_b11
Pang (10.1016/j.ipm.2024.103746_b38) 2022; 59
Emara (10.1016/j.ipm.2024.103746_b16) 2019; 148
Wei (10.1016/j.ipm.2024.103746_b55) 2018
References_xml – reference: (pp. 367–370).
– reference: Haas, P. J., Naughton, J. F., Seshadri, S., & Stokes, L. (1995). Sampling-based estimation of the number of distinct values of an attribute. In
– volume: 59
  year: 2022
  ident: b24
  article-title: An intelligent cognitive-inspired computing with big data analytics framework for sentiment analysis and classification
  publication-title: Information Processing & Management
– volume: 60
  year: 2023
  ident: b56
  article-title: Optimized hadoop map reduce system for strong analytics of cloud big product data on amazon web service
  publication-title: Information Processing & Management
– reference: (pp. 255–263).
– volume: 120
  start-page: 323
  year: 2018
  end-page: 338
  ident: b51
  article-title: Enhancing in-memory efficiency for MapReduce-based data processing
  publication-title: Journal of Parallel and Distributed Computing
– reference: Chaudhuri, S., Das, G., & Srivastava, U. (2004). Effective use of block-level sampling in statistics estimation. In
– volume: 44
  start-page: 3366
  year: 2021
  end-page: 3385
  ident: b13
  article-title: A continual learning survey: Defying forgetting in classification tasks
  publication-title: IEEE Transactions on Pattern Analysis and Machine Intelligence
– start-page: 71
  year: 2003
  end-page: 136
  ident: b45
  article-title: Simple random sampling
  publication-title: Advanced sampling theory with applications
– volume: 148
  start-page: 105
  year: 2019
  end-page: 115
  ident: b16
  article-title: A distributed data management system to support large-scale data analysis
  publication-title: Journal of Systems and Software
– volume: 57
  year: 2020
  ident: b12
  article-title: Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling
  publication-title: Information Processing & Management
– volume: 26
  start-page: 261
  year: 1977
  end-page: 268
  ident: b50
  article-title: List sequential sampling with equal or unequal probabilities without replacement
  publication-title: Journal of the Royal Statistical Society. Series C. Applied Statistics
– reference: (pp. 311–322).
– volume: 51
  start-page: 107
  year: 2008
  end-page: 113
  ident: b14
  article-title: MapReduce: simplified data processing on large clusters
  publication-title: Communications of the ACM
– volume: 45
  start-page: 5
  year: 2001
  end-page: 32
  ident: b7
  article-title: Random forests
  publication-title: Machine Learning
– reference: Huang, S., Wang, C., Ding, B., & Chaudhuri, S. (2019). Efficient identification of approximate best configuration of training in large datasets. In
– start-page: 250
  year: 2013
  end-page: 257
  ident: b28
  article-title: Block sampling: Efficient accurate online aggregation in mapreduce
  publication-title: 2013 IEEE 5th international conference on cloud computing technology and science
– volume: 15
  start-page: 5846
  year: 2019
  end-page: 5854
  ident: b41
  article-title: Random sample partition: a distributed data model for big data analysis
  publication-title: IEEE Transactions on Industrial Informatics
– volume: 8
  start-page: 72713
  year: 2020
  end-page: 72726
  ident: b32
  article-title: Sampling for big data profiling: A survey
  publication-title: IEEE Access
– reference: (pp. 3862–3869).
– start-page: 541
  year: 2020
  end-page: 552
  ident: b36
  article-title: Random sampling for group-by queries
  publication-title: 2020 IEEE 36th international conference on data engineering
– volume: 5
  start-page: 4308
  year: 2014
  ident: b5
  article-title: Searching for exotic particles in high-energy physics with deep learning
  publication-title: Nature Communications
– volume: 94
  year: 2021
  ident: b6
  article-title: Determining representative sample size for validation of continuous, large continental remote sensing data
  publication-title: International Journal of Applied Earth Observation and Geoinformation
– volume: 60
  year: 2023
  ident: b57
  article-title: Developing scalable management information system with big financial data using data mart and mining architecture
  publication-title: Information Processing & Management
– volume: 57
  start-page: 387
  year: 1962
  end-page: 402
  ident: b17
  article-title: Development of sampling plans by using sequential (item by item) selection techniques and digital computers
  publication-title: Journal of the American Statistical Association
– volume: 132
  year: 2024
  ident: b37
  article-title: An improved generative adversarial network to oversample imbalanced datasets
  publication-title: Engineering Applications of Artificial Intelligence
– volume: 27
  start-page: 436
  year: 1998
  end-page: 447
  ident: b9
  article-title: Random sampling for histogram construction: How much is enough?
  publication-title: ACM SIGMOD Record
– reference: (pp. 287–298).
– reference: Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In
– year: 2023
  ident: b2
  article-title: Apache spark
– year: 1994
  ident: b15
  article-title: An introduction to the bootstrap
– volume: 237
  year: 2024
  ident: b54
  article-title: Generative adversarial minority enlargement—A local linear over-sampling synthetic method
  publication-title: Expert Systems with Applications
– start-page: 531
  year: 2013
  end-page: 539
  ident: b35
  article-title: Scalable simple random sampling and stratified sampling
  publication-title: International conference on machine learning
– volume: 61
  year: 2024
  ident: b48
  article-title: A scalable and flexible basket analysis system for big transaction data in Spark
  publication-title: Information Processing & Management
– volume: 135
  year: 2023
  ident: b4
  article-title: Finding compact and well-separated clusters: Clustering using silhouette coefficients
  publication-title: Pattern Recognition
– volume: 35
  start-page: 251
  year: 1997
  end-page: 259
  ident: b27
  article-title: A multivariate Kolmogorov-Smirnov test of goodness of fit
  publication-title: Statistics & Probability Letters
– volume: 2
  start-page: 1
  year: 2014
  end-page: 22
  ident: b47
  article-title: Sampling techniques & determination of sample size in applied statistics research: An overview
  publication-title: International Journal of Economics, Commerce and Management
– volume: 15
  year: 2020
  ident: b25
  article-title: A solution to minimum sample size for regressions
  publication-title: PLoS One
– reference: Chen, X., Zhang, F., & Wang, S. (2022). Efficient Approximate Algorithms for Empirical Variance with Hashed Block Sampling. In
– start-page: 301
  year: 2007
  end-page: 320
  ident: b33
  article-title: Training invariant support vector machines using selective sampling
  publication-title: Large scale kernel machines
– reference: (pp. 1135–1152).
– volume: 16
  start-page: 321
  year: 2002
  end-page: 357
  ident: b10
  article-title: SMOTE: synthetic minority over-sampling technique
  publication-title: Journal of Artificial Intelligence Research
– volume: 59
  year: 2022
  ident: b38
  article-title: Information matching model and multi-angle tracking algorithm for loan loss-linking customers based on the family mobile social-contact big data network
  publication-title: Information Processing & Management
– volume: 14
  start-page: 2059
  year: 2021
  end-page: 2072
  ident: b60
  article-title: Doing more with less: characterizing dataset downsampling for AutoML
  publication-title: Proceedings of the VLDB Endowment
– start-page: 1
  year: 1992
  end-page: 5
  ident: b23
  article-title: Determining sample size
  publication-title: a series of the program evaluation and organizational development
– start-page: 631
  year: 2005
  end-page: 640
  ident: b42
  article-title: A dynamic adaptive sampling algorithm (dasa) for real world applications: Finger print recognition and face recognition
  publication-title: International symposium on methodologies for intelligent systems
– start-page: 360
  year: 2017
  end-page: 367
  ident: b20
  article-title: I-sampling: A new block-based sampling method for large-scale dataset
  publication-title: 2017 IEEE international congress on big data
– start-page: 205
  year: 2022
  end-page: 235
  ident: b44
  article-title: Chapter 8 - random sampling: Planning ahead for data gathering
  publication-title: Practical business statistics (eighth edition)
– reference: Park, Y., Qing, J., Shen, X., & Mozafari, B. (2019). Blinkml: Efficient maximum likelihood estimation with probabilistic guarantees. In
– start-page: 1
  year: 2010
  end-page: 10
  ident: b43
  article-title: The hadoop distributed file system
  publication-title: 2010 IEEE 26th symposium on mass storage systems and technologies
– start-page: 347
  year: 2018
  end-page: 364
  ident: b55
  article-title: A two-stage data processing algorithm to generate random sample partitions for big data analysis
  publication-title: International conference on cloud computing
– volume: 55
  start-page: 1425
  year: 2017
  end-page: 1438
  ident: b46
  article-title: A sequential sampling strategy for adaptive classification of computationally expensive data
  publication-title: Structural and Multidisciplinary Optimization
– start-page: 621
  year: 2010
  end-page: 639
  ident: b1
  article-title: Stratified reservoir sampling over heterogeneous data streams
  publication-title: International conference on scientific and statistical database management
– reference: Fazul, R. W. A., & Barcelos, P. P. (2022). An event-driven strategy for reactive replica balancing on apache hadoop distributed file system. In
– reference: (pp. 23–32).
– year: 1998
  ident: b3
  publication-title: Asymptotic statistics
– volume: 3
  start-page: 85
  year: 2020
  end-page: 101
  ident: b34
  article-title: A survey of data partitioning and sampling methods to support big data analysis
  publication-title: Big Data Mining and Analytics
– volume: 11
  start-page: 37
  year: 1985
  end-page: 57
  ident: b52
  article-title: Random sampling with a reservoir
  publication-title: ACM Transactions on Mathematical Software
– year: 2012
  ident: b59
  article-title: Ensemble methods: foundations and algorithms
– volume: 13
  start-page: 390
  year: 2019
  end-page: 402
  ident: b53
  article-title: Learning to sample: counting with complex queries
  publication-title: Proceedings of the VLDB Endowment
– volume: 58
  start-page: 13
  year: 1963
  end-page: 30
  ident: b21
  article-title: Probability inequalities for sums of bounded random variables
  publication-title: Publications of the American Statistical Association
– volume: 129
  year: 2024
  ident: b49
  article-title: Non-MapReduce computing for intelligent big data analysis
  publication-title: Engineering Applications of Artificial Intelligence
– volume: 30
  start-page: 1081
  year: 2002
  end-page: 1102
  ident: b30
  article-title: Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size
  publication-title: The Annals of Statistics
– start-page: 795
  year: 2014
  end-page: 816
  ident: b29
  article-title: A scalable bootstrap for massive data
  publication-title: Journal of the Royal Statistical Society. Series B. Statistical Methodology
– reference: John, G. H., & Langley, P. (1996). Static Versus Dynamic Sampling for Data Mining. In
– volume: 621
  start-page: 371
  year: 2023
  end-page: 388
  ident: b31
  article-title: Subspace-based minority oversampling for imbalance classification
  publication-title: Information Sciences
– reference: (pp. 157–167).
– volume: 59
  year: 2022
  ident: b58
  article-title: Big data-assisted social media analytics for business model for business decision making system competitive analysis
  publication-title: Information Processing & Management
– volume: 11
  start-page: 37
  issue: 1
  year: 1985
  ident: 10.1016/j.ipm.2024.103746_b52
  article-title: Random sampling with a reservoir
  publication-title: ACM Transactions on Mathematical Software
  doi: 10.1145/3147.3165
– start-page: 347
  year: 2018
  ident: 10.1016/j.ipm.2024.103746_b55
  article-title: A two-stage data processing algorithm to generate random sample partitions for big data analysis
– volume: 14
  start-page: 2059
  issue: 11
  year: 2021
  ident: 10.1016/j.ipm.2024.103746_b60
  article-title: Doing more with less: characterizing dataset downsampling for AutoML
  publication-title: Proceedings of the VLDB Endowment
  doi: 10.14778/3476249.3476262
– start-page: 205
  year: 2022
  ident: 10.1016/j.ipm.2024.103746_b44
  article-title: Chapter 8 - random sampling: Planning ahead for data gathering
– volume: 55
  start-page: 1425
  year: 2017
  ident: 10.1016/j.ipm.2024.103746_b46
  article-title: A sequential sampling strategy for adaptive classification of computationally expensive data
  publication-title: Structural and Multidisciplinary Optimization
  doi: 10.1007/s00158-016-1584-1
– start-page: 621
  year: 2010
  ident: 10.1016/j.ipm.2024.103746_b1
  article-title: Stratified reservoir sampling over heterogeneous data streams
– volume: 621
  start-page: 371
  year: 2023
  ident: 10.1016/j.ipm.2024.103746_b31
  article-title: Subspace-based minority oversampling for imbalance classification
  publication-title: Information Sciences
  doi: 10.1016/j.ins.2022.11.108
– volume: 2
  start-page: 1
  issue: 11
  year: 2014
  ident: 10.1016/j.ipm.2024.103746_b47
  article-title: Sampling techniques & determination of sample size in applied statistics research: An overview
  publication-title: International Journal of Economics, Commerce and Management
– start-page: 1
  year: 2010
  ident: 10.1016/j.ipm.2024.103746_b43
  article-title: The hadoop distributed file system
– year: 2023
  ident: 10.1016/j.ipm.2024.103746_b2
– year: 2012
  ident: 10.1016/j.ipm.2024.103746_b59
– start-page: 1
  year: 1992
  ident: 10.1016/j.ipm.2024.103746_b23
  article-title: Determining sample size
– volume: 57
  start-page: 387
  issue: 298
  year: 1962
  ident: 10.1016/j.ipm.2024.103746_b17
  article-title: Development of sampling plans by using sequential (item by item) selection techniques and digital computers
  publication-title: Journal of the American Statistical Association
  doi: 10.1080/01621459.1962.10480667
– start-page: 301
  year: 2007
  ident: 10.1016/j.ipm.2024.103746_b33
  article-title: Training invariant support vector machines using selective sampling
– volume: 132
  year: 2024
  ident: 10.1016/j.ipm.2024.103746_b37
  article-title: An improved generative adversarial network to oversample imbalanced datasets
  publication-title: Engineering Applications of Artificial Intelligence
  doi: 10.1016/j.engappai.2024.107934
– volume: 237
  year: 2024
  ident: 10.1016/j.ipm.2024.103746_b54
  article-title: Generative adversarial minority enlargement—A local linear over-sampling synthetic method
  publication-title: Expert Systems with Applications
  doi: 10.1016/j.eswa.2023.121696
– start-page: 250
  year: 2013
  ident: 10.1016/j.ipm.2024.103746_b28
  article-title: Block sampling: Efficient accurate online aggregation in MapReduce
– start-page: 795
  year: 2014
  ident: 10.1016/j.ipm.2024.103746_b29
  article-title: A scalable bootstrap for massive data
  publication-title: Journal of the Royal Statistical Society. Series B. Statistical Methodology
  doi: 10.1111/rssb.12050
– volume: 35
  start-page: 251
  issue: 3
  year: 1997
  ident: 10.1016/j.ipm.2024.103746_b27
  article-title: A multivariate Kolmogorov-Smirnov test of goodness of fit
  publication-title: Statistics & Probability Letters
  doi: 10.1016/S0167-7152(97)00020-5
– ident: 10.1016/j.ipm.2024.103746_b8
  doi: 10.1145/1007568.1007602
– volume: 27
  start-page: 436
  issue: 2
  year: 1998
  ident: 10.1016/j.ipm.2024.103746_b9
  article-title: Random sampling for histogram construction: How much is enough?
  publication-title: ACM SIGMOD Record
  doi: 10.1145/276305.276343
– volume: 16
  start-page: 321
  year: 2002
  ident: 10.1016/j.ipm.2024.103746_b10
  article-title: SMOTE: synthetic minority over-sampling technique
  publication-title: Journal of Artificial Intelligence Research
  doi: 10.1613/jair.953
– ident: 10.1016/j.ipm.2024.103746_b26
– ident: 10.1016/j.ipm.2024.103746_b19
– ident: 10.1016/j.ipm.2024.103746_b22
  doi: 10.1609/aaai.v33i01.33013862
– volume: 59
  issue: 1
  year: 2022
  ident: 10.1016/j.ipm.2024.103746_b24
  article-title: An intelligent cognitive-inspired computing with big data analytics framework for sentiment analysis and classification
  publication-title: Information Processing & Management
  doi: 10.1016/j.ipm.2021.102758
– volume: 94
  year: 2021
  ident: 10.1016/j.ipm.2024.103746_b6
  article-title: Determining representative sample size for validation of continuous, large continental remote sensing data
  publication-title: International Journal of Applied Earth Observation and Geoinformation
  doi: 10.1016/j.jag.2020.102235
– start-page: 360
  year: 2017
  ident: 10.1016/j.ipm.2024.103746_b20
  article-title: I-sampling: A new block-based sampling method for large-scale dataset
– ident: 10.1016/j.ipm.2024.103746_b40
  doi: 10.1145/312129.312188
– volume: 26
  start-page: 261
  issue: 3
  year: 1977
  ident: 10.1016/j.ipm.2024.103746_b50
  article-title: List sequential sampling with equal or unequal probabilities without replacement
  publication-title: Journal of the Royal Statistical Society. Series C. Applied Statistics
– year: 1994
  ident: 10.1016/j.ipm.2024.103746_b15
– volume: 59
  issue: 1
  year: 2022
  ident: 10.1016/j.ipm.2024.103746_b58
  article-title: Big data-assisted social media analytics for business model for business decision making system competitive analysis
  publication-title: Information Processing & Management
  doi: 10.1016/j.ipm.2021.102762
– start-page: 71
  year: 2003
  ident: 10.1016/j.ipm.2024.103746_b45
  article-title: Simple random sampling
– volume: 61
  issue: 2
  year: 2024
  ident: 10.1016/j.ipm.2024.103746_b48
  article-title: A scalable and flexible basket analysis system for big transaction data in Spark
  publication-title: Information Processing & Management
  doi: 10.1016/j.ipm.2023.103577
– volume: 60
  issue: 3
  year: 2023
  ident: 10.1016/j.ipm.2024.103746_b57
  article-title: Developing scalable management information system with big financial data using data mart and mining architecture
  publication-title: Information Processing & Management
  doi: 10.1016/j.ipm.2023.103326
– volume: 30
  start-page: 1081
  issue: 4
  year: 2002
  ident: 10.1016/j.ipm.2024.103746_b30
  article-title: Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size
  publication-title: The Annals of Statistics
  doi: 10.1214/aos/1031689018
– volume: 148
  start-page: 105
  year: 2019
  ident: 10.1016/j.ipm.2024.103746_b16
  article-title: A distributed data management system to support large-scale data analysis
  publication-title: Journal of Systems and Software
  doi: 10.1016/j.jss.2018.11.007
– volume: 129
  year: 2024
  ident: 10.1016/j.ipm.2024.103746_b49
  article-title: Non-MapReduce computing for intelligent big data analysis
  publication-title: Engineering Applications of Artificial Intelligence
  doi: 10.1016/j.engappai.2023.107648
– ident: 10.1016/j.ipm.2024.103746_b11
  doi: 10.1145/3534678.3539377
– ident: 10.1016/j.ipm.2024.103746_b39
  doi: 10.1145/3299869.3300077
– start-page: 531
  year: 2013
  ident: 10.1016/j.ipm.2024.103746_b35
  article-title: Scalable simple random sampling and stratified sampling
– volume: 120
  start-page: 323
  year: 2018
  ident: 10.1016/j.ipm.2024.103746_b51
  article-title: Enhancing in-memory efficiency for MapReduce-based data processing
  publication-title: Journal of Parallel and Distributed Computing
  doi: 10.1016/j.jpdc.2018.04.001
– volume: 15
  issue: 2
  year: 2020
  ident: 10.1016/j.ipm.2024.103746_b25
  article-title: A solution to minimum sample size for regressions
  publication-title: PLoS One
  doi: 10.1371/journal.pone.0229345
– volume: 135
  year: 2023
  ident: 10.1016/j.ipm.2024.103746_b4
  article-title: Finding compact and well-separated clusters: Clustering using silhouette coefficients
  publication-title: Pattern Recognition
  doi: 10.1016/j.patcog.2022.109144
– volume: 60
  issue: 3
  year: 2023
  ident: 10.1016/j.ipm.2024.103746_b56
  article-title: Optimized Hadoop map reduce system for strong analytics of cloud big product data on Amazon web service
  publication-title: Information Processing & Management
  doi: 10.1016/j.ipm.2023.103271
– volume: 8
  start-page: 72713
  year: 2020
  ident: 10.1016/j.ipm.2024.103746_b32
  article-title: Sampling for big data profiling: A survey
  publication-title: IEEE Access
  doi: 10.1109/ACCESS.2020.2988120
– start-page: 541
  year: 2020
  ident: 10.1016/j.ipm.2024.103746_b36
  article-title: Random sampling for group-by queries
– volume: 57
  issue: 4
  year: 2020
  ident: 10.1016/j.ipm.2024.103746_b12
  article-title: Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling
  publication-title: Information Processing & Management
  doi: 10.1016/j.ipm.2020.102263
– volume: 13
  start-page: 390
  issue: 3
  year: 2019
  ident: 10.1016/j.ipm.2024.103746_b53
  article-title: Learning to sample: counting with complex queries
  publication-title: Proceedings of the VLDB Endowment
  doi: 10.14778/3368289.3368302
– volume: 59
  issue: 1
  year: 2022
  ident: 10.1016/j.ipm.2024.103746_b38
  article-title: Information matching model and multi-angle tracking algorithm for loan loss-linking customers based on the family mobile social-contact big data network
  publication-title: Information Processing & Management
  doi: 10.1016/j.ipm.2021.102742
– volume: 15
  start-page: 5846
  issue: 11
  year: 2019
  ident: 10.1016/j.ipm.2024.103746_b41
  article-title: Random sample partition: a distributed data model for big data analysis
  publication-title: IEEE Transactions on Industrial Informatics
  doi: 10.1109/TII.2019.2912723
– start-page: 631
  year: 2005
  ident: 10.1016/j.ipm.2024.103746_b42
  article-title: A dynamic adaptive sampling algorithm (dasa) for real world applications: Finger print recognition and face recognition
– year: 1998
  ident: 10.1016/j.ipm.2024.103746_b3
– ident: 10.1016/j.ipm.2024.103746_b18
  doi: 10.1145/3477314.3507311
– volume: 5
  start-page: 4308
  issue: 1
  year: 2014
  ident: 10.1016/j.ipm.2024.103746_b5
  article-title: Searching for exotic particles in high-energy physics with deep learning
  publication-title: Nature Communications
  doi: 10.1038/ncomms5308
– volume: 58
  start-page: 13
  issue: 301
  year: 1963
  ident: 10.1016/j.ipm.2024.103746_b21
  article-title: Probability inequalities for sums of bounded random variables
  publication-title: Journal of the American Statistical Association
  doi: 10.1080/01621459.1963.10500830
– volume: 45
  start-page: 5
  year: 2001
  ident: 10.1016/j.ipm.2024.103746_b7
  article-title: Random forests
  publication-title: Machine Learning
  doi: 10.1023/A:1010933404324
– volume: 3
  start-page: 85
  issue: 2
  year: 2020
  ident: 10.1016/j.ipm.2024.103746_b34
  article-title: A survey of data partitioning and sampling methods to support big data analysis
  publication-title: Big Data Mining and Analytics
  doi: 10.26599/BDMA.2019.9020015
– volume: 51
  start-page: 107
  issue: 1
  year: 2008
  ident: 10.1016/j.ipm.2024.103746_b14
  article-title: MapReduce: simplified data processing on large clusters
  publication-title: Communications of the ACM
  doi: 10.1145/1327452.1327492
– volume: 44
  start-page: 3366
  issue: 7
  year: 2021
  ident: 10.1016/j.ipm.2024.103746_b13
  article-title: A continual learning survey: Defying forgetting in classification tasks
  publication-title: IEEE Transactions on Pattern Analysis and Machine Intelligence
StartPage 103746
SubjectTerms Big data analysis
Block-level sampling
Random sample partition
Scalable sampling
Title CDFRS: A scalable sampling approach for efficient big data analysis
URI https://dx.doi.org/10.1016/j.ipm.2024.103746
https://doi.org/10.1016/j.ipm.2024.103746
Volume 61