CDFRS: A scalable sampling approach for efficient big data analysis
The sampling-based approximation method has demonstrated its potential in various domains such as machine learning, query processing, and data analysis. Most preceding sampling algorithms generate samples at the record level, making it impractical to apply them to very large datasets using a single...
| Published in | Information processing & management, Vol. 61, no. 4, p. 103746 | 
|---|---|
| Main Authors | Cai, Yongda; Wu, Dingming; Sun, Xudong; Wu, Siyue; Xu, Jingsheng; Huang, Joshua Zhexue | 
| Format | Journal Article | 
| Language | English | 
| Published | Elsevier Ltd, 01.07.2024 | 
| ISSN | 0306-4573 1873-5371  | 
| DOI | 10.1016/j.ipm.2024.103746 | 
| Abstract | Sampling-based approximation has demonstrated its potential in domains such as machine learning, query processing, and data analysis. Most existing sampling algorithms generate samples at the record level, which makes them impractical for very large datasets on a single machine. Even distributed solutions encounter efficiency issues when dealing with terabyte-scale datasets. In this paper, we introduce a scalable sampling approach named CDFRS, which can generate samples with a distribution-preserving guarantee from extensive datasets. CDFRS is significantly faster than existing sampling algorithms on terabyte-scale datasets. We provide theoretical guarantees and empirical justifications, demonstrating that samples generated by CDFRS maintain the distribution characteristics of the original dataset. Additionally, we propose a sample size determination algorithm, denoted as A2. Experimental results indicate that the running time of CDFRS is at least an order of magnitude lower than that of other distributed sampling methods. Notably, sampling a 10 TB dataset with CDFRS takes only hundreds of seconds, while the compared method requires more than ten thousand seconds. In big data analysis tasks such as classification and clustering, models trained on samples generated by CDFRS closely match those trained on the entire training set. Furthermore, the proposed A2 algorithm determines an appropriate sample size more efficiently than traditional methods.
• Propose the CDFRS method for efficiently sampling terabyte-scale datasets. • Propose the A2 algorithm, which efficiently determines the required sample size. • Theoretical guarantees confirm the quality of samples generated by CDFRS. • CDFRS can complete sampling on a 10 TB dataset in just hundreds of seconds. • Models trained with samples closely match those trained with the entire dataset. | 
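
The abstract contrasts record-level sampling with the block-level sampling named in the article's keywords. The sketch below is a generic illustration of that contrast, not the CDFRS algorithm itself, whose details this record does not give. It assumes the data has already been shuffled and split into blocks so that each block is itself a random sample; the function names `make_random_blocks` and `block_level_sample` are hypothetical.

```python
import random
from typing import List, Sequence


def make_random_blocks(records: Sequence[float], num_blocks: int, seed: int = 0) -> List[List[float]]:
    """Shuffle the records once, then deal them into num_blocks blocks.

    Hypothetical helper: after a global shuffle, every block is a random
    subset of the full dataset.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Stride slicing deals the shuffled records round-robin into the blocks.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def block_level_sample(blocks: List[List[float]], num_selected: int, seed: int = 0) -> List[float]:
    """Select whole blocks at random (without replacement) and concatenate them.

    Only the chosen blocks are read, instead of scanning every record.
    """
    rng = random.Random(seed)
    chosen = rng.sample(blocks, num_selected)
    return [value for block in chosen for value in block]


if __name__ == "__main__":
    data = [float(i) for i in range(10_000)]
    blocks = make_random_blocks(data, num_blocks=100, seed=42)
    sample = block_level_sample(blocks, num_selected=5, seed=7)
    # Compare the sample mean with the full-data mean as a rough sanity check.
    print(len(sample), sum(sample) / len(sample), sum(data) / len(data))
```

In this toy setting the cost of drawing a sample depends on the number of selected blocks rather than on the total number of records, which is the general motivation for sampling at the block level.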
    
|---|---|
    
| ArticleNumber | 103746 | 
    
| Author | Cai, Yongda; Wu, Dingming; Xu, Jingsheng; Sun, Xudong; Wu, Siyue; Huang, Joshua Zhexue | 
    
| Author_xml | – sequence: 1 givenname: Yongda orcidid: 0000-0002-3321-879X surname: Cai fullname: Cai, Yongda email: caiyongda2021@email.szu.edu.cn organization: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China – sequence: 2 givenname: Dingming orcidid: 0000-0002-7901-9876 surname: Wu fullname: Wu, Dingming email: dingming@szu.edu.cn organization: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China – sequence: 3 givenname: Xudong orcidid: 0009-0005-2171-0081 surname: Sun fullname: Sun, Xudong email: sunxudong2016@email.szu.edu.cn organization: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China – sequence: 4 givenname: Siyue surname: Wu fullname: Wu, Siyue email: 2252271005@email.szu.edu.cn organization: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China – sequence: 5 givenname: Jingsheng surname: Xu fullname: Xu, Jingsheng email: 2210273049@email.szu.edu.cn organization: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China – sequence: 6 givenname: Joshua Zhexue surname: Huang fullname: Huang, Joshua Zhexue email: zx.huang@szu.edu.cn organization: National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, 518060, China  | 
    
    
| CitedBy_id | crossref_primary_10_1016_j_ins_2024_121314 crossref_primary_10_1016_j_knosys_2025_113161  | 
    
| Cites_doi | 10.1145/3147.3165 10.14778/3476249.3476262 10.1007/s00158-016-1584-1 10.1016/j.ins.2022.11.108 10.1080/01621459.1962.10480667 10.1016/j.engappai.2024.107934 10.1016/j.eswa.2023.121696 10.1111/rssb.12050 10.1016/S0167-7152(97)00020-5 10.1145/1007568.1007602 10.1145/276305.276343 10.1613/jair.953 10.1609/aaai.v33i01.33013862 10.1016/j.ipm.2021.102758 10.1016/j.jag.2020.102235 10.1145/312129.312188 10.1016/j.ipm.2021.102762 10.1016/j.ipm.2023.103577 10.1016/j.ipm.2023.103326 10.1214/aos/1031689018 10.1016/j.jss.2018.11.007 10.1016/j.engappai.2023.107648 10.1145/3534678.3539377 10.1145/3299869.3300077 10.1016/j.jpdc.2018.04.001 10.1371/journal.pone.0229345 10.1016/j.patcog.2022.109144 10.1016/j.ipm.2023.103271 10.1109/ACCESS.2020.2988120 10.1016/j.ipm.2020.102263 10.14778/3368289.3368302 10.1016/j.ipm.2021.102742 10.1109/TII.2019.2912723 10.1145/3477314.3507311 10.1038/ncomms5308 10.1080/01621459.1963.10500830 10.1023/A:1010933404324 10.26599/BDMA.2019.9020015 10.1145/1327452.1327492  | 
    
| ContentType | Journal Article | 
    
| Copyright | 2024 The Author(s) | 
    
| Copyright_xml | – notice: 2024 The Author(s) | 
    
| DBID | 6I. AAFTH AAYXX CITATION ADTOC UNPAY  | 
    
| DatabaseName | ScienceDirect Open Access Titles Elsevier:ScienceDirect:Open Access CrossRef Unpaywall for CDI: Periodical Content Unpaywall  | 
    
| DatabaseTitle | CrossRef | 
    
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: UNPAY name: Unpaywall url: https://proxy.k.utb.cz/login?url=https://unpaywall.org/ sourceTypes: Open Access Repository  | 
    
| DeliveryMethod | fulltext_linktorsrc | 
    
| Discipline | Library & Information Science | 
    
| ExternalDocumentID | 10.1016/j.ipm.2024.103746 10_1016_j_ipm_2024_103746 S0306457324001067  | 
    
| GrantInformation_xml | – fundername: Natural Science Foundation of Guangdong Province of China grantid: 2023A1515011619 funderid: http://dx.doi.org/10.13039/501100003453 – fundername: Key Basic Research Foundation of Shenzhen grantid: JCYJ20220818100205012  | 
    
    
    
| IEDL.DBID | UNPAY | 
    
| IngestDate | Tue Aug 19 16:26:53 EDT 2025 Thu Apr 24 23:10:43 EDT 2025 Wed Oct 01 01:14:19 EDT 2025 Sat Oct 19 15:54:43 EDT 2024  | 
    
| IsDoiOpenAccess | true | 
    
| IsOpenAccess | true | 
    
| IsPeerReviewed | true | 
    
| IsScholarly | true | 
    
| Issue | 4 | 
    
| Keywords | Scalable sampling; Random sample partition; Big data analysis; Block-level sampling | 
    
| License | This is an open access article under the CC BY-NC-ND license. | 
    
| LinkModel | DirectLink | 
    
    
| ORCID | 0000-0002-3321-879X 0009-0005-2171-0081 0000-0002-7901-9876  | 
    
| OpenAccessLink | https://proxy.k.utb.cz/login?url=https://doi.org/10.1016/j.ipm.2024.103746 | 
    
| ParticipantIDs | unpaywall_primary_10_1016_j_ipm_2024_103746 crossref_primary_10_1016_j_ipm_2024_103746 crossref_citationtrail_10_1016_j_ipm_2024_103746 elsevier_sciencedirect_doi_10_1016_j_ipm_2024_103746  | 
    
| ProviderPackageCode | CITATION AAYXX  | 
    
| PublicationCentury | 2000 | 
    
| PublicationDate | July 2024 2024-07-00  | 
    
| PublicationDateYYYYMMDD | 2024-07-01 | 
    
| PublicationDate_xml | – month: 07 year: 2024 text: July 2024  | 
    
| PublicationDecade | 2020 | 
    
| PublicationTitle | Information processing & management | 
    
| PublicationYear | 2024 | 
    
| Publisher | Elsevier Ltd | 
    
| Publisher_xml | – name: Elsevier Ltd | 
    
| References | Pang, Wang, Xia (b38) 2022; 59 Sun, Zhao, Chen, Cai, Wu, Huang (b49) 2024; 129 Nguyen, Shih, Parvathaneni, Xu, Srivastava, Tirthapura (b36) 2020 Jain, Boyapati, Venkatesh, Prakash (b24) 2022; 59 (pp. 3862–3869). Zogaj, Cambronero, Rinard, Cito (b60) 2021; 14 Singh, Herten, Deschrijver, Couckuyt, Dhaene (b46) 2017; 55 Breiman (b7) 2001; 45 Wang, Zhou, Luo, Li, Cai (b54) 2024; 237 Salloum, Huang, He (b41) 2019; 15 (pp. 1135–1152). Israel (b23) 1992 Siegel, Wagner (b44) 2022 Fazul, R. W. A., & Barcelos, P. P. (2022). An event-driven strategy for reactive replica balancing on apache hadoop distributed file system. In Shvachko, Kuang, Radia, Chansler (b43) 2010 (pp. 255–263). (pp. 367–370). Chawla, Bowyer, Hall, Kegelmeyer (b10) 2002; 16 Al-Kateb, Lee (b1) 2010 Ledoit, Wolf (b30) 2002; 30 Efron, Tibshirani (b15) 1994 Haas, P. J., Naughton, J. F., Seshadri, S., & Stokes, L. (1995). Sampling-based estimation of the number of distinct values of an attribute. In (pp. 157–167). Bagirov, Aliguliyev, Sultanova (b4) 2023; 135 (pp. 287–298). Wei, Salloum, Emara, Zhang, Huang, He (b55) 2018 (pp. 311–322). Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In Liu, Zhang (b32) 2020; 8 Kalavri, Brundza, Vlassov (b28) 2013 Fan, Muller, Rezucha (b17) 1962; 57 Zhang, Ren, Li, Baharin, Alghamdi, Alghamdi (b57) 2023; 60 Singh (b45) 2003 AWvd (b3) 1998 Yang, Jin, Yu, Hashim (b56) 2023; 60 Blatchford, Mannaerts, Zeng (b6) 2021; 94 Pan, Pedrycz, Yang, Wang (b37) 2024; 132 Chen, X., Zhang, F., & Wang, S. (2022). Efficient Approximate Algorithms for Empirical Variance with Hashed Block Sampling. In He, Huang, Long, Wang, Wei (b20) 2017 Singh, Masuku (b47) 2014; 2 Cunha, Canuto, Viegas, Salles, Gomes, Mangaravite (b12) 2020; 57 Kleiner, Talwalkar, Sarkar, Jordan (b29) 2014 Dean, Ghemawat (b14) 2008; 51 Park, Y., Qing, J., Shen, X., & Mozafari, B. (2019). Blinkml: Efficient maximum likelihood estimation with probabilistic guarantees. 
In Chaudhuri, Motwani, Narasayya (b9) 1998; 27 John, G. H., & Langley, P. (1996). Static Versus Dynamic Sampling for Data Mining. In Huang, S., Wang, C., Ding, B., & Chaudhuri, S. (2019). Efficient identification of approximate best configuration of training in large datasets. In Chaudhuri, S., Das, G., & Srivastava, U. (2004). Effective use of block-level sampling in statistics estimation. In Satyanarayana, Davidson (b42) 2005 Li, Wang, Liu, Chen, Chen (b31) 2023; 621 Apache (b2) 2023 Sunter (b50) 1977; 26 Jenkins, Quintana-Ascencio (b25) 2020; 15 Vitter (b52) 1985; 11 Walenz, Sintos, Roy, Yang (b53) 2019; 13 Hoeffding, Wassily (b21) 1963; 58 De Lange, Aljundi, Masana, Parisot, Jia, Leonardis (b13) 2021; 44 (pp. 23–32). Sun, Ngueilbaye, Luo, Cai, Wu, Huang (b48) 2024; 61 Loosli, Canu, Bottou (b33) 2007 Mahmud, Huang, Salloum, Emara, Sadatdiynov (b34) 2020; 3 Zhang, Zang, Zhu, Uddin, Amin (b58) 2022; 59 Zhou (b59) 2012 Emara, Huang (b16) 2019; 148 Meng (b35) 2013 Justel, Peña, Zamar (b27) 1997; 35 Baldi, Sadowski, Whiteson (b5) 2014; 5 Veiga, Expósito, Taboada, Tourino (b51) 2018; 120 He (10.1016/j.ipm.2024.103746_b20) 2017 Yang (10.1016/j.ipm.2024.103746_b56) 2023; 60 Al-Kateb (10.1016/j.ipm.2024.103746_b1) 2010 Kalavri (10.1016/j.ipm.2024.103746_b28) 2013 Jenkins (10.1016/j.ipm.2024.103746_b25) 2020; 15 Fan (10.1016/j.ipm.2024.103746_b17) 1962; 57 AWvd (10.1016/j.ipm.2024.103746_b3) 1998 10.1016/j.ipm.2024.103746_b40 Salloum (10.1016/j.ipm.2024.103746_b41) 2019; 15 Singh (10.1016/j.ipm.2024.103746_b47) 2014; 2 Singh (10.1016/j.ipm.2024.103746_b46) 2017; 55 Breiman (10.1016/j.ipm.2024.103746_b7) 2001; 45 Israel (10.1016/j.ipm.2024.103746_b23) 1992 Liu (10.1016/j.ipm.2024.103746_b32) 2020; 8 Zhang (10.1016/j.ipm.2024.103746_b57) 2023; 60 Meng (10.1016/j.ipm.2024.103746_b35) 2013 Veiga (10.1016/j.ipm.2024.103746_b51) 2018; 120 Walenz (10.1016/j.ipm.2024.103746_b53) 2019; 13 Chawla (10.1016/j.ipm.2024.103746_b10) 2002; 16 Sun (10.1016/j.ipm.2024.103746_b48) 2024; 61 
Sunter (10.1016/j.ipm.2024.103746_b50) 1977; 26 10.1016/j.ipm.2024.103746_b8 Baldi (10.1016/j.ipm.2024.103746_b5) 2014; 5 Blatchford (10.1016/j.ipm.2024.103746_b6) 2021; 94 10.1016/j.ipm.2024.103746_b39 Zogaj (10.1016/j.ipm.2024.103746_b60) 2021; 14 Zhang (10.1016/j.ipm.2024.103746_b58) 2022; 59 De Lange (10.1016/j.ipm.2024.103746_b13) 2021; 44 Pan (10.1016/j.ipm.2024.103746_b37) 2024; 132 Cunha (10.1016/j.ipm.2024.103746_b12) 2020; 57 Loosli (10.1016/j.ipm.2024.103746_b33) 2007 10.1016/j.ipm.2024.103746_b26 Chaudhuri (10.1016/j.ipm.2024.103746_b9) 1998; 27 Ledoit (10.1016/j.ipm.2024.103746_b30) 2002; 30 Satyanarayana (10.1016/j.ipm.2024.103746_b42) 2005 Li (10.1016/j.ipm.2024.103746_b31) 2023; 621 Siegel (10.1016/j.ipm.2024.103746_b44) 2022 Efron (10.1016/j.ipm.2024.103746_b15) 1994 Vitter (10.1016/j.ipm.2024.103746_b52) 1985; 11 10.1016/j.ipm.2024.103746_b22 Jain (10.1016/j.ipm.2024.103746_b24) 2022; 59 Justel (10.1016/j.ipm.2024.103746_b27) 1997; 35 Nguyen (10.1016/j.ipm.2024.103746_b36) 2020 Apache (10.1016/j.ipm.2024.103746_b2) 2023 Sun (10.1016/j.ipm.2024.103746_b49) 2024; 129 Singh (10.1016/j.ipm.2024.103746_b45) 2003 Dean (10.1016/j.ipm.2024.103746_b14) 2008; 51 Mahmud (10.1016/j.ipm.2024.103746_b34) 2020; 3 Shvachko (10.1016/j.ipm.2024.103746_b43) 2010 Zhou (10.1016/j.ipm.2024.103746_b59) 2012 Bagirov (10.1016/j.ipm.2024.103746_b4) 2023; 135 10.1016/j.ipm.2024.103746_b19 Kleiner (10.1016/j.ipm.2024.103746_b29) 2014 Hoeffding (10.1016/j.ipm.2024.103746_b21) 1963; 58 Wang (10.1016/j.ipm.2024.103746_b54) 2024; 237 10.1016/j.ipm.2024.103746_b18 10.1016/j.ipm.2024.103746_b11 Pang (10.1016/j.ipm.2024.103746_b38) 2022; 59 Emara (10.1016/j.ipm.2024.103746_b16) 2019; 148 Wei (10.1016/j.ipm.2024.103746_b55) 2018  | 
    
| References_xml | – reference: (pp. 367–370). – reference: Haas, P. J., Naughton, J. F., Seshadri, S., & Stokes, L. (1995). Sampling-based estimation of the number of distinct values of an attribute. In – volume: 59 year: 2022 ident: b24 article-title: An intelligent cognitive-inspired computing with big data analytics framework for sentiment analysis and classification publication-title: Information Processing & Management – volume: 60 year: 2023 ident: b56 article-title: Optimized hadoop map reduce system for strong analytics of cloud big product data on amazon web service publication-title: Information Processing & Management – reference: (pp. 255–263). – volume: 120 start-page: 323 year: 2018 end-page: 338 ident: b51 article-title: Enhancing in-memory efficiency for MapReduce-based data processing publication-title: Journal of Parallel and Distributed Computing – reference: Chaudhuri, S., Das, G., & Srivastava, U. (2004). Effective use of block-level sampling in statistics estimation. 
In – volume: 44 start-page: 3366 year: 2021 end-page: 3385 ident: b13 article-title: A continual learning survey: Defying forgetting in classification tasks publication-title: IEEE Transactions on Pattern Analysis and Machine Intelligence – start-page: 71 year: 2003 end-page: 136 ident: b45 article-title: Simple random sampling publication-title: Advanced sampling theory with applications – volume: 148 start-page: 105 year: 2019 end-page: 115 ident: b16 article-title: A distributed data management system to support large-scale data analysis publication-title: Journal of Systems and Software – volume: 57 year: 2020 ident: b12 article-title: Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling publication-title: Information Processing & Management – volume: 26 start-page: 261 year: 1977 end-page: 268 ident: b50 article-title: List sequential sampling with equal or unequal probabilities without replacement publication-title: Journal of the Royal Statistical Society. Series C. Applied Statistics – reference: (pp. 311–322). – volume: 51 start-page: 107 year: 2008 end-page: 113 ident: b14 article-title: MapReduce: simplified data processing on large clusters publication-title: Communications of the ACM – volume: 45 start-page: 5 year: 2001 end-page: 32 ident: b7 article-title: Random forests publication-title: Machine Learning – reference: Huang, S., Wang, C., Ding, B., & Chaudhuri, S. (2019). Efficient identification of approximate best configuration of training in large datasets. 
In – start-page: 250 year: 2013 end-page: 257 ident: b28 article-title: Block sampling: Efficient accurate online aggregation in mapreduce publication-title: 2013 IEEE 5th international conference on cloud computing technology and science – volume: 15 start-page: 5846 year: 2019 end-page: 5854 ident: b41 article-title: Random sample partition: a distributed data model for big data analysis publication-title: IEEE Transactions on Industrial Informatics – volume: 8 start-page: 72713 year: 2020 end-page: 72726 ident: b32 article-title: Sampling for big data profiling: A survey publication-title: IEEE Access – reference: (pp. 3862–3869). – start-page: 541 year: 2020 end-page: 552 ident: b36 article-title: Random sampling for group-by queries publication-title: 2020 IEEE 36th international conference on data engineering – volume: 5 start-page: 4308 year: 2014 ident: b5 article-title: Searching for exotic particles in high-energy physics with deep learning publication-title: Nature Communications – volume: 94 year: 2021 ident: b6 article-title: Determining representative sample size for validation of continuous, large continental remote sensing data publication-title: International Journal of Applied Earth Observation and Geoinformation – volume: 60 year: 2023 ident: b57 article-title: Developing scalable management information system with big financial data using data mart and mining architecture publication-title: Information Processing & Management – volume: 57 start-page: 387 year: 1962 end-page: 402 ident: b17 article-title: Development of sampling plans by using sequential (item by item) selection techniques and digital computers publication-title: Journal of the American Statistical Association – volume: 132 year: 2024 ident: b37 article-title: An improved generative adversarial network to oversample imbalanced datasets publication-title: Engineering Applications of Artificial Intelligence – volume: 27 start-page: 436 year: 1998 end-page: 447 ident: b9 
article-title: Random sampling for histogram construction: How much is enough? publication-title: ACM SIGMOD Record – reference: (pp. 287–298). – reference: Provost, F., Jensen, D., & Oates, T. (1999). Efficient progressive sampling. In – year: 2023 ident: b2 article-title: Apache spark – year: 1994 ident: b15 article-title: An introduction to the bootstrap – volume: 237 year: 2024 ident: b54 article-title: Generative adversarial minority enlargement—A local linear over-sampling synthetic method publication-title: Expert Systems with Applications – start-page: 531 year: 2013 end-page: 539 ident: b35 article-title: Scalable simple random sampling and stratified sampling publication-title: International conference on machine learning – volume: 61 year: 2024 ident: b48 article-title: A scalable and flexible basket analysis system for big transaction data in Spark publication-title: Information Processing & Management – volume: 135 year: 2023 ident: b4 article-title: Finding compact and well-separated clusters: Clustering using silhouette coefficients publication-title: Pattern Recognition – volume: 35 start-page: 251 year: 1997 end-page: 259 ident: b27 article-title: A multivariate Kolmogorov-Smirnov test of goodness of fit publication-title: Statistics & Probability Letters – volume: 2 start-page: 1 year: 2014 end-page: 22 ident: b47 article-title: Sampling techniques & determination of sample size in applied statistics research: An overview publication-title: International Journal of Economics, Commerce and Management – volume: 15 year: 2020 ident: b25 article-title: A solution to minimum sample size for regressions publication-title: PLoS One – reference: Chen, X., Zhang, F., & Wang, S. (2022). Efficient Approximate Algorithms for Empirical Variance with Hashed Block Sampling. 
In – start-page: 301 year: 2007 end-page: 320 ident: b33 article-title: Training invariant support vector machines using selective sampling publication-title: Large scale kernel machines – reference: (pp. 1135–1152). – volume: 16 start-page: 321 year: 2002 end-page: 357 ident: b10 article-title: SMOTE: synthetic minority over-sampling technique publication-title: Journal of Artificial Intelligence Research – volume: 59 year: 2022 ident: b38 article-title: Information matching model and multi-angle tracking algorithm for loan loss-linking customers based on the family mobile social-contact big data network publication-title: Information Processing & Management – volume: 14 start-page: 2059 year: 2021 end-page: 2072 ident: b60 article-title: Doing more with less: characterizing dataset downsampling for AutoML publication-title: Proceedings of the VLDB Endowment – start-page: 1 year: 1992 end-page: 5 ident: b23 article-title: Determining sample size publication-title: a series of the program evaluation and organizational development – start-page: 631 year: 2005 end-page: 640 ident: b42 article-title: A dynamic adaptive sampling algorithm (dasa) for real world applications: Finger print recognition and face recognition publication-title: International symposium on methodologies for intelligent systems – start-page: 360 year: 2017 end-page: 367 ident: b20 article-title: I-sampling: A new block-based sampling method for large-scale dataset publication-title: 2017 IEEE international congress on big data – start-page: 205 year: 2022 end-page: 235 ident: b44 article-title: Chapter 8 - random sampling: Planning ahead for data gathering publication-title: Practical business statistics (eighth edition) – reference: Park, Y., Qing, J., Shen, X., & Mozafari, B. (2019). Blinkml: Efficient maximum likelihood estimation with probabilistic guarantees. 
In – start-page: 1 year: 2010 end-page: 10 ident: b43 article-title: The hadoop distributed file system publication-title: 2010 IEEE 26th symposium on mass storage systems and technologies – start-page: 347 year: 2018 end-page: 364 ident: b55 article-title: A two-stage data processing algorithm to generate random sample partitions for big data analysis publication-title: International conference on cloud computing – volume: 55 start-page: 1425 year: 2017 end-page: 1438 ident: b46 article-title: A sequential sampling strategy for adaptive classification of computationally expensive data publication-title: Structural and Multidisciplinary Optimization – start-page: 621 year: 2010 end-page: 639 ident: b1 article-title: Stratified reservoir sampling over heterogeneous data streams publication-title: International conference on scientific and statistical database management – reference: Fazul, R. W. A., & Barcelos, P. P. (2022). An event-driven strategy for reactive replica balancing on apache hadoop distributed file system. In – reference: (pp. 23–32). 
– year: 1998 ident: b3 publication-title: Asymptotic statistics – volume: 3 start-page: 85 year: 2020 end-page: 101 ident: b34 article-title: A survey of data partitioning and sampling methods to support big data analysis publication-title: Big Data Mining and Analytics – volume: 11 start-page: 37 year: 1985 end-page: 57 ident: b52 article-title: Random sampling with a reservoir publication-title: ACM Transactions on Mathematical Software – year: 2012 ident: b59 article-title: Ensemble methods: foundations and algorithms – volume: 13 start-page: 390 year: 2019 end-page: 402 ident: b53 article-title: Learning to sample: counting with complex queries publication-title: Proceedings of the VLDB Endowment – volume: 58 start-page: 13 year: 1963 end-page: 30 ident: b21 article-title: Probability inequalities for sums of bounded random variables publication-title: Publications of the American Statistical Association – volume: 129 year: 2024 ident: b49 article-title: Non-MapReduce computing for intelligent big data analysis publication-title: Engineering Applications of Artificial Intelligence – volume: 30 start-page: 1081 year: 2002 end-page: 1102 ident: b30 article-title: Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size publication-title: The Annals of Statistics – start-page: 795 year: 2014 end-page: 816 ident: b29 article-title: A scalable bootstrap for massive data publication-title: Journal of the Royal Statistical Society. Series B. Statistical Methodology – reference: John, G. H., & Langley, P. (1996). Static Versus Dynamic Sampling for Data Mining. In – volume: 621 start-page: 371 year: 2023 end-page: 388 ident: b31 article-title: Subspace-based minority oversampling for imbalance classification publication-title: Information Sciences – reference: (pp. 157–167). 
– volume: 59 year: 2022 ident: b58 article-title: Big data-assisted social media analytics for business model for business decision making system competitive analysis publication-title: Information Processing & Management – volume: 11 start-page: 37 issue: 1 year: 1985 ident: 10.1016/j.ipm.2024.103746_b52 article-title: Random sampling with a reservoir publication-title: ACM Transactions on Mathematical Software doi: 10.1145/3147.3165 – start-page: 347 year: 2018 ident: 10.1016/j.ipm.2024.103746_b55 article-title: A two-stage data processing algorithm to generate random sample partitions for big data analysis – volume: 14 start-page: 2059 issue: 11 year: 2021 ident: 10.1016/j.ipm.2024.103746_b60 article-title: Doing more with less: characterizing dataset downsampling for AutoML publication-title: Proceedings of the VLDB Endowment doi: 10.14778/3476249.3476262 – start-page: 205 year: 2022 ident: 10.1016/j.ipm.2024.103746_b44 article-title: Chapter 8 - random sampling: Planning ahead for data gathering – volume: 55 start-page: 1425 year: 2017 ident: 10.1016/j.ipm.2024.103746_b46 article-title: A sequential sampling strategy for adaptive classification of computationally expensive data publication-title: Structural and Multidisciplinary Optimization doi: 10.1007/s00158-016-1584-1 – start-page: 621 year: 2010 ident: 10.1016/j.ipm.2024.103746_b1 article-title: Stratified reservoir sampling over heterogeneous data streams – volume: 621 start-page: 371 year: 2023 ident: 10.1016/j.ipm.2024.103746_b31 article-title: Subspace-based minority oversampling for imbalance classification publication-title: Information Sciences doi: 10.1016/j.ins.2022.11.108 – volume: 2 start-page: 1 issue: 11 year: 2014 ident: 10.1016/j.ipm.2024.103746_b47 article-title: Sampling techniques & determination of sample size in applied statistics research: An overview publication-title: International Journal of Economics, Commerce and Management – start-page: 1 year: 2010 ident: 
10.1016/j.ipm.2024.103746_b43 article-title: The hadoop distributed file system – year: 2023 ident: 10.1016/j.ipm.2024.103746_b2 – year: 2012 ident: 10.1016/j.ipm.2024.103746_b59 – start-page: 1 year: 1992 ident: 10.1016/j.ipm.2024.103746_b23 article-title: Determining sample size – volume: 57 start-page: 387 issue: 298 year: 1962 ident: 10.1016/j.ipm.2024.103746_b17 article-title: Development of sampling plans by using sequential (item by item) selection techniques and digital computers publication-title: Journal of the American Statistical Association doi: 10.1080/01621459.1962.10480667 – start-page: 301 year: 2007 ident: 10.1016/j.ipm.2024.103746_b33 article-title: Training invariant support vector machines using selective sampling – volume: 132 year: 2024 ident: 10.1016/j.ipm.2024.103746_b37 article-title: An improved generative adversarial network to oversample imbalanced datasets publication-title: Engineering Applications of Artificial Intelligence doi: 10.1016/j.engappai.2024.107934 – volume: 237 year: 2024 ident: 10.1016/j.ipm.2024.103746_b54 article-title: Generative adversarial minority enlargement—A local linear over-sampling synthetic method publication-title: Expert Systems with Applications doi: 10.1016/j.eswa.2023.121696 – start-page: 250 year: 2013 ident: 10.1016/j.ipm.2024.103746_b28 article-title: Block sampling: Efficient accurate online aggregation in mapreduce – start-page: 795 year: 2014 ident: 10.1016/j.ipm.2024.103746_b29 article-title: A scalable bootstrap for massive data publication-title: Journal of the Royal Statistical Society. Series B. 
Statistical Methodology doi: 10.1111/rssb.12050 – volume: 35 start-page: 251 issue: 3 year: 1997 ident: 10.1016/j.ipm.2024.103746_b27 article-title: A multivariate Kolmogorov-Smirnov test of goodness of fit publication-title: Statistics & Probability Letters doi: 10.1016/S0167-7152(97)00020-5 – ident: 10.1016/j.ipm.2024.103746_b8 doi: 10.1145/1007568.1007602 – volume: 27 start-page: 436 issue: 2 year: 1998 ident: 10.1016/j.ipm.2024.103746_b9 article-title: Random sampling for histogram construction: How much is enough? publication-title: ACM SIGMOD Record doi: 10.1145/276305.276343 – volume: 16 start-page: 321 year: 2002 ident: 10.1016/j.ipm.2024.103746_b10 article-title: SMOTE: synthetic minority over-sampling technique publication-title: Journal of Artificial Intelligence Research doi: 10.1613/jair.953 – ident: 10.1016/j.ipm.2024.103746_b26 – ident: 10.1016/j.ipm.2024.103746_b19 – ident: 10.1016/j.ipm.2024.103746_b22 doi: 10.1609/aaai.v33i01.33013862 – volume: 59 issue: 1 year: 2022 ident: 10.1016/j.ipm.2024.103746_b24 article-title: An intelligent cognitive-inspired computing with big data analytics framework for sentiment analysis and classification publication-title: Information Processing & Management doi: 10.1016/j.ipm.2021.102758 – volume: 94 year: 2021 ident: 10.1016/j.ipm.2024.103746_b6 article-title: Determining representative sample size for validation of continuous, large continental remote sensing data publication-title: International Journal of Applied Earth Observation and Geoinformation doi: 10.1016/j.jag.2020.102235 – start-page: 360 year: 2017 ident: 10.1016/j.ipm.2024.103746_b20 article-title: I-sampling: A new block-based sampling method for large-scale dataset – ident: 10.1016/j.ipm.2024.103746_b40 doi: 10.1145/312129.312188 – volume: 26 start-page: 261 issue: 3 year: 1977 ident: 10.1016/j.ipm.2024.103746_b50 article-title: List sequential sampling with equal or unequal probabilities without replacement publication-title: Journal of the Royal 
Statistical Society. Series C. Applied Statistics – year: 1994 ident: 10.1016/j.ipm.2024.103746_b15 – volume: 59 issue: 1 year: 2022 ident: 10.1016/j.ipm.2024.103746_b58 article-title: Big data-assisted social media analytics for business model for business decision making system competitive analysis publication-title: Information Processing & Management doi: 10.1016/j.ipm.2021.102762 – start-page: 71 year: 2003 ident: 10.1016/j.ipm.2024.103746_b45 article-title: Simple random sampling – volume: 61 issue: 2 year: 2024 ident: 10.1016/j.ipm.2024.103746_b48 article-title: A scalable and flexible basket analysis system for big transaction data in Spark publication-title: Information Processing & Management doi: 10.1016/j.ipm.2023.103577 – volume: 60 issue: 3 year: 2023 ident: 10.1016/j.ipm.2024.103746_b57 article-title: Developing scalable management information system with big financial data using data mart and mining architecture publication-title: Information Processing & Management doi: 10.1016/j.ipm.2023.103326 – volume: 30 start-page: 1081 issue: 4 year: 2002 ident: 10.1016/j.ipm.2024.103746_b30 article-title: Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size publication-title: The Annals of Statistics doi: 10.1214/aos/1031689018 – volume: 148 start-page: 105 year: 2019 ident: 10.1016/j.ipm.2024.103746_b16 article-title: A distributed data management system to support large-scale data analysis publication-title: Journal of Systems and Software doi: 10.1016/j.jss.2018.11.007 – volume: 129 year: 2024 ident: 10.1016/j.ipm.2024.103746_b49 article-title: Non-MapReduce computing for intelligent big data analysis publication-title: Engineering Applications of Artificial Intelligence doi: 10.1016/j.engappai.2023.107648 – ident: 10.1016/j.ipm.2024.103746_b11 doi: 10.1145/3534678.3539377 – ident: 10.1016/j.ipm.2024.103746_b39 doi: 10.1145/3299869.3300077 – start-page: 531 year: 2013 ident: 10.1016/j.ipm.2024.103746_b35 
article-title: Scalable simple random sampling and stratified sampling – volume: 120 start-page: 323 year: 2018 ident: 10.1016/j.ipm.2024.103746_b51 article-title: Enhancing in-memory efficiency for MapReduce-based data processing publication-title: Journal of Parallel and Distributed Computing doi: 10.1016/j.jpdc.2018.04.001 – volume: 15 issue: 2 year: 2020 ident: 10.1016/j.ipm.2024.103746_b25 article-title: A solution to minimum sample size for regressions publication-title: PLoS One doi: 10.1371/journal.pone.0229345 – volume: 135 year: 2023 ident: 10.1016/j.ipm.2024.103746_b4 article-title: Finding compact and well-separated clusters: Clustering using silhouette coefficients publication-title: Pattern Recognition doi: 10.1016/j.patcog.2022.109144 – volume: 60 issue: 3 year: 2023 ident: 10.1016/j.ipm.2024.103746_b56 article-title: Optimized hadoop map reduce system for strong analytics of cloud big product data on amazon web service publication-title: Information Processing & Management doi: 10.1016/j.ipm.2023.103271 – volume: 8 start-page: 72713 year: 2020 ident: 10.1016/j.ipm.2024.103746_b32 article-title: Sampling for big data profiling: A survey publication-title: IEEE Access doi: 10.1109/ACCESS.2020.2988120 – start-page: 541 year: 2020 ident: 10.1016/j.ipm.2024.103746_b36 article-title: Random sampling for group-by queries – volume: 57 issue: 4 year: 2020 ident: 10.1016/j.ipm.2024.103746_b12 article-title: Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling publication-title: Information Processing & Management doi: 10.1016/j.ipm.2020.102263 – volume: 13 start-page: 390 issue: 3 year: 2019 ident: 10.1016/j.ipm.2024.103746_b53 article-title: Learning to sample: counting with complex queries publication-title: Proceedings of the VLDB Endowment doi: 10.14778/3368289.3368302 – volume: 59 issue: 1 year: 2022 ident: 10.1016/j.ipm.2024.103746_b38 article-title: Information 
matching model and multi-angle tracking algorithm for loan loss-linking customers based on the family mobile social-contact big data network publication-title: Information Processing & Management doi: 10.1016/j.ipm.2021.102742 – volume: 15 start-page: 5846 issue: 11 year: 2019 ident: 10.1016/j.ipm.2024.103746_b41 article-title: Random sample partition: a distributed data model for big data analysis publication-title: IEEE Transactions on Industrial Informatics doi: 10.1109/TII.2019.2912723 – start-page: 631 year: 2005 ident: 10.1016/j.ipm.2024.103746_b42 article-title: A dynamic adaptive sampling algorithm (dasa) for real world applications: Finger print recognition and face recognition – year: 1998 ident: 10.1016/j.ipm.2024.103746_b3 – ident: 10.1016/j.ipm.2024.103746_b18 doi: 10.1145/3477314.3507311 – volume: 5 start-page: 4308 issue: 1 year: 2014 ident: 10.1016/j.ipm.2024.103746_b5 article-title: Searching for exotic particles in high-energy physics with deep learning publication-title: Nature Communications doi: 10.1038/ncomms5308 – volume: 58 start-page: 13 issue: 301 year: 1963 ident: 10.1016/j.ipm.2024.103746_b21 article-title: Probability inequalities for sums of bounded random variables publication-title: Publications of the American Statistical Association doi: 10.1080/01621459.1963.10500830 – volume: 45 start-page: 5 year: 2001 ident: 10.1016/j.ipm.2024.103746_b7 article-title: Random forests publication-title: Machine Learning doi: 10.1023/A:1010933404324 – volume: 3 start-page: 85 issue: 2 year: 2020 ident: 10.1016/j.ipm.2024.103746_b34 article-title: A survey of data partitioning and sampling methods to support big data analysis publication-title: Big Data Mining and Analytics doi: 10.26599/BDMA.2019.9020015 – volume: 51 start-page: 107 issue: 1 year: 2008 ident: 10.1016/j.ipm.2024.103746_b14 article-title: MapReduce: simplified data processing on large clusters publication-title: Communications of the ACM doi: 10.1145/1327452.1327492 – volume: 44 
start-page: 3366 issue: 7 year: 2021 ident: 10.1016/j.ipm.2024.103746_b13 article-title: A continual learning survey: Defying forgetting in classification tasks publication-title: IEEE Transactions on Pattern Analysis and Machine Intelligence  | 
    
| StartPage | 103746 | 
    
| SubjectTerms | Big data analysis; Block-level sampling; Random sample partition; Scalable sampling | 
    
| Title | CDFRS: A scalable sampling approach for efficient big data analysis | 
    
| URI | https://dx.doi.org/10.1016/j.ipm.2024.103746 https://doi.org/10.1016/j.ipm.2024.103746  | 
    
| Volume | 61 | 
    