Intelligent sampling for big data using bootstrap sampling and Chebyshev inequality

Bibliographic Details
Published in 2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 1-6
Main Author Satyanarayana, Ashwin
Format Conference Proceeding
Language English
Published IEEE 01.05.2014
ISBN 1479930997, 9781479930999
ISSN 0840-7789
DOI 10.1109/CCECE.2014.6901029


Abstract The amount of data being generated and stored is growing exponentially, owing in part to continuing advances in computer technology. These data present tremendous opportunities for data mining, a burgeoning field in computer science that focuses on developing methods to extract knowledge from data. In many real-world problems, data mining algorithms have access to massive amounts of data, but mining all of it is prohibitive because of computational (time and memory) constraints. Much current research is concerned with scaling up data mining algorithms (i.e., improving existing algorithms so that they handle larger datasets). An alternative approach is to scale down the data. Determining the smallest sufficient training set size that achieves the same accuracy as the entire available dataset therefore remains an important research question. Our research focuses on selecting how many instances to sample and present to the data mining algorithm. The goals of this paper are to study and characterize the properties of learning curves, to integrate them with the Chebyshev bound to arrive at an efficient general-purpose adaptive sampling schedule, and to empirically validate our algorithm for scaling down the data.
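The abstract names the ingredients (bootstrap sampling, the Chebyshev inequality, an adaptive sampling schedule) but does not give the schedule itself. The following is a minimal sketch of how such pieces could fit together, not the paper's algorithm. It assumes the quantity of interest is the mean of a per-instance score (for example, 0/1 correctness), uses Chebyshev's inequality in the form P(|mean - mu| >= epsilon) <= Var(mean) / epsilon^2, and doubles the sample size until that bound falls below a tolerance delta. All function names, the doubling rule, and the defaults are hypothetical.

```python
import math
import random
import statistics


def chebyshev_sample_size(variance, epsilon, delta):
    """Smallest n such that variance / (n * epsilon**2) <= delta."""
    return math.ceil(variance / (delta * epsilon ** 2))


def bootstrap_variance(values, n_resamples=200, rng=None):
    """Bootstrap estimate of the variance of the sample mean."""
    rng = rng or random.Random(0)
    means = [
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    ]
    return statistics.pvariance(means)


def adaptive_sample(population, epsilon, delta, start=50, rng=None):
    """Double the sample size until the Chebyshev bound drops below delta.

    Hypothetical schedule for illustration; the paper's schedule may differ.
    """
    rng = rng or random.Random(0)
    n = min(start, len(population))
    while True:
        sample = rng.sample(population, n)
        # Chebyshev: P(|mean - mu| >= epsilon) <= Var(mean) / epsilon**2
        bound = bootstrap_variance(sample, rng=rng) / epsilon ** 2
        if bound <= delta or n == len(population):
            return sample, bound
        n = min(2 * n, len(population))
```

Because Chebyshev's inequality makes no distributional assumptions, the resulting sample sizes tend to be conservative; a tighter bound would stop earlier at the cost of stronger assumptions.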
Author Affiliation Comput. Syst. Technol., New York City Coll. of Technol., New York, NY, USA
Author Email asatyanarayana@citytech.cuny.edu
EISBN 9781479931019
1479931012
Subjects Chebyshev approximation; Convergence; Light emitting diodes; Tin
URI https://ieeexplore.ieee.org/document/6901029