Intelligent sampling for big data using bootstrap sampling and Chebyshev inequality

Bibliographic Details
Published in	2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 1-6
Main Author	Satyanarayana, Ashwin
Format	Conference Proceeding
Language	English
Published	IEEE 01.05.2014
Subjects	Chebyshev approximation; Convergence; Light emitting diodes; Tin
ISBN	1479930997 9781479930999
ISSN	0840-7789
DOI	10.1109/CCECE.2014.6901029

Abstract	The amount of data being generated and stored is growing exponentially, owing in part to continuing advances in computer technology. These data present tremendous opportunities for data mining, a burgeoning field in computer science that focuses on developing methods that can extract knowledge from data. In many real-world problems, data mining algorithms have access to massive amounts of data, but mining all the available data is prohibitive because of computational (time and memory) constraints. Much of the current research is concerned with scaling up data mining algorithms (i.e., improving existing data mining algorithms so that they handle larger datasets). An alternative approach is to scale down the data. Thus, determining the smallest sufficient training set size that obtains the same accuracy as the entire available dataset remains an important research question. Our research focuses on selecting how many instances (the sample size) to present to the data mining algorithm. The goals of this paper are to study and characterize the properties of learning curves, to integrate them with the Chebyshev bound to derive an efficient general-purpose adaptive sampling schedule, and to empirically validate our algorithm for scaling down the data.
Author affiliation	Satyanarayana, Ashwin (asatyanarayana@citytech.cuny.edu); Comput. Syst. Technol., New York City Coll. of Technol., New York, NY, USA
Discipline	Engineering
EISBN	9781479931019 1479931012
EndPage	6
ExternalDocumentID	6901029
Genre	orig-research
IsPeerReviewed	false
IsScholarly	false
PageCount	6
StartPage	1
URI	https://ieeexplore.ieee.org/document/6901029