Intelligent sampling for big data using bootstrap sampling and Chebyshev inequality
| Published in | 2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE) pp. 1 - 6 |
|---|---|
| Main Author | Satyanarayana, Ashwin |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.05.2014 |
| Subjects | |
| ISBN | 1479930997 9781479930999 |
| ISSN | 0840-7789 |
| DOI | 10.1109/CCECE.2014.6901029 |
| Abstract | The amount of data being generated and stored is growing exponentially, owing in part to continuing advances in computer technology. These data present tremendous opportunities for data mining, a burgeoning field in computer science that focuses on developing methods that can extract knowledge from data. In many real-world problems, data mining algorithms have access to massive amounts of data, and mining all the available data is prohibitive because of computational (time and memory) constraints. Much current research is concerned with scaling up data mining algorithms (i.e., improving existing data mining algorithms to handle larger datasets). An alternative approach is to scale down the data. Determining the smallest sufficient training set size that obtains the same accuracy as the entire available dataset therefore remains an important research question. Our research focuses on selecting how many instances (the sample size) to present to the data mining algorithm. The goals of this paper are to study and characterize the properties of learning curves, to integrate them with the Chebyshev bound to produce an efficient, general-purpose adaptive sampling schedule, and to empirically validate our algorithm for scaling down the data. |
|---|---|
| Author | Satyanarayana, Ashwin |
| Author_xml | – sequence: 1 givenname: Ashwin surname: Satyanarayana fullname: Satyanarayana, Ashwin email: asatyanarayana@citytech.cuny.edu organization: Comput. Syst. Technol., New York City Coll. of Technol., New York, NY, USA |
| ContentType | Conference Proceeding |
| Discipline | Engineering |
| EISBN | 9781479931019 1479931012 |
| EndPage | 6 |
| ExternalDocumentID | 6901029 |
| Genre | orig-research |
| IsPeerReviewed | false |
| IsScholarly | false |
| PageCount | 6 |
| PublicationCentury | 2000 |
| PublicationDate | 2014-May |
| PublicationDateYYYYMMDD | 2014-05-01 |
| PublicationDate_xml | – month: 05 year: 2014 text: 2014-May |
| PublicationDecade | 2010 |
| PublicationTitle | 2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE) |
| PublicationTitleAbbrev | CCECE |
| PublicationYear | 2014 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SubjectTerms | Chebyshev approximation; Convergence; Light emitting diodes; Tin |
| Title | Intelligent sampling for big data using bootstrap sampling and Chebyshev inequality |
| URI | https://ieeexplore.ieee.org/document/6901029 |
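
The abstract refers to combining bootstrap sampling with the Chebyshev bound to decide how many instances to sample, but this record does not reproduce the paper's algorithm. The following is only a minimal sketch of the general idea, under stated assumptions: a per-instance score of 1 (correct) or 0 (incorrect), a geometric schedule that doubles the sample size, and a stopping rule based on Chebyshev's inequality, P(|X̄ − μ| ≥ ε) ≤ Var(X̄)/ε². The function names and the parameters `epsilon`, `delta`, `start`, and `growth` are hypothetical and are not taken from the paper.

```python
import math
import random
import statistics


def chebyshev_sample_size(variance, epsilon, delta):
    """Smallest n with variance / (n * epsilon**2) <= delta.

    Follows from Chebyshev's inequality applied to the mean of n
    independent draws that each have the given variance.
    """
    if variance < 0 or epsilon <= 0 or not 0 < delta <= 1:
        raise ValueError("need variance >= 0, epsilon > 0, 0 < delta <= 1")
    return max(1, math.ceil(variance / (delta * epsilon ** 2)))


def bootstrap_variance_of_mean(values, n_resamples=200, rng=None):
    """Estimate Var(mean) by resampling `values` with replacement."""
    rng = rng or random.Random(0)
    n = len(values)
    means = [statistics.fmean(rng.choices(values, k=n))
             for _ in range(n_resamples)]
    return statistics.pvariance(means)


def adaptive_sample_size(population, epsilon=0.05, delta=0.05,
                         start=50, growth=2, rng=None):
    """Grow the sample until the Chebyshev bound on the estimation
    error of the mean drops to delta or below (hypothetical schedule)."""
    rng = rng or random.Random(0)
    n = start
    while n < len(population):
        sample = rng.sample(population, n)
        var_of_mean = bootstrap_variance_of_mean(sample, rng=rng)
        # Chebyshev: P(|mean_hat - mean| >= epsilon) <= Var(mean_hat) / epsilon^2
        if var_of_mean / epsilon ** 2 <= delta:
            return n
        n = min(len(population), n * growth)
    return len(population)
```

For example, calling `adaptive_sample_size` on a list of 0/1 correctness scores returns the first sample size in the doubling schedule at which the bootstrap variance estimate satisfies the bound, or the full population size if none does. Chebyshev's inequality makes no distributional assumption, so the resulting sizes tend to be conservative.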