EM*: An EM Algorithm for Big Data
| Published in | 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 312-320 |
|---|---|
| Main Authors | Kurban, Hasan; Jenne, Mark; Dalkilic, Mehmet M. |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.10.2016 |
| DOI | 10.1109/DSAA.2016.40 |
| Abstract | Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed by big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms such as the Expectation Maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a non-linear hierarchical data structure (heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T EM*. We show that our EM* algorithm outperforms the EM-T algorithm on large real-world and synthetic data sets. We conclude with some theoretical underpinnings that explain why EM* is successful. |
    
|---|---|
| Author | Dalkilic, Mehmet M.; Kurban, Hasan; Jenne, Mark |
    
| Author affiliations | 1. Kurban, Hasan (hakurban@indiana.edu), Comput. Sci. Dept., Indiana Univ., Bloomington, IN, USA; 2. Jenne, Mark (mjenne@indiana.edu), Comput. Sci. Dept., Indiana Univ., Bloomington, IN, USA; 3. Dalkilic, Mehmet M. (dalkilic@indiana.edu), Comput. Sci. Dept., Indiana Univ., Bloomington, IN, USA |
    
| CODEN | IEEPAD | 
    
| EISBN | 9781509052066; 1509052062 |
    
| EndPage | 320 | 
    
| ExternalDocumentID | 7796917 | 
    
| Genre | orig-research | 
    
| IsPeerReviewed | false | 
    
| IsScholarly | false | 
    
| PageCount | 9 | 
    
| PublicationCentury | 2000 | 
    
| PublicationDate | 2016-Oct. | 
    
| PublicationDateYYYYMMDD | 2016-10-01 | 
    
| PublicationTitle | 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA) | 
    
| PublicationTitleAbbrev | DSAA | 
    
| PublicationYear | 2016 | 
    
| Publisher | IEEE | 
    
| StartPage | 312 | 
    
| SubjectTerms | Big Data; clustering; Clustering algorithms; Convergence; Covariance matrices; Data mining; expectation maximization; Gaussian distribution; heap; Iterative algorithms |
    
| Title | EM*: An EM Algorithm for Big Data |
    
| URI | https://ieeexplore.ieee.org/document/7796917 | 
    
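
The abstract describes embedding EM in a heap so that easy-to-cluster data is not revisited and later iterations concentrate on the harder points. The record does not give the algorithm itself, so the following is only a rough Python sketch of that general idea, not the authors' EM*. Everything in it is an assumption made for illustration: the function name `heap_em`, the 1-D Gaussian mixture, the evenly spaced initial means, the use of the largest responsibility as a "confidence" key, and the `revisit_fraction` parameter.

```python
import heapq
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """E-step for one point: posterior probability of each component."""
    dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens) or 1e-300  # guard against underflow to zero
    return [d / total for d in dens]


def heap_em(data, k=2, iters=20, revisit_fraction=0.5):
    """Illustrative heap-guided EM for a 1-D Gaussian mixture (hypothetical, not the paper's EM*)."""
    n = len(data)
    lo, hi = min(data), max(data)
    # Assumption: initial means evenly spaced over the data range.
    means = [lo + (hi - lo) * j / max(k - 1, 1) for j in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = [None] * n          # cached responsibilities for every point
    active = list(range(n))    # indices that are revisited in the current round
    visits = 0                 # number of per-point E-step evaluations

    for _ in range(iters):
        # E-step on active points only; all other points keep cached values.
        heap = []
        for i in active:
            resp[i] = responsibilities(data[i], weights, means, variances)
            visits += 1
            # Min-heap keyed by confidence: least confident points come out first.
            heapq.heappush(heap, (max(resp[i]), i))

        # M-step uses the cached responsibilities of all points.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # keep variances strictly positive

        # Narrow the next round to the least confident fraction of active points.
        keep = max(1, int(len(heap) * revisit_fraction))
        active = [i for _, i in heapq.nsmallest(keep, heap)]

    return sorted(means), visits
```

With `revisit_fraction` below 1, the active set shrinks every round, so the total number of point visits stays below `len(data) * iters`, the cost of a plain EM loop that touches all data in every iteration. The trade-off in this sketch is that frozen points keep stale responsibilities, which can bias the estimates when clusters overlap.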