EM*: An EM Algorithm for Big Data

Bibliographic Details
Published in 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 312-320
Main Authors Kurban, Hasan; Jenne, Mark; Dalkilic, Mehmet M.
Format Conference Proceeding
Language English
Published IEEE 01.10.2016
DOI 10.1109/DSAA.2016.40

Abstract Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed by big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms such as the Expectation Maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a non-linear hierarchical data structure (a heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T algorithm EM*. We show that EM* outperforms EM-T on large real-world and synthetic data sets. We conclude with some theoretical underpinnings that explain why EM* is successful.
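The abstract gives only the outline of the strategy, so the following Python sketch is an illustration of the general idea and not the authors' EM* implementation. It assumes a 1-D Gaussian mixture, a min-heap keyed on each point's largest posterior probability, and a fixed fraction of points revisited per iteration. The names `heap_guided_em`, `e_step`, `demo_data` and the parameter `active_fraction` are hypothetical and do not come from the paper.

```python
import heapq
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def e_step(x, weights, means, variances):
    """Posterior probability of each mixture component for one point."""
    dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens) or 1e-300  # guard against underflow to zero
    return [d / total for d in dens]


def heap_guided_em(data, init_means, n_iter=20, active_fraction=0.5):
    """EM for a 1-D Gaussian mixture that re-evaluates only hard points.

    Assumption (not from the paper): a point is "hard" when its largest
    posterior is small, and a fixed fraction of points is revisited.
    """
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    n_active = min(len(data), max(1, int(active_fraction * len(data))))

    # One full pass so that every point has a cached posterior.
    resp = [e_step(x, weights, means, variances) for x in data]
    evaluations = len(data)

    for _step in range(n_iter):
        # M-step from cached posteriors (stale for points skipped earlier).
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty component: keep its previous parameters
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a degenerate component
            weights[j] = nj / len(data)

        # Min-heap keyed on confidence: the least confident points pop first.
        heap = [(max(r), i) for i, r in enumerate(resp)]
        heapq.heapify(heap)
        for _pop in range(n_active):
            _, i = heapq.heappop(heap)
            resp[i] = e_step(data[i], weights, means, variances)
        evaluations += n_active

    return means, evaluations


def demo_data(seed=0, n=200):
    """Two well-separated synthetic clusters centred at 0 and 10."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)] + [rng.gauss(10.0, 1.0) for _ in range(n)]
```

Calling `heap_guided_em(demo_data(), [1.0, 8.0])` returns the estimated means together with the number of per-point posterior evaluations, which can be compared against `len(data) * (n_iter + 1)` for an EM loop that revisits every point. This sketch still sums over all cached posteriors in the M-step; a fuller implementation would presumably maintain those sums incrementally.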
Author_xml – sequence: 1
  givenname: Hasan
  surname: Kurban
  fullname: Kurban, Hasan
  email: hakurban@indiana.edu
  organization: Comput. Sci. Dept., Indiana Univ., Bloomington, IN, USA
– sequence: 2
  givenname: Mark
  surname: Jenne
  fullname: Jenne, Mark
  email: mjenne@indiana.edu
  organization: Comput. Sci. Dept., Indiana Univ., Bloomington, IN, USA
– sequence: 3
  givenname: Mehmet M.
  surname: Dalkilic
  fullname: Dalkilic, Mehmet M.
  email: dalkilic@indiana.edu
  organization: Comput. Sci. Dept., Indiana Univ., Bloomington, IN, USA
CODEN IEEPAD
EISBN 9781509052066
1509052062
SubjectTerms Big Data
clustering
Clustering algorithms
Convergence
Covariance matrices
Data mining
expectation maximization
Gaussian distribution
heap
Iterative algorithms
URI https://ieeexplore.ieee.org/document/7796917