EM*: An EM Algorithm for Big Data

Bibliographic Details
Published in	2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 312-320
Main Authors	Kurban, Hasan; Jenne, Mark; Dalkilic, Mehmet M.
Format	Conference Proceeding
Language	English
Published	IEEE 01.10.2016
Subjects	Big Data; clustering; Clustering algorithms; Convergence; Covariance matrices; Data mining; expectation maximization; Gaussian distribution; heap; Iterative algorithms
DOI	10.1109/DSAA.2016.40

Abstract	Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed by big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms such as the Expectation Maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a non-linear hierarchical data structure (a heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T EM*. We show that our EM* algorithm outperforms the EM-T algorithm on large real-world and synthetic data sets. We conclude with some theoretical underpinnings that explain why EM* is successful.
Author affiliations	1. Hasan Kurban (hakurban@indiana.edu); 2. Mark Jenne (mjenne@indiana.edu); 3. Mehmet M. Dalkilic (dalkilic@indiana.edu). All: Comput. Sci. Dept., Indiana Univ., Bloomington, IN, USA
CODEN	IEEPAD
EISBN	9781509052066 1509052062
URI	https://ieeexplore.ieee.org/document/7796917