A scalable algorithm for single-linkage hierarchical clustering on distributed-memory architectures

Bibliographic Details
Published in 2013 IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV) pp. 7-13
Main Authors Hendrix, William; Palsetia, Diana; Ali Patwary, Md Mostofa; Agrawal, Ankit; Wei-keng Liao; Choudhary, Alok
Format Conference Proceeding
Language English
Published IEEE 01.10.2013
DOI 10.1109/LDAV.2013.6675153

Abstract Hierarchical clustering is a fundamental and widely-used clustering algorithm with many advantages over traditional partitional clustering. Due to the explosion in size of modern scientific datasets, there is a pressing need for scalable analytics algorithms, but good scaling is difficult to achieve for hierarchical clustering due to data dependencies inherent in the algorithm. To the best of our knowledge, no previous work on parallel hierarchical clustering has shown scalability beyond a couple hundred processes. In this paper, we present PINK, a scalable parallel algorithm for single-linkage hierarchical clustering based on decomposing a problem instance into two different types of subproblems. Despite the heterogeneous workloads, our algorithm exhibits good load balancing, as well as low memory requirements and a communication pattern that is both low-volume and deterministic. Evaluating PINK on up to 6050 processes, we find that it achieves speedups up to approximately 6600.
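
For readers unfamiliar with the underlying problem, the following is a minimal sequential sketch of single-linkage hierarchical clustering. It is not the PINK algorithm, and the record above does not describe PINK's subproblem decomposition in enough detail to reproduce it. The sketch relies on the well-known equivalence between a single-linkage dendrogram and a minimum spanning tree of the complete distance graph, built here with Kruskal's algorithm and a union-find structure. The function name `single_linkage`, the Euclidean metric, and the sample points are illustrative assumptions.

```python
from itertools import combinations
from math import dist


def single_linkage(points):
    """Return dendrogram merges as (i, j, distance) tuples, closest first.

    Illustrative sequential sketch only (assumes Euclidean distance);
    this is NOT the parallel PINK algorithm from the paper.
    """
    parent = list(range(len(points)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise edges, shortest first (O(n^2) memory; fine for a demo).
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    merges = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # The edge joins two different clusters: record the merge.
            parent[ri] = rj
            merges.append((i, j, d))
            if len(merges) == len(points) - 1:
                break
    return merges


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (5, 0), (5, 1.5), (20, 0)]
    for i, j, d in single_linkage(sample):
        print(f"merge clusters containing points {i} and {j} at distance {d:.3f}")
```

Each recorded merge corresponds to one minimum-spanning-tree edge, so cutting the merge list at a distance threshold yields the flat clustering at that level. The data dependency between successive merges visible in the loop is the kind of sequential bottleneck the abstract refers to.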
Author_xml – sequence: 1
  givenname: William
  surname: Hendrix
  fullname: Hendrix, William
  email: whendrix@northwestern.edu
– sequence: 2
  givenname: Diana
  surname: Palsetia
  fullname: Palsetia, Diana
  email: palsetia@u.northwestern.edu
– sequence: 3
  givenname: Md Mostofa
  surname: Ali Patwary
  fullname: Ali Patwary, Md Mostofa
  email: m-patwary@northwestern.edu
– sequence: 4
  givenname: Ankit
  surname: Agrawal
  fullname: Agrawal, Ankit
  email: ankitag@eecs.northwestern.edu
– sequence: 5
  surname: Wei-keng Liao
  fullname: Wei-keng Liao
  email: wkliao@ece.northwestern.edu
– sequence: 6
  givenname: Alok
  surname: Choudhary
  fullname: Choudhary, Alok
  email: choudhar@eecs.northwestern.edu
ContentType Conference Proceeding
DOI 10.1109/LDAV.2013.6675153
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
EISBN 1479916595
9781479916597
EndPage 13
ExternalDocumentID 6675153
Genre orig-research
IngestDate Wed Dec 20 05:18:54 EST 2023
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
PageCount 7
ParticipantIDs ieee_primary_6675153
PublicationCentury 2000
PublicationDate 2013-Oct.
PublicationDateYYYYMMDD 2013-10-01
PublicationDate_xml – month: 10
  year: 2013
  text: 2013-Oct.
PublicationDecade 2010
PublicationTitle 2013 IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV)
PublicationTitleAbbrev LDAV
PublicationYear 2013
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 7
SubjectTerms Arrays
Clustering algorithms
D.1.3 [Programming Techniques]: Concurrent Programming-Parallel Programming
Educational institutions
I.5.3 [Information Systems Applications]: Clustering-Algorithms
Merging
Parallel algorithms
Partitioning algorithms
Vegetation
Title A scalable algorithm for single-linkage hierarchical clustering on distributed-memory architectures
URI https://ieeexplore.ieee.org/document/6675153