A scalable algorithm for single-linkage hierarchical clustering on distributed-memory architectures
| Published in | 2013 IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV) pp. 7 - 13 |
|---|---|
| Main Authors | Hendrix, William; Palsetia, Diana; Ali Patwary, Md Mostofa; Agrawal, Ankit; Wei-keng Liao; Choudhary, Alok |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.10.2013 |
| DOI | 10.1109/LDAV.2013.6675153 |
| Abstract | Hierarchical clustering is a fundamental and widely-used clustering algorithm with many advantages over traditional partitional clustering. Due to the explosion in size of modern scientific datasets, there is a pressing need for scalable analytics algorithms, but good scaling is difficult to achieve for hierarchical clustering due to data dependencies inherent in the algorithm. To the best of our knowledge, no previous work on parallel hierarchical clustering has shown scalability beyond a couple hundred processes. In this paper, we present PINK, a scalable parallel algorithm for single-linkage hierarchical clustering based on decomposing a problem instance into two different types of subproblems. Despite the heterogeneous workloads, our algorithm exhibits good load balancing, as well as low memory requirements and a communication pattern that is both low-volume and deterministic. Evaluating PINK on up to 6050 processes, we find that it achieves speedups up to approximately 6600. |
|---|---|
| Author | Agrawal, Ankit; Hendrix, William; Ali Patwary, Md Mostofa; Wei-keng Liao; Choudhary, Alok; Palsetia, Diana |
| Author_xml | – sequence: 1 givenname: William surname: Hendrix fullname: Hendrix, William email: whendrix@northwestern.edu – sequence: 2 givenname: Diana surname: Palsetia fullname: Palsetia, Diana email: palsetia@u.northwestern.edu – sequence: 3 givenname: Md Mostofa surname: Ali Patwary fullname: Ali Patwary, Md Mostofa email: m-patwary@northwestern.edu – sequence: 4 givenname: Ankit surname: Agrawal fullname: Agrawal, Ankit email: ankitag@eecs.northwestern.edu – sequence: 5 surname: Wei-keng Liao fullname: Wei-keng Liao email: wkliao@ece.northwestern.edu – sequence: 6 givenname: Alok surname: Choudhary fullname: Choudhary, Alok email: choudhar@eecs.northwestern.edu |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IL CBEJK RIE RIL |
| DOI | 10.1109/LDAV.2013.6675153 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| EISBN | 1479916595 9781479916597 |
| EndPage | 13 |
| ExternalDocumentID | 6675153 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IL ALMA_UNASSIGNED_HOLDINGS CBEJK RIB RIC RIE RIL |
| ID | FETCH-LOGICAL-i90t-44aa1d848baffa461d2b3cd6f817e8a76e4725da4960a9c0c8423c2d0b3bded93 |
| IEDL.DBID | RIE |
| IngestDate | Wed Dec 20 05:18:54 EST 2023 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-i90t-44aa1d848baffa461d2b3cd6f817e8a76e4725da4960a9c0c8423c2d0b3bded93 |
| PageCount | 7 |
| ParticipantIDs | ieee_primary_6675153 |
| PublicationCentury | 2000 |
| PublicationDate | 2013-Oct. |
| PublicationDateYYYYMMDD | 2013-10-01 |
| PublicationDate_xml | – month: 10 year: 2013 text: 2013-Oct. |
| PublicationDecade | 2010 |
| PublicationTitle | 2013 IEEE Symposium on Large-Scale Data Analysis and Visualization (LDAV) |
| PublicationTitleAbbrev | LDAV |
| PublicationYear | 2013 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SSID | ssib026764148 |
| Score | 1.6261497 |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 7 |
| SubjectTerms | Arrays; Clustering algorithms; D.1.3 [Programming Techniques]: Concurrent Programming-Parallel Programming; Educational institutions; I.5.3 [Information Systems Applications]: Clustering-Algorithms; Merging; Parallel algorithms; Partitioning algorithms; Vegetation |
| Title | A scalable algorithm for single-linkage hierarchical clustering on distributed-memory architectures |
| URI | https://ieeexplore.ieee.org/document/6675153 |
| hasFullText | 1 |
| inHoldings | 1 |
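
For readers unfamiliar with the problem the abstract addresses, the following is a minimal serial sketch of single-linkage hierarchical clustering. It is not the PINK algorithm and makes no attempt at parallelism. It relies on the well-known equivalence between a single-linkage dendrogram and a minimum spanning tree of the complete distance graph, built here with Kruskal's algorithm and a union-find structure. The function name `single_linkage` and the Euclidean distance are assumptions chosen for illustration.

```python
from itertools import combinations
from math import dist


def single_linkage(points):
    """Return the merges of a single-linkage dendrogram as (distance, i, j).

    Hypothetical helper for illustration: each tuple records the pair of
    points whose edge joined two clusters, in order of increasing distance.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest: one cluster per point

    def find(i):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, shortest first (O(n^2) memory).
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    merges = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge links two different clusters: merge them
            parent[rj] = ri
            merges.append((d, i, j))
            if len(merges) == n - 1:  # dendrogram is complete
                break
    return merges


if __name__ == "__main__":
    for d, i, j in single_linkage([(0, 0), (0, 1), (5, 0), (5, 2)]):
        print(f"merge clusters of points {i} and {j} at distance {d:.3f}")
```

Because every edge is sorted globally before any merge is made, this serial form shows the kind of data dependency the abstract says makes good scaling difficult.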