Efficient shared memory and RDMA based collectives on multi-rail QsNetII SMP clusters

Bibliographic Details
Published in	Cluster computing Vol. 11; no. 4; pp. 341 - 354
Main Authors	Qian, Ying; Afsahi, Ahmad
Format	Journal Article
Language	English
Published	Boston: Springer US, 01.12.2008; Springer Nature B.V.
Subjects	Algorithms; Clusters; Collective communications; Communication; Computer Communication Networks; Computer Science; Messages; Multi-rail networks; Multiprocessing; Operating Systems; Processor Architectures; Rail transportation; RDMA; Shared-memory; SMP
ISSN	1386-7857 1573-7543
DOI	10.1007/s10586-008-0065-8

Summary:	Clusters of Symmetric Multiprocessors (SMPs) are more commonplace than ever in high-performance computing, and scientific applications running on them employ collective communications extensively. Shared-memory communication and Remote Direct Memory Access (RDMA) over multi-rail networks are promising approaches for addressing the increasing demand on intra-node and inter-node communication, and thereby for boosting the performance of collectives on emerging multi-core SMP clusters. This paper designs and evaluates two classes of collective communication algorithms, implemented directly at the Elan user level over multi-rail Quadrics QsNetII with message striping: 1) traditional RDMA-based multi-port algorithms for the gather, all-gather, and all-to-all collectives for medium to large messages, and 2) RDMA-based, SMP-aware multi-port all-gather algorithms for small to medium messages. The multi-port RDMA-based Direct algorithms for the gather and all-to-all collectives achieve improvement factors of up to 2.15 for 4 KB messages over elan_gather() and up to 2.26 for 2 KB messages over elan_alltoall(), respectively. For all-gather, our SMP-aware Bruck algorithm outperforms all other all-gather algorithms, including elan_gather(), for 512 B to 8 KB messages, with an improvement factor of 1.96 for 4 KB messages. Our multi-port Direct all-gather is the best algorithm for 16 KB to 1 MB messages and outperforms elan_gather() by a factor of 1.49 for 32 KB messages. Experiments with real applications show that the proposed all-gather algorithms can achieve a communication speedup of up to 1.47.
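
Note: for readers unfamiliar with the Bruck all-gather named in the summary, the following is a minimal sketch of the textbook single-port Bruck algorithm, simulated in plain Python. It is not the paper's SMP-aware, multi-port RDMA implementation and uses no Elan calls; the function name `bruck_allgather` and the list-based "ranks" are assumptions made purely for illustration.

```python
def bruck_allgather(blocks):
    """Simulate the textbook Bruck all-gather (illustrative sketch only).

    blocks[r] is the single block initially owned by simulated rank r.
    Returns (result, steps), where result[r] is the ordered list of all
    blocks as seen by rank r, and steps is the number of exchange rounds.
    """
    p = len(blocks)
    buf = [[b] for b in blocks]  # each rank starts with only its own block
    dist = 1
    steps = 0
    while dist < p:
        # In the last round only the still-missing blocks are transferred.
        count = min(dist, p - dist)
        # Snapshot what every rank sends, so all exchanges in a round
        # behave as if they happened simultaneously.
        outgoing = [buf[r][:count] for r in range(p)]
        for r in range(p):
            src = (r + dist) % p  # rank r receives from rank r + dist
            buf[r].extend(outgoing[src])
        dist *= 2
        steps += 1
    # Rank r now holds blocks r, r+1, ..., r+p-1 (mod p); rotate so that
    # block 0 comes first on every rank.
    result = [buf[r][p - r:] + buf[r][:p - r] for r in range(p)]
    return result, steps


if __name__ == "__main__":
    gathered, rounds = bruck_allgather(["a", "b", "c", "d", "e"])
    print(rounds, gathered[3])
```

The sketch completes in ceil(log2 p) rounds, which is why Bruck-style algorithms tend to suit small to medium messages where per-message latency dominates; the message-size ranges quoted in the summary come from the paper's own measurements, not from this simulation.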