Accelerating distributed deep neural network training with pipelined MPI allreduce

Bibliographic Details
Published in: Cluster Computing, Vol. 24, no. 4, pp. 3797-3813
Main Authors: Castelló, Adrián; Quintana-Ortí, Enrique S.; Duato, José
Format: Journal Article
Language: English
Published: New York: Springer US (Springer Nature B.V.), 01.12.2021
ISSN: 1386-7857, 1573-7543
DOI: 10.1007/s10586-021-03370-9

More Information
Summary: TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool to train deep neural networks on clusters of computers. HVD in turn utilizes a blocking Allreduce primitive to share information among processes, combined with a communication thread to overlap communication with computation. In this work, we perform a thorough experimental analysis to expose (1) the importance of selecting the best algorithm in MPI libraries to realize the Allreduce operation; and (2) the performance acceleration that can be attained when replacing a blocking Allreduce with its non-blocking counterpart (while maintaining the blocking behaviour via the appropriate synchronization mechanism). Furthermore, (3) we explore the benefits of applying pipelining to the communication exchange, demonstrating that these improvements carry over to distributed training via TF+HVD. Finally, (4) we show that pipelining can also boost performance for applications that make heavy use of other collectives, such as Broadcast and Reduce-Scatter.
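The techniques in points (2) and (3) of the summary can be illustrated with a short C/MPI sketch. The code below is not taken from the paper; the segment size, function names, and the use of MPI_SUM over float data are illustrative assumptions. It replaces a single blocking MPI_Allreduce over the full message with per-segment non-blocking MPI_Iallreduce calls followed by one wait, which preserves the blocking semantics the caller expects while allowing the MPI library to pipeline the communication of consecutive segments.

/* Minimal sketch (not the authors' code): pipelined allreduce built from
 * non-blocking MPI_Iallreduce calls over fixed-size segments. */
#include <mpi.h>
#include <stdlib.h>

#define SEG 262144  /* elements per pipeline segment; hypothetical tuning value */

/* Baseline: one blocking call over the whole buffer. */
void allreduce_blocking(const float *sendbuf, float *recvbuf, int count)
{
    MPI_Allreduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
}

/* Pipelined variant: the buffer is split into segments, each segment is
 * reduced with a non-blocking collective, and a single MPI_Waitall at the
 * end restores the blocking behaviour while segments remain in flight. */
void allreduce_pipelined(const float *sendbuf, float *recvbuf, int count)
{
    int nseg = (count + SEG - 1) / SEG;
    MPI_Request *reqs = malloc(nseg * sizeof(MPI_Request));

    for (int s = 0; s < nseg; s++) {
        int offset = s * SEG;
        int len = (offset + SEG <= count) ? SEG : count - offset;
        MPI_Iallreduce(sendbuf + offset, recvbuf + offset, len,
                       MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD, &reqs[s]);
    }
    MPI_Waitall(nseg, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}

Point (1) of the summary, selecting the Allreduce algorithm inside the MPI library, is orthogonal to this sketch; with Open MPI, for instance, the algorithm used by the tuned collective component can be chosen through its MCA parameters.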