Accelerating distributed deep neural network training with pipelined MPI allreduce

Bibliographic Details
Published in: Cluster Computing, Vol. 24, no. 4, pp. 3797-3813
Main Authors: Castelló, Adrián; Quintana-Ortí, Enrique S.; Duato, José
Format: Journal Article
Language: English
Published: New York: Springer US (Springer Nature B.V.), 01.12.2021
ISSN: 1386-7857, 1573-7543
DOI: 10.1007/s10586-021-03370-9

More Information
Summary: TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool to train deep neural networks on clusters of computers. HVD in turn utilizes a blocking Allreduce primitive to share information among processes, combined with a communication thread to overlap communication with computation. In this work, we perform a thorough experimental analysis to expose (1) the importance of selecting the best algorithm in MPI libraries to realize the Allreduce operation; and (2) the performance acceleration that can be attained when replacing a blocking Allreduce with its non-blocking counterpart (while maintaining the blocking behaviour via the appropriate synchronization mechanism). Furthermore, (3) we explore the benefits of applying pipelining to the communication exchange, demonstrating that these improvements carry over to distributed training via TF+HVD. Finally, (4) we show that pipelining can also boost performance for applications that make heavy use of other collectives, such as Broadcast and Reduce-Scatter.
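The techniques in points (2) and (3) of the summary can be illustrated with a short C/MPI sketch. The code below is not taken from the paper; the segment size, function names, and the use of MPI_SUM over float data are illustrative assumptions. It replaces a single blocking MPI_Allreduce over the full message with per-segment non-blocking MPI_Iallreduce calls followed by one wait, which preserves the blocking semantics the caller expects while allowing the MPI library to pipeline the communication of consecutive segments.

/* Minimal sketch (not the authors' code): pipelined allreduce built from
 * non-blocking MPI_Iallreduce calls over fixed-size segments. */
#include <mpi.h>
#include <stdlib.h>

#define SEG 262144  /* elements per pipeline segment; hypothetical tuning value */

/* Baseline: one blocking call over the whole buffer. */
void allreduce_blocking(const float *sendbuf, float *recvbuf, int count)
{
    MPI_Allreduce(sendbuf, recvbuf, count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
}

/* Pipelined variant: the buffer is split into segments, each segment is
 * reduced with a non-blocking collective, and a single MPI_Waitall at the
 * end restores the blocking behaviour while segments remain in flight. */
void allreduce_pipelined(const float *sendbuf, float *recvbuf, int count)
{
    int nseg = (count + SEG - 1) / SEG;
    MPI_Request *reqs = malloc(nseg * sizeof(MPI_Request));

    for (int s = 0; s < nseg; s++) {
        int offset = s * SEG;
        int len = (offset + SEG <= count) ? SEG : count - offset;
        MPI_Iallreduce(sendbuf + offset, recvbuf + offset, len,
                       MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD, &reqs[s]);
    }
    MPI_Waitall(nseg, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}

Point (1) of the summary, selecting the Allreduce algorithm inside the MPI library, is orthogonal to this sketch; with Open MPI, for instance, the algorithm used by the tuned collective component can be chosen through its MCA parameters.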