SUARA: A scalable universal allreduce communication algorithm for acceleration of parallel deep learning applications

Bibliographic Details
Published in Journal of parallel and distributed computing Vol. 183; p. 104767
Main Authors Nuriyev, Emin; Manumachu, Ravi Reddy; Aseeri, Samar; Verma, Mahendra K.; Lastovetsky, Alexey L.
Format Journal Article
Language English
Published Elsevier Inc 01.01.2024
ISSN 0743-7315, 1096-0848
DOI 10.1016/j.jpdc.2023.104767

Summary: Parallel and distributed deep learning (PDNN) has become an effective strategy to reduce the long training times of large-scale deep neural networks. Mainstream PDNN software packages based on the message-passing interface (MPI) and employing synchronous stochastic gradient descent rely crucially on the performance of the MPI allreduce collective communication routine. In this work, we propose a novel scalable universal allreduce meta-algorithm called SUARA. In general, SUARA consists of L serial steps, where L≥2, executed by all MPI processes involved in the allreduce operation. At each step, SUARA partitions this set of processes into subsets, which execute optimally selected library allreduce algorithms to solve sub-allreduce problems on these subsets in parallel; the whole allreduce operation is accomplished once all L steps are complete. We then design, theoretically study, and implement a two-step SUARA (L=2) called SUARA2 on top of the Open MPI library. We prove that the theoretical asymptotic speedup of SUARA2 executed by P processes over the base Open MPI routine is O(P). Our experiments on the Shaheen-II supercomputer employing 1024 nodes demonstrate over 2x speedup of SUARA2 over the native Open MPI allreduce routine, which translates into a 9% performance improvement in training the ResNet-50 DNN on ImageNet.

• A novel scalable universal allreduce collective algorithm called SUARA.
• An optimized Open MPI SUARA implementation, SUARA2, with speedup O(P).
• 2x practical speedup of SUARA2 over native Open MPI allreduce for P=1024 processes.
• Performance improvement of training of ResNet-50 DNN on ImageNet by 9% using SUARA2.
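
The two-step idea in the summary can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: it assumes the P processes are arranged as a rows × cols grid, that the step-1 subsets are the rows and the step-2 subsets are the columns (the summary does not say how SUARA2 chooses its partitions), and that the reduction is a sum. The names `allreduce_sum` and `two_step_allreduce` are hypothetical, and `allreduce_sum` merely stands in for whichever library allreduce algorithm would be selected for a subset.

```python
from typing import List


def allreduce_sum(values: List[int]) -> List[int]:
    """Hypothetical stand-in for a library allreduce on one subset:
    every participant ends up holding the sum of all contributions."""
    total = sum(values)
    return [total] * len(values)


def two_step_allreduce(values: List[int], rows: int, cols: int) -> List[int]:
    """Simulate a two-step (L=2) allreduce over rows*cols processes.

    Assumption: process p sits at grid position (p // cols, p % cols).
    values[p] is the initial contribution of process p.
    """
    if rows * cols != len(values):
        raise ValueError("rows * cols must equal the number of processes")
    buf = list(values)

    # Step 1: partition into rows; each row solves its own sub-allreduce.
    # On a real machine these sub-allreduces would run in parallel.
    for r in range(rows):
        members = [r * cols + c for c in range(cols)]
        reduced = allreduce_sum([buf[p] for p in members])
        for p, v in zip(members, reduced):
            buf[p] = v

    # Step 2: partition into columns; each column combines the row sums,
    # so every process ends up with the global sum.
    for c in range(cols):
        members = [r * cols + c for r in range(rows)]
        reduced = allreduce_sum([buf[p] for p in members])
        for p, v in zip(members, reduced):
            buf[p] = v

    return buf


if __name__ == "__main__":
    # 12 simulated processes contributing 1..12 on a 3 x 4 grid.
    print(two_step_allreduce(list(range(1, 13)), rows=3, cols=4))
```

After step 1 every process holds its row's partial sum, and after step 2 every process holds the sum over all rows, which is the same result a single allreduce over all P processes would produce.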