SUARA: A scalable universal allreduce communication algorithm for acceleration of parallel deep learning applications
| Published in | Journal of parallel and distributed computing Vol. 183; p. 104767 |
|---|---|
| Main Authors | , , , , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier Inc, 01.01.2024 |
| ISSN | 0743-7315, 1096-0848 |
| DOI | 10.1016/j.jpdc.2023.104767 |
| Summary: | Parallel and distributed deep learning (PDNN) has become an effective strategy to reduce the long training times of large-scale deep neural networks. Mainstream PDNN software packages based on the message-passing interface (MPI) and employing synchronous stochastic gradient descent rely crucially on the performance of the MPI allreduce collective communication routine.
In this work, we propose a novel scalable universal allreduce meta-algorithm called SUARA. In general, SUARA consists of L serial steps, where L≥2, executed by all MPI processes involved in the allreduce operation. At each step, SUARA partitions this set of processes into subsets, which execute optimally selected library allreduce algorithms in parallel to solve sub-allreduce problems on these subsets; the whole allreduce operation is accomplished once all L steps are complete. We then design, theoretically study, and implement a two-step SUARA (L=2), called SUARA2, on top of the Open MPI library. We prove that the theoretical asymptotic speedup of SUARA2 executed by P processes over the base Open MPI routine is O(P). Our experiments on the Shaheen-II supercomputer employing 1024 nodes demonstrate over 2x speedup of SUARA2 over the native Open MPI allreduce routine, which translates into a 9% performance improvement in training the ResNet-50 DNN on ImageNet.
• A novel scalable universal allreduce collective algorithm called SUARA.
• An optimized Open MPI SUARA implementation, SUARA2, with speedup O(P).
• 2x practical speedup of SUARA2 over the native Open MPI allreduce for P=1024 processes.
• A 9% performance improvement in training the ResNet-50 DNN on ImageNet using SUARA2. |
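
The summary describes an allreduce built from serial steps, each of which partitions the processes into subsets that run sub-allreduces in parallel. The sketch below illustrates one simple way such a two-step (L=2) scheme can produce the global result. It is not the SUARA2 implementation: the row/column grid partition, the pure-Python simulation without MPI, and the names `allreduce_group` and `two_step_allreduce` are all assumptions made for illustration, and the record does not say how SUARA2 chooses its subsets or library algorithms.

```python
from typing import List

Vector = List[float]


def allreduce_group(buffers: List[Vector], group: List[int]) -> None:
    """Simulate a sum-allreduce among the ranks listed in `group`.

    Afterwards every rank in the group holds the element-wise sum of the
    group's vectors. A real library would pick an algorithm here
    (ring, recursive doubling, ...); this stand-in just sums directly.
    """
    length = len(buffers[group[0]])
    total = [sum(buffers[rank][i] for rank in group) for i in range(length)]
    for rank in group:
        buffers[rank] = list(total)


def two_step_allreduce(buffers: List[Vector], cols: int) -> None:
    """Hypothetical two-step allreduce over a rows x cols grid of ranks.

    Assumption: rank = row * cols + col. This layout is illustrative only.
    """
    p = len(buffers)
    if cols <= 0 or p % cols != 0:
        raise ValueError("number of processes must be a multiple of cols")
    rows = p // cols

    # Step 1: partition the ranks into rows; each row runs its own
    # sub-allreduce, so every rank ends up with its row's partial sum.
    for r in range(rows):
        allreduce_group(buffers, [r * cols + c for c in range(cols)])

    # Step 2: partition the ranks into columns; each column contains one
    # rank per row, so summing the row partial sums gives the global sum.
    for c in range(cols):
        allreduce_group(buffers, [r * cols + c for r in range(rows)])


if __name__ == "__main__":
    # Six simulated processes, each contributing [rank, 1.0].
    data = [[float(rank), 1.0] for rank in range(6)]
    two_step_allreduce(data, cols=3)
    print(data[0])  # [15.0, 6.0]
```

In this sketch the sub-allreduces within a step are independent of one another, which is what would let them run in parallel on a real machine; the loops here execute them one after another only because the example is a single-process simulation.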