Optimization-Based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning
| Published in | IEEE Transactions on Signal Processing, Vol. 71, pp. 1023-1038 |
|---|---|
| Main Authors | , , , , |
| Format | Journal Article |
| Language | English |
| Published | New York, IEEE, 2023; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| ISSN | 1053-587X 1941-0476 |
| DOI | 10.1109/TSP.2023.3244084 |
| Summary: | Gradient coding schemes effectively mitigate full stragglers in distributed learning by introducing identical redundancy in coded local partial derivatives corresponding to all model parameters. However, they are no longer effective for partial stragglers, as they cannot utilize incomplete computation results from partial stragglers. This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning. Specifically, we consider a distributed system consisting of one master and $N$ workers, characterized by a general partial straggler model, and focus on solving a general large-scale machine learning problem with $L$ model parameters using gradient coding. First, we propose a coordinate gradient coding scheme with $L$ coding parameters representing $L$ possibly different diversities for the $L$ coordinates, which generalizes most gradient coding schemes. Then, we consider the minimization of the expected overall runtime and the maximization of the completion probability with respect to the $L$ coding parameters for coordinates, which are challenging discrete optimization problems. To reduce computational complexity, we first transform each to an equivalent but much simpler discrete problem with $N \ll L$ variables representing the partition of the $L$ coordinates into $N$ blocks, each with identical redundancy. This indicates an equivalent but more easily implemented block coordinate gradient coding scheme with $N$ coding parameters for blocks. Then, we adopt continuous relaxation to further reduce computational complexity. For the resulting minimization of the expected overall runtime, we develop an iterative algorithm of computational complexity $\mathcal{O}(N^{2})$ to obtain an optimal solution and derive two closed-form approximate solutions, both with computational complexity $\mathcal{O}(N)$. For the resulting maximization of the completion probability, we develop an iterative algorithm of computational complexity $\mathcal{O}(N^{2})$ to obtain a stationary point and derive a closed-form approximate solution with computational complexity $\mathcal{O}(N)$ at a large threshold. Finally, numerical results show that the proposed solutions significantly outperform existing coded computation schemes and their extensions. |
|---|---|
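The block partition described in the summary can be illustrated with a small simulation. The following is a minimal sketch under assumptions that are not taken from this record: block `s` (for `s = 0, ..., N-1`) is assumed to tolerate `s` stragglers at a cost of `s + 1` work units per coordinate for every worker, workers are assumed to process blocks in increasing order of `s`, and each worker's time per work unit is drawn from a shifted exponential distribution. The function names `overall_runtime` and `expected_runtime` are hypothetical, and the paper's actual straggler model and runtime expression may differ.

```python
import random


def overall_runtime(block_sizes, unit_times):
    """Runtime of one realization under the assumed model (not the paper's exact one).

    block_sizes[s] is the number of coordinates assigned to block s, and
    unit_times[n] is worker n's time per work unit.
    """
    n_workers = len(unit_times)
    if len(block_sizes) != n_workers:
        raise ValueError("expected one block size per worker")
    sorted_times = sorted(unit_times)  # fastest worker first
    cumulative_work = 0.0
    runtime = 0.0
    for s, size in enumerate(block_sizes):
        # Assumption: tolerating s stragglers costs (s + 1) units per coordinate.
        cumulative_work += (s + 1) * size
        if size > 0:
            # Assumption: block s is recovered once the (N - s) fastest
            # workers have finished all their work up to and including block s.
            slowest_needed = sorted_times[n_workers - s - 1]
            runtime = max(runtime, slowest_needed * cumulative_work)
    return runtime


def expected_runtime(block_sizes, n_trials=2000, seed=0):
    """Monte Carlo estimate of the expected runtime for a given partition."""
    rng = random.Random(seed)
    n_workers = len(block_sizes)
    total = 0.0
    for _ in range(n_trials):
        # Assumed straggler model: shifted exponential time per work unit.
        times = [1.0 + rng.expovariate(1.0) for _ in range(n_workers)]
        total += overall_runtime(block_sizes, times)
    return total / n_trials


if __name__ == "__main__":
    # Compare three ways of partitioning L = 12 coordinates among N = 3 blocks.
    for partition in ([12, 0, 0], [0, 0, 12], [6, 4, 2]):
        print(partition, round(expected_runtime(partition), 3))
```

In this sketch, `[12, 0, 0]` corresponds to no redundancy (the master waits for the slowest worker) and `[0, 0, 12]` to full replication (the master waits only for the fastest worker, who does three times the work). Searching over the `N` block sizes rather than `L` per-coordinate parameters is the reduction the summary describes.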