Optimization-Based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning

Bibliographic Details
Published in: IEEE Transactions on Signal Processing, Vol. 71, pp. 1023-1038
Main Authors: Wang, Qi; Cui, Ying; Li, Chenglin; Zou, Junni; Xiong, Hongkai
Format: Journal Article
Language: English
Published: New York: IEEE, 2023
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 1053-587X, 1941-0476
DOI: 10.1109/TSP.2023.3244084


More Information
Summary: Gradient coding schemes effectively mitigate full stragglers in distributed learning by introducing identical redundancy in the coded local partial derivatives corresponding to all model parameters. However, they are no longer effective for partial stragglers, as they cannot utilize incomplete computation results from partial stragglers. This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning. Specifically, we consider a distributed system consisting of one master and N workers, characterized by a general partial straggler model, and focus on solving a general large-scale machine learning problem with L model parameters using gradient coding. First, we propose a coordinate gradient coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which includes most existing gradient coding schemes as special cases. Then, we consider the minimization of the expected overall runtime and the maximization of the completion probability with respect to the L coding parameters for the coordinates, both of which are challenging discrete optimization problems.
To reduce computational complexity, we first transform each problem into an equivalent but much simpler discrete problem with N ≪ L variables representing the partition of the L coordinates into N blocks, each with identical redundancy. This yields an equivalent but more easily implemented block coordinate gradient coding scheme with N coding parameters for the blocks. We then adopt continuous relaxation to further reduce computational complexity. For the resulting minimization of the expected overall runtime, we develop an iterative algorithm of computational complexity O(N^2) to obtain an optimal solution and derive two closed-form approximate solutions, both with computational complexity O(N). For the resulting maximization of the completion probability, we develop an iterative algorithm of computational complexity O(N^2) to obtain a stationary point and derive a closed-form approximate solution with computational complexity O(N) at a large threshold. Finally, numerical results show that the proposed solutions significantly outperform existing coded computation schemes and their extensions.
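The core idea of the block scheme, partitioning the L coordinates into N blocks and assigning each block its own redundancy across workers, can be sketched as below. This is a minimal illustration using simple repetition-style replication with cyclic placement; the function names and the replication rule are assumptions for illustration, not the paper's exact code construction or optimization algorithms.

```python
def assign_blocks(num_workers, block_redundancy):
    """Replicate block n on block_redundancy[n] consecutive workers (cyclic placement).

    Returns a dict mapping each worker to the list of blocks it stores.
    Higher redundancy for a block means more workers can supply its
    partial derivatives, so that block tolerates more stragglers.
    """
    assignment = {w: [] for w in range(num_workers)}
    for block, r in enumerate(block_redundancy):
        for k in range(r):
            assignment[(block + k) % num_workers].append(block)
    return assignment


def recoverable_blocks(assignment, finished_workers):
    """Under replication, a block's gradient is recoverable once any worker
    holding that block has finished its computation."""
    done = set()
    for w in finished_workers:
        done.update(assignment[w])
    return done


# Example: 4 workers, 4 blocks; blocks 0 and 1 carry redundancy 2,
# blocks 2 and 3 carry redundancy 1.
assignment = assign_blocks(4, [2, 2, 1, 1])
# If only workers 0 and 1 finish, the redundant blocks 0 and 1 are
# still recoverable, while blocks 2 and 3 are not.
print(recoverable_blocks(assignment, {0, 1}))
```

The paper's contribution is choosing the N per-block redundancies (equivalently, the partition sizes) optimally, via the O(N^2) iterative algorithms and O(N) closed-form approximations described above, rather than using a fixed uniform redundancy.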