Staleness Analysis in Asynchronous Optimization
| Published in | IEEE Transactions on Signal and Information Processing over Networks, Vol. 8, pp. 404-420 |
|---|---|
| Main Authors | |
| Format | Journal Article |
| Language | English |
| Published | Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2022 |
| ISSN | 2373-776X; 2373-7778 |
| DOI | 10.1109/TSIPN.2022.3163931 |
Summary: Distributed optimization is widely used to solve large-scale optimization problems by parallelizing gradient-based algorithms across multiple computing nodes. In asynchronous optimization, the optimization parameter is updated using stale gradients, i.e., gradients computed with respect to outdated parameters. Although large degrees of staleness can slow convergence, little is known about the impact of staleness and its relation to other system parameters. In this work, we analyze asynchronous optimization implemented on either hub-and-spoke or shared-memory architectures. We show that the process of gradient arrivals at the master node is similar in nature to a renewal process, and we derive the bandwidth requirement of the system. For the hub-and-spoke setup, we derive bounds on the expected gradient staleness and show its connection to other system parameters such as the number of workers, the expected compute time, and the communication delays. Our derivations reveal that gradient staleness can be adjusted by tuning parameters such as the minibatch size or the number of workers. For the shared-memory architecture, we show that the expected staleness is equal to the number of workers. Our derivations can be combined with existing convergence analyses to express convergence rates in terms of other known system parameters, which clarifies what factors impact convergence.
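
The shared-memory claim (expected staleness equal to the number of workers) lends itself to a quick sanity check with a toy event-driven simulation. The sketch below is an illustration under stated assumptions, not the system model analyzed in the paper: workers draw i.i.d. unit-rate exponential compute times, updates are applied one at a time, and the staleness of an update is the number of updates applied between the worker's read and its write. All names and parameters here (`simulate_staleness`, `num_workers`, the compute-time distribution) are invented for this sketch.

```python
import heapq
import random

def simulate_staleness(num_workers: int, num_updates: int, seed: int = 0) -> float:
    """Toy event-driven model of asynchronous parameter updates.

    Each worker reads the current parameter version, computes for a random
    (exponential) time, then applies its update; updates are serialized, as
    in a shared-memory parameter store. The staleness of an update is the
    number of updates applied between its read and its write. This is an
    illustrative sketch, not the paper's model.
    """
    rng = random.Random(seed)
    version = 0  # total number of updates applied so far
    # Event queue entries: (finish_time, worker_id, version_read_at_start)
    events = [(rng.expovariate(1.0), w, 0) for w in range(num_workers)]
    heapq.heapify(events)

    total_staleness = 0
    for _ in range(num_updates):
        finish, worker, read_version = heapq.heappop(events)
        total_staleness += version - read_version  # staleness of this update
        version += 1
        # The worker immediately reads the fresh version and starts anew.
        heapq.heappush(events, (finish + rng.expovariate(1.0), worker, version))
    return total_staleness / num_updates

if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(f"workers={n:3d}  mean staleness ~ {simulate_staleness(n, 100_000):.2f}")
```

Under these assumptions the mean staleness grows linearly with the number of workers (about one less than the worker count with this counting convention), consistent with the abstract's shared-memory result; the exact constant depends on how staleness is counted and on the compute-time distribution.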