A Model-Based Approach to Streamlining Distributed Training for Asynchronous SGD

Bibliographic Details
Published in: 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 306-318
Main Authors: Sung-Han Lin, Marco Paolieri, Cheng-Fu Chou, Leana Golubchik
Format: Conference Proceeding
Language: English
Published: IEEE, 01.09.2018
Subjects: Computational modeling; Distributed Machine Learning; Load modeling; Machine learning; Parallel Scheduling; Queueing Networks; Servers; TensorFlow; Throughput; Time factors; Training
Online Access: https://ieeexplore.ieee.org/document/8526895
ISSN: 2375-0227
EISBN: 9781538668863, 1538668866
DOI: 10.1109/MASCOTS.2018.00037

Abstract: The success of Deep Neural Networks (DNNs) has created significant interest in the development of software tools, hardware architectures, and cloud systems to meet the huge computational demand of their training jobs. A common approach to speeding up an individual job is to distribute training data and computation among multiple nodes, periodically exchanging intermediate results. In this paper, we address two important problems for the application of this strategy to large-scale clusters and multiple, heterogeneous jobs. First, we propose and validate a queueing model to estimate the throughput of a training job as a function of the number of nodes assigned to the job; this model targets asynchronous Stochastic Gradient Descent (SGD), a popular strategy for distributed training, and requires only data from quick, two-node profiling in addition to job characteristics (number of requested training epochs, mini-batch size, size of DNN parameters, assigned bandwidth). Throughput estimations are then used to explore several classes of scheduling heuristics to reduce response time in a scenario where heterogeneous jobs are continuously submitted to a large-scale cluster. These scheduling algorithms dynamically select which jobs to run and how many nodes to assign to each job, based on different trade-offs between service time reduction and efficiency (e.g., speedup per additional node). Heuristics are evaluated through extensive simulations of realistic DNN workloads, also investigating the effects of early termination, a common scenario for DNN training jobs.
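
To make the abstract's inputs concrete, the Python sketch below shows one simple way a throughput estimate and an efficiency-aware node allocation could be computed from the quantities the abstract lists (per-mini-batch compute time from profiling, requested epochs, mini-batch size, DNN parameter size, assigned bandwidth). This is a minimal illustrative stand-in, not the paper's validated queueing model: the saturation formula, the dataset-size field, and all function and field names are assumptions introduced here.

from dataclasses import dataclass
from math import ceil

@dataclass
class JobProfile:
    # Inputs named in the abstract; field names are assumptions.
    epochs: int            # number of requested training epochs
    minibatch_size: int    # samples per mini-batch
    dataset_size: int      # samples per epoch (hypothetical extra input)
    param_bytes: float     # size of DNN parameters, in bytes
    bandwidth: float       # bandwidth assigned to the job, bytes/s
    compute_time: float    # per-mini-batch compute time from profiling, s

def estimate_throughput(job: JobProfile, n_nodes: int) -> float:
    # Stand-in model, NOT the paper's queueing model: each asynchronous
    # worker alternates gradient computation with a parameter exchange
    # (push gradients, pull parameters) that is serialized at a parameter
    # server, so contention grows with the number of workers and the
    # aggregate rate saturates at 1 / transfer_time.
    transfer_time = 2.0 * job.param_bytes / job.bandwidth
    return n_nodes / (job.compute_time + n_nodes * transfer_time)  # mini-batches/s

def service_time(job: JobProfile, n_nodes: int) -> float:
    # Time to process all requested epochs at the estimated rate.
    iterations = job.epochs * ceil(job.dataset_size / job.minibatch_size)
    return iterations / estimate_throughput(job, n_nodes)

def choose_nodes(job: JobProfile, max_nodes: int, min_gain: float) -> int:
    # Greedy allocation in the spirit of the abstract's efficiency-aware
    # heuristics: keep adding nodes while each extra node still yields at
    # least `min_gain` of the one-node throughput (speedup per added node).
    base = estimate_throughput(job, 1)
    n = 1
    while n < max_nodes:
        extra = estimate_throughput(job, n + 1) - estimate_throughput(job, n)
        if extra / base < min_gain:
            break
        n += 1
    return n

# Example: 100 MB of parameters on a 1 GB/s link, 0.2 s per mini-batch.
job = JobProfile(epochs=10, minibatch_size=128, dataset_size=1_000_000,
                 param_bytes=100e6, bandwidth=1e9, compute_time=0.2)
print(choose_nodes(job, max_nodes=32, min_gain=0.25))  # stops at 2 nodes

Under this toy model throughput flattens once communication dominates compute, so the greedy rule stops assigning nodes exactly where the marginal speedup falls below the efficiency threshold, mirroring the service-time-versus-efficiency trade-off the abstract describes.
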
Authors
– Sung-Han Lin (sunghan@usc.edu), Dept. of Comput. Sci., Univ. of Southern California, Los Angeles, CA, USA
– Marco Paolieri (paolieri@usc.edu), Dept. of Comput. Sci., Univ. of Southern California, Los Angeles, CA, USA
– Cheng-Fu Chou (ccf@csie.ntu.edu.tw), NetApp, Inc., Sunnyvale, CA, USA
– Leana Golubchik (leana@usc.edu), Dept. of Comput. Sci., Univ. of Southern California, Los Angeles, CA, USA