A Model-Based Approach to Streamlining Distributed Training for Asynchronous SGD
Published in | 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 306 - 318 |
---|---|
Main Authors | Sung-Han Lin, Marco Paolieri, Cheng-Fu Chou, Leana Golubchik |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.09.2018 |
Subjects | Computational modeling; Distributed Machine Learning; Load modeling; Machine learning; Parallel Scheduling; Queueing Networks; Servers; TensorFlow; Throughput; Time factors; Training |
Online Access | https://ieeexplore.ieee.org/document/8526895 |
ISSN | 2375-0227 |
EISBN | 9781538668863, 1538668866 |
DOI | 10.1109/MASCOTS.2018.00037 |
Abstract | The success of Deep Neural Networks (DNNs) has created significant interest in the development of software tools, hardware architectures, and cloud systems to meet the huge computational demand of their training jobs. A common approach to speeding up an individual job is to distribute training data and computation among multiple nodes, periodically exchanging intermediate results. In this paper, we address two important problems for the application of this strategy to large-scale clusters and multiple, heterogeneous jobs. First, we propose and validate a queueing model to estimate the throughput of a training job as a function of the number of nodes assigned to the job; this model targets asynchronous Stochastic Gradient Descent (SGD), a popular strategy for distributed training, and requires only data from quick, two-node profiling in addition to job characteristics (number of requested training epochs, mini-batch size, size of DNN parameters, assigned bandwidth). Throughput estimations are then used to explore several classes of scheduling heuristics to reduce response time in a scenario where heterogeneous jobs are continuously submitted to a large-scale cluster. These scheduling algorithms dynamically select which jobs to run and how many nodes to assign to each job, based on different trade-offs between service time reduction and efficiency (e.g., speedup per additional node). Heuristics are evaluated through extensive simulations of realistic DNN workloads, also investigating the effects of early termination, a common scenario for DNN training jobs. |
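As a rough illustration of the approach the abstract describes, the Python sketch below combines its two building blocks: a throughput estimate as a function of the number of assigned nodes, and an efficiency-aware node-allocation heuristic. This is not the paper's validated queueing model; it assumes a simple closed form (each worker pipelines gradient computation and parameter exchange, while the parameter server's link caps the aggregate update rate) and a greedy scheduler that hands out nodes by marginal speedup, echoing the "speedup per additional node" trade-off mentioned in the abstract. All class names, fields, and numeric values are hypothetical.

```python
# Illustrative sketch only: a crude stand-in for the paper's queueing
# model of asynchronous SGD, plus a greedy node-allocation heuristic.
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    compute_time: float       # seconds per mini-batch on one node (from profiling)
    param_bytes: float        # size of DNN parameters exchanged per update
    bandwidth: float          # bytes/s on each worker's link
    server_bandwidth: float   # bytes/s at the parameter server

    def throughput(self, n: int) -> float:
        """Estimated parameter updates/s with n worker nodes.

        Assumption: workers overlap compute and transfer, so each worker
        sustains one update per (compute_time + transfer_time); the
        parameter server caps the aggregate rate once its link saturates.
        """
        if n <= 0:
            return 0.0
        per_worker = 1.0 / (self.compute_time + self.param_bytes / self.bandwidth)
        server_cap = self.server_bandwidth / self.param_bytes
        return min(n * per_worker, server_cap)


def greedy_allocate(jobs, total_nodes, min_gain=0.05):
    """Assign nodes one at a time to the job with the largest marginal
    throughput gain, skipping assignments whose relative speedup per
    extra node falls below `min_gain` (an efficiency threshold)."""
    alloc = {job.name: 0 for job in jobs}
    for _ in range(total_nodes):
        best, best_gain = None, 0.0
        for job in jobs:
            cur = job.throughput(alloc[job.name])
            gain = job.throughput(alloc[job.name] + 1) - cur
            # relative efficiency of the extra node
            rel = gain / cur if cur > 0 else float("inf")
            if gain > best_gain and rel >= min_gain:
                best, best_gain = job, gain
        if best is None:  # no remaining assignment passes the efficiency test
            break
        alloc[best.name] += 1
    return alloc


if __name__ == "__main__":
    jobs = [
        Job("small-cnn", compute_time=0.05, param_bytes=5e7,
            bandwidth=1.25e9, server_bandwidth=1.25e9),
        Job("large-lstm", compute_time=0.30, param_bytes=4e8,
            bandwidth=1.25e9, server_bandwidth=1.25e9),
    ]
    print(greedy_allocate(jobs, total_nodes=16))
```

Under these made-up parameters the heuristic assigns three nodes to the small job and two to the large one, then stops: additional nodes would add no throughput once each job's parameter server saturates, which is the kind of efficiency cut-off the paper's scheduling heuristics trade against response time.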