A Model-Based Approach to Streamlining Distributed Training for Asynchronous SGD

Bibliographic Details
Published in: 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 306-318
Main Authors: Sung-Han Lin, Marco Paolieri, Cheng-Fu Chou, Leana Golubchik
Format: Conference Proceeding
Language: English
Published: IEEE, 01.09.2018
Subjects: Computational modeling; Distributed Machine Learning; Load modeling; Machine learning; Parallel Scheduling; Queueing Networks; Servers; TensorFlow; Throughput; Time factors; Training
Online Access: https://ieeexplore.ieee.org/document/8526895
ISSN: 2375-0227
EISBN: 9781538668863, 1538668866
DOI: 10.1109/MASCOTS.2018.00037

Abstract: The success of Deep Neural Networks (DNNs) has created significant interest in the development of software tools, hardware architectures, and cloud systems to meet the huge computational demand of their training jobs. A common approach to speeding up an individual job is to distribute training data and computation among multiple nodes, periodically exchanging intermediate results. In this paper, we address two important problems for the application of this strategy to large-scale clusters and multiple, heterogeneous jobs. First, we propose and validate a queueing model to estimate the throughput of a training job as a function of the number of nodes assigned to the job; this model targets asynchronous Stochastic Gradient Descent (SGD), a popular strategy for distributed training, and requires only data from quick, two-node profiling in addition to job characteristics (number of requested training epochs, mini-batch size, size of DNN parameters, assigned bandwidth). Throughput estimations are then used to explore several classes of scheduling heuristics to reduce response time in a scenario where heterogeneous jobs are continuously submitted to a large-scale cluster. These scheduling algorithms dynamically select which jobs to run and how many nodes to assign to each job, based on different trade-offs between service time reduction and efficiency (e.g., speedup per additional node). Heuristics are evaluated through extensive simulations of realistic DNN workloads, also investigating the effects of early termination, a common scenario for DNN training jobs.
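
To make the abstract's inputs concrete, the Python sketch below shows one simple way a throughput estimate and an efficiency-aware node allocation could be computed from the quantities the abstract lists (per-mini-batch compute time from profiling, requested epochs, mini-batch size, DNN parameter size, assigned bandwidth). This is a minimal illustrative stand-in, not the paper's validated queueing model: the saturation formula, the dataset-size field, and all function and field names are assumptions introduced here.

from dataclasses import dataclass
from math import ceil

@dataclass
class JobProfile:
    # Inputs named in the abstract; field names are assumptions.
    epochs: int            # number of requested training epochs
    minibatch_size: int    # samples per mini-batch
    dataset_size: int      # samples per epoch (hypothetical extra input)
    param_bytes: float     # size of DNN parameters, in bytes
    bandwidth: float       # bandwidth assigned to the job, bytes/s
    compute_time: float    # per-mini-batch compute time from profiling, s

def estimate_throughput(job: JobProfile, n_nodes: int) -> float:
    # Stand-in model, NOT the paper's queueing model: each asynchronous
    # worker alternates gradient computation with a parameter exchange
    # (push gradients, pull parameters) that is serialized at a parameter
    # server, so contention grows with the number of workers and the
    # aggregate rate saturates at 1 / transfer_time.
    transfer_time = 2.0 * job.param_bytes / job.bandwidth
    return n_nodes / (job.compute_time + n_nodes * transfer_time)  # mini-batches/s

def service_time(job: JobProfile, n_nodes: int) -> float:
    # Time to process all requested epochs at the estimated rate.
    iterations = job.epochs * ceil(job.dataset_size / job.minibatch_size)
    return iterations / estimate_throughput(job, n_nodes)

def choose_nodes(job: JobProfile, max_nodes: int, min_gain: float) -> int:
    # Greedy allocation in the spirit of the abstract's efficiency-aware
    # heuristics: keep adding nodes while each extra node still yields at
    # least `min_gain` of the one-node throughput (speedup per added node).
    base = estimate_throughput(job, 1)
    n = 1
    while n < max_nodes:
        extra = estimate_throughput(job, n + 1) - estimate_throughput(job, n)
        if extra / base < min_gain:
            break
        n += 1
    return n

# Example: 100 MB of parameters on a 1 GB/s link, 0.2 s per mini-batch.
job = JobProfile(epochs=10, minibatch_size=128, dataset_size=1_000_000,
                 param_bytes=100e6, bandwidth=1e9, compute_time=0.2)
print(choose_nodes(job, max_nodes=32, min_gain=0.25))  # stops at 2 nodes

Under this toy model throughput flattens once communication dominates compute, so the greedy rule stops assigning nodes exactly where the marginal speedup falls below the efficiency threshold, mirroring the service-time-versus-efficiency trade-off the abstract describes.
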
Authors
– Sung-Han Lin (sunghan@usc.edu), Dept. of Comput. Sci., Univ. of Southern California, Los Angeles, CA, USA
– Marco Paolieri (paolieri@usc.edu), Dept. of Comput. Sci., Univ. of Southern California, Los Angeles, CA, USA
– Cheng-Fu Chou (ccf@csie.ntu.edu.tw), NetApp, Inc., Sunnyvale, CA, USA
– Leana Golubchik (leana@usc.edu), Dept. of Comput. Sci., Univ. of Southern California, Los Angeles, CA, USA