Free launch: Optimizing GPU dynamic kernel launches through thread reuse

Bibliographic Details
Published in: 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 407-419 (13 pages)
Main Authors: Guoyang Chen, Xipeng Shen
Format: Conference Proceeding
Language: English
Published: ACM, 01.12.2015
Subjects: Compiler; Dynamic Parallelism; GPU; Graphics processing units; Hardware; Instruction sets; Kernel; Optimization; Parallel processing; Runtime; Runtime Adaptation; Thread Reuse
Online Access: https://ieeexplore.ieee.org/document/7856615
ISSN: 2379-3155
EISSN: 2379-3155
EISBN: 1450340342, 9781450340342
DOI: 10.1145/2830772.2830818

Abstract
Supporting dynamic parallelism is important for GPUs to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPUs: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkernel launches. Neither is satisfactory. The former is complicated to program and is often subject to load imbalance; the latter suffers from large runtime overhead. In this work, we propose free launch, a new software approach to overcoming the shortcomings of both methods. It allows programmers to use subkernel launches to express dynamic parallelism. It employs a novel compiler-based code transformation named subkernel launch removal to replace the subkernel launches with the reuse of parent threads. Coupled with an adaptive task assignment mechanism, the transformation reassigns the tasks in the subkernels to the parent threads with a good load balance. The technique requires no hardware extensions and is immediately deployable on existing GPUs. It keeps the programming convenience of the subkernel launch-based approach while avoiding its large runtime overhead. Meanwhile, its superior load balancing makes it outperform manual worklist-based techniques by 3X on average.
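
The two styles contrasted in the abstract can be sketched in CUDA. The sketch below is illustrative only and is not the paper's compiler-generated code: parent_dp shows the dynamic-parallelism style, where each parent thread launches a child grid for the work it discovers, while parent_reuse mimics the thread-reuse idea by pushing discovered tasks into a worklist that the parent threads then drain. The names (Task, g_worklist, g_head, g_tail, process_subtask), the fixed task size, and the single-block setup are assumptions made for this example; the actual subkernel launch removal transformation and its adaptive task assignment are more general.

// Minimal CUDA sketch, for illustration only; not the output of the paper's
// subkernel launch removal transformation. Task, g_worklist, g_head, g_tail,
// and process_subtask are hypothetical names, and task sizes are fixed.
// Compile with: nvcc -rdc=true -arch=sm_60 -lcudadevrt free_launch_sketch.cu
#include <cstdio>
#include <cuda_runtime.h>

struct Task { int start, len; };                 // a chunk of dynamically discovered work

__device__ Task g_worklist[1 << 16];             // worklist filled by parent threads
__device__ int  g_tail = 0;                      // number of tasks pushed so far
__device__ int  g_head = 0;                      // next task index to be claimed
__device__ unsigned long long g_sum = 0;         // stand-in for the real output

__device__ void process_subtask(int item) {
    atomicAdd(&g_sum, (unsigned long long)item); // placeholder for per-item child work
}

// (a) Dynamic-parallelism style: each parent thread launches a child grid for the
//     work it discovers; every launch pays the subkernel-launch runtime overhead.
__global__ void child_kernel(Task t) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < t.len) process_subtask(t.start + i);
}

__global__ void parent_dp(int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    Task t = { tid * 8, 8 };                     // pretend this thread found 8 items
    child_kernel<<<(t.len + 63) / 64, 64>>>(t);  // needs compute capability >= 3.5
}

// (b) Thread-reuse style in the spirit of free launch: no child launches. Parent
//     threads publish their tasks, then keep grabbing tasks until the worklist is
//     drained, so threads that finish early pick up the remaining work.
//     (Single-block sketch; the real transformation also coordinates across blocks.)
__global__ void parent_reuse(int n) {
    int tid = threadIdx.x;
    if (tid < n) {                               // phase 1: publish discovered work
        int slot = atomicAdd(&g_tail, 1);
        g_worklist[slot] = Task{ tid * 8, 8 };
    }
    __syncthreads();                             // all pushed tasks now visible in-block
    for (;;) {                                   // phase 2: reuse parent threads as workers
        int idx = atomicAdd(&g_head, 1);
        if (idx >= g_tail) break;                // worklist drained
        Task t = g_worklist[idx];
        for (int i = 0; i < t.len; ++i) process_subtask(t.start + i);
    }
}

int main() {
    parent_reuse<<<1, 64>>>(16);                 // run the launch-free version
    cudaDeviceSynchronize();
    unsigned long long sum = 0;
    cudaMemcpyFromSymbol(&sum, g_sum, sizeof(sum));
    printf("processed sum = %llu\n", sum);       // expect 8128: 16 tasks x 8 items covering 0..127
    return 0;
}

In parent_dp every child_kernel launch goes through the device runtime, which is the overhead the paper targets. In parent_reuse the extra cost is only the atomic updates of the worklist indices, and parent threads that finish their own tasks early pick up the remaining ones, which is where the load-balancing benefit of thread reuse comes from.
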
Author Details
– Guoyang Chen (gchen11@ncsu.edu), Comput. Sci. Dept., North Carolina State Univ., Raleigh, NC, USA
– Xipeng Shen (xshen5@ncsu.edu), Comput. Sci. Dept., North Carolina State Univ., Raleigh, NC, USA