Free launch: Optimizing GPU dynamic kernel launches through thread reuse

Bibliographic Details
Published in: 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 407-419 (13 pages)
Main Authors: Guoyang Chen, Xipeng Shen
Format: Conference Proceeding
Language: English
Published: ACM, 01.12.2015
Subjects: Compiler; Dynamic Parallelism; GPU; Graphics processing units; Hardware; Instruction sets; Kernel; Optimization; Parallel processing; Runtime; Runtime Adaptation; Thread Reuse
Online Access: https://ieeexplore.ieee.org/document/7856615
ISSN: 2379-3155
EISSN: 2379-3155
EISBN: 1450340342, 9781450340342
DOI: 10.1145/2830772.2830818

Abstract
Supporting dynamic parallelism is important for GPUs to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPUs: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkernel launches. Neither is satisfactory. The former is complicated to program and is often subject to load imbalance; the latter suffers from large runtime overhead. In this work, we propose free launch, a new software approach to overcoming the shortcomings of both methods. It allows programmers to use subkernel launches to express dynamic parallelism. It employs a novel compiler-based code transformation named subkernel launch removal to replace the subkernel launches with the reuse of parent threads. Coupled with an adaptive task assignment mechanism, the transformation reassigns the tasks in the subkernels to the parent threads with a good load balance. The technique requires no hardware extensions and is immediately deployable on existing GPUs. It keeps the programming convenience of the subkernel launch-based approach while avoiding its large runtime overhead. Meanwhile, its superior load balancing makes it outperform manual worklist-based techniques by 3X on average.
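
The two styles contrasted in the abstract can be sketched in CUDA. The sketch below is illustrative only and is not the paper's compiler-generated code: parent_dp shows the dynamic-parallelism style, where each parent thread launches a child grid for the work it discovers, while parent_reuse mimics the thread-reuse idea by pushing discovered tasks into a worklist that the parent threads then drain. The names (Task, g_worklist, g_head, g_tail, process_subtask), the fixed task size, and the single-block setup are assumptions made for this example; the actual subkernel launch removal transformation and its adaptive task assignment are more general.

// Minimal CUDA sketch, for illustration only; not the output of the paper's
// subkernel launch removal transformation. Task, g_worklist, g_head, g_tail,
// and process_subtask are hypothetical names, and task sizes are fixed.
// Compile with: nvcc -rdc=true -arch=sm_60 -lcudadevrt free_launch_sketch.cu
#include <cstdio>
#include <cuda_runtime.h>

struct Task { int start, len; };                 // a chunk of dynamically discovered work

__device__ Task g_worklist[1 << 16];             // worklist filled by parent threads
__device__ int  g_tail = 0;                      // number of tasks pushed so far
__device__ int  g_head = 0;                      // next task index to be claimed
__device__ unsigned long long g_sum = 0;         // stand-in for the real output

__device__ void process_subtask(int item) {
    atomicAdd(&g_sum, (unsigned long long)item); // placeholder for per-item child work
}

// (a) Dynamic-parallelism style: each parent thread launches a child grid for the
//     work it discovers; every launch pays the subkernel-launch runtime overhead.
__global__ void child_kernel(Task t) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < t.len) process_subtask(t.start + i);
}

__global__ void parent_dp(int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    Task t = { tid * 8, 8 };                     // pretend this thread found 8 items
    child_kernel<<<(t.len + 63) / 64, 64>>>(t);  // needs compute capability >= 3.5
}

// (b) Thread-reuse style in the spirit of free launch: no child launches. Parent
//     threads publish their tasks, then keep grabbing tasks until the worklist is
//     drained, so threads that finish early pick up the remaining work.
//     (Single-block sketch; the real transformation also coordinates across blocks.)
__global__ void parent_reuse(int n) {
    int tid = threadIdx.x;
    if (tid < n) {                               // phase 1: publish discovered work
        int slot = atomicAdd(&g_tail, 1);
        g_worklist[slot] = Task{ tid * 8, 8 };
    }
    __syncthreads();                             // all pushed tasks now visible in-block
    for (;;) {                                   // phase 2: reuse parent threads as workers
        int idx = atomicAdd(&g_head, 1);
        if (idx >= g_tail) break;                // worklist drained
        Task t = g_worklist[idx];
        for (int i = 0; i < t.len; ++i) process_subtask(t.start + i);
    }
}

int main() {
    parent_reuse<<<1, 64>>>(16);                 // run the launch-free version
    cudaDeviceSynchronize();
    unsigned long long sum = 0;
    cudaMemcpyFromSymbol(&sum, g_sum, sizeof(sum));
    printf("processed sum = %llu\n", sum);       // expect 8128: 16 tasks x 8 items covering 0..127
    return 0;
}

In parent_dp every child_kernel launch goes through the device runtime, which is the overhead the paper targets. In parent_reuse the extra cost is only the atomic updates of the worklist indices, and parent threads that finish their own tasks early pick up the remaining ones, which is where the load-balancing benefit of thread reuse comes from.
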
Author Details
– Guoyang Chen (gchen11@ncsu.edu), Comput. Sci. Dept., North Carolina State Univ., Raleigh, NC, USA
– Xipeng Shen (xshen5@ncsu.edu), Comput. Sci. Dept., North Carolina State Univ., Raleigh, NC, USA