Free launch: Optimizing GPU dynamic kernel launches through thread reuse
Supporting dynamic parallelism is important for GPU to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPU: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkerne...
| Published in | 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) pp. 407 - 419 |
|---|---|
| Main Authors | Guoyang Chen, Xipeng Shen |
| Format | Conference Proceeding |
| Language | English |
| Published | ACM 01.12.2015 |
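| Subjects | Compiler; Dynamic Parallelism; GPU; Graphics processing units; Hardware; Instruction sets; Kernel; Optimization; Parallel processing; Runtime; Runtime Adaptation; Thread Reuse |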
| ISSN | 2379-3155 |
| DOI | 10.1145/2830772.2830818 |
| Abstract | Supporting dynamic parallelism is important for GPUs to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPUs: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkernel launches. Neither is satisfactory. The former is complicated to program and is often subject to load imbalance; the latter suffers from large runtime overhead. In this work, we propose free launch, a new software approach to overcoming the shortcomings of both methods. It allows programmers to use subkernel launches to express dynamic parallelism. It employs a novel compiler-based code transformation named subkernel launch removal to replace the subkernel launches with the reuse of parent threads. Coupled with an adaptive task assignment mechanism, the transformation reassigns the tasks in the subkernels to the parent threads with a good load balance. The technique requires no hardware extensions and is immediately deployable on existing GPUs. It keeps the programming convenience of the subkernel launch-based approach while avoiding its large runtime overhead. Meanwhile, its superior load balancing makes it outperform manual worklist-based techniques by 3X on average. |
| Author | Guoyang Chen; Xipeng Shen |
| Author_xml | 1. Guoyang Chen (gchen11@ncsu.edu), Comput. Sci. Dept., North Carolina State Univ., Raleigh, NC, USA; 2. Xipeng Shen (xshen5@ncsu.edu), Comput. Sci. Dept., North Carolina State Univ., Raleigh, NC, USA |
| ContentType | Conference Proceeding |
| EISBN | 1450340342; 9781450340342 |
| EISSN | 2379-3155 |
| EndPage | 419 |
| ExternalDocumentID | 7856615 |
| Genre | orig-research |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 13 |
| PublicationCentury | 2000 |
| PublicationDate | 2015-Dec. |
| PublicationDateYYYYMMDD | 2015-12-01 |
| PublicationDate_xml | – month: 12 year: 2015 text: 2015-Dec. |
| PublicationDecade | 2010 |
| PublicationTitle | 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) |
| PublicationTitleAbbrev | MICRO |
| PublicationYear | 2015 |
| Publisher | ACM |
| Publisher_xml | – name: ACM |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 407 |
| SubjectTerms | Compiler; Dynamic Parallelism; GPU; Graphics processing units; Hardware; Instruction sets; Kernel; Optimization; Parallel processing; Runtime; Runtime Adaptation; Thread Reuse |
| Title | Free launch: Optimizing GPU dynamic kernel launches through thread reuse |
| URI | https://ieeexplore.ieee.org/document/7856615 |
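
The abstract describes replacing subkernel launches with the reuse of parent threads and reassigning the child tasks with good load balance. The following is a minimal CPU-side sketch of that general idea in Python, not the paper's compiler transformation or its adaptive task assignment, whose details the abstract does not give. The names `child_work` and `run_with_thread_reuse`, and the simple strided assignment, are assumptions made for illustration only.

```python
from bisect import bisect_right
from itertools import accumulate


def child_work(parent: int, child: int) -> int:
    """Hypothetical stand-in for the body of one subkernel thread."""
    return parent * 1000 + child


def run_with_thread_reuse(child_counts, num_threads):
    """Simulate running child tasks on existing threads instead of launching subkernels.

    child_counts[p] is the number of child threads that parent p would have
    launched. All (parent, child) tasks are flattened into one index space,
    and each simulated thread strides over that space.
    """
    # ends[p] is one past the last flat task index that belongs to parent p.
    ends = list(accumulate(child_counts))
    total = ends[-1] if ends else 0

    results = {}
    tasks_per_thread = [0] * num_threads
    for tid in range(num_threads):  # each iteration models one reused thread
        for flat in range(tid, total, num_threads):
            # Map the flat index back to its (parent, child) pair.
            parent = bisect_right(ends, flat)
            child = flat - (ends[parent] - child_counts[parent])
            results[(parent, child)] = child_work(parent, child)
            tasks_per_thread[tid] += 1
    return results, tasks_per_thread
```

In this sketch, `run_with_thread_reuse([1, 0, 9, 2], 4)` spreads the 12 child tasks so that each of the four simulated threads handles three, whereas having every parent process only its own children would leave one thread with nine tasks. A real GPU version would also need to handle synchronization and block-level scheduling, which this simulation ignores.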