Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC


Bibliographic Details
Published in: International Conference on Field-Programmable Logic and Applications (FPL), pp. 1-4
Main Authors: Nurvitadhi, Eriko; Sim, Jaewoong; Sheffield, David; Mishra, Asit; Krishnan, Srivatsan; Marr, Debbie (Intel Corp., Hillsboro, OR, USA)
Format: Conference Proceeding
Language: English
Published: EPFL, 01.08.2016
ISSN: 1946-1488
DOI: 10.1109/FPL.2016.7577314


Abstract: Recurrent neural networks (RNNs) provide state-of-the-art accuracy for analytics on sequential datasets (e.g., language modeling). This paper studies a state-of-the-art RNN variant, the Gated Recurrent Unit (GRU). We first propose a memoization optimization that avoids 3 of the 6 dense matrix-vector multiplications (SGEMVs) that constitute the majority of the computation in a GRU. We then study opportunities to accelerate the remaining SGEMVs using FPGAs, in comparison to a 14-nm ASIC, a GPU, and a multi-core CPU. Results show that the FPGA provides superior performance/Watt over the CPU and GPU because the FPGA's on-chip BRAMs, hard DSPs, and reconfigurable fabric allow fine-grained parallelism to be extracted efficiently from the small/medium-size matrices used by the GRU. Moreover, newer FPGAs with more DSPs, more on-chip BRAM, and higher frequencies have the potential to narrow the FPGA-ASIC efficiency gap.
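The abstract's memoization idea can be illustrated with a minimal sketch. This is not the authors' code: it assumes a standard GRU formulation with three input-side and three state-side matrix-vector products, and it assumes (plausibly, for a language model) that the input is a one-hot word vector, so each input-side SGEMV collapses into a column lookup. All names, shapes, and the exact gating equations here are illustrative assumptions.

```python
import numpy as np

# Toy GRU cell (hypothetical shapes, not the paper's implementation).
# A standard GRU step performs 6 dense matrix-vector products:
#   3 on the input x_t:        Uz@x, Ur@x, Uh@x
#   3 on the state h_{t-1}:    Wz@h, Wr@h, Wh@h
rng = np.random.default_rng(0)
V, H = 50, 8  # vocabulary size, hidden size (toy values)
Uz, Ur, Uh = (rng.standard_normal((H, V)) for _ in range(3))
Wz, Wr, Wh = (rng.standard_normal((H, H)) for _ in range(3))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h):
    """Plain GRU step: all 6 SGEMVs computed densely."""
    z = sigmoid(Uz @ x + Wz @ h)          # update gate
    r = sigmoid(Ur @ x + Wr @ h)          # reset gate
    hbar = np.tanh(Uh @ x + Wh @ (r * h)) # candidate state
    return (1 - z) * h + z * hbar

def gru_step_memo(w, h):
    """Memoized step: when x_t is one-hot with index w, U @ x_t is just
    column w of U, so the 3 input-side SGEMVs become table lookups and
    only the 3 state-side SGEMVs remain."""
    z = sigmoid(Uz[:, w] + Wz @ h)
    r = sigmoid(Ur[:, w] + Wr @ h)
    hbar = np.tanh(Uh[:, w] + Wh @ (r * h))
    return (1 - z) * h + z * hbar
```

Under the one-hot assumption, both variants produce identical states, which is consistent with the abstract's claim that 3 of the 6 SGEMVs can be avoided without changing the result.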
EISBN: 9782839918442, 2839918447
Subjects: Classification algorithms; Field programmable gate arrays; Graphics processing units; Logic gates; Random access memory; Recurrent neural networks; Runtime
URI: https://ieeexplore.ieee.org/document/7577314