Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC


Bibliographic Details
Published in: International Conference on Field-Programmable Logic and Applications (FPL), pp. 1-4
Main Authors: Nurvitadhi, Eriko; Sim, Jaewoong; Sheffield, David; Mishra, Asit; Krishnan, Srivatsan; Marr, Debbie (Intel Corp., Hillsboro, OR, USA)
Format: Conference Proceeding
Language: English
Published: EPFL, 01.08.2016
ISSN: 1946-1488
DOI: 10.1109/FPL.2016.7577314


Abstract: Recurrent neural networks (RNNs) provide state-of-the-art accuracy for analytics on sequential datasets (e.g., language modeling). This paper studies a state-of-the-art RNN variant, the Gated Recurrent Unit (GRU). We first propose a memoization optimization that avoids 3 of the 6 dense matrix-vector multiplications (SGEMVs) that constitute the majority of the computation in a GRU. We then study opportunities to accelerate the remaining SGEMVs using FPGAs, in comparison to a 14-nm ASIC, a GPU, and a multi-core CPU. Results show that the FPGA provides superior performance/Watt over the CPU and GPU because the FPGA's on-chip BRAMs, hard DSPs, and reconfigurable fabric allow fine-grained parallelism to be extracted efficiently from the small/medium-size matrices used by the GRU. Moreover, newer FPGAs with more DSPs, more on-chip BRAM, and higher frequencies have the potential to narrow the FPGA-ASIC efficiency gap.
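The abstract's memoization idea can be illustrated with a minimal sketch. This is not the authors' code: it assumes a standard GRU formulation with three input-side and three state-side matrix-vector products, and it assumes (plausibly, for a language model) that the input is a one-hot word vector, so each input-side SGEMV collapses into a column lookup. All names, shapes, and the exact gating equations here are illustrative assumptions.

```python
import numpy as np

# Toy GRU cell (hypothetical shapes, not the paper's implementation).
# A standard GRU step performs 6 dense matrix-vector products:
#   3 on the input x_t:        Uz@x, Ur@x, Uh@x
#   3 on the state h_{t-1}:    Wz@h, Wr@h, Wh@h
rng = np.random.default_rng(0)
V, H = 50, 8  # vocabulary size, hidden size (toy values)
Uz, Ur, Uh = (rng.standard_normal((H, V)) for _ in range(3))
Wz, Wr, Wh = (rng.standard_normal((H, H)) for _ in range(3))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h):
    """Plain GRU step: all 6 SGEMVs computed densely."""
    z = sigmoid(Uz @ x + Wz @ h)          # update gate
    r = sigmoid(Ur @ x + Wr @ h)          # reset gate
    hbar = np.tanh(Uh @ x + Wh @ (r * h)) # candidate state
    return (1 - z) * h + z * hbar

def gru_step_memo(w, h):
    """Memoized step: when x_t is one-hot with index w, U @ x_t is just
    column w of U, so the 3 input-side SGEMVs become table lookups and
    only the 3 state-side SGEMVs remain."""
    z = sigmoid(Uz[:, w] + Wz @ h)
    r = sigmoid(Ur[:, w] + Wr @ h)
    hbar = np.tanh(Uh[:, w] + Wh @ (r * h))
    return (1 - z) * h + z * hbar
```

Under the one-hot assumption, both variants produce identical states, which is consistent with the abstract's claim that 3 of the 6 SGEMVs can be avoided without changing the result.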
EISBN: 9782839918442, 2839918447
Subjects: Classification algorithms; Field programmable gate arrays; Graphics processing units; Logic gates; Random access memory; Recurrent neural networks; Runtime
URI: https://ieeexplore.ieee.org/document/7577314