A Better and Faster end-to-end Model for Streaming ASR

End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay the predictions towards the...

Bibliographic Details
Published in	Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 5634 - 5638
Main Authors	Li, Bo; Gulati, Anmol; Yu, Jiahui; Sainath, Tara N.; Chiu, Chung-Cheng; Narayanan, Arun; Chang, Shuo-Yiin; Pang, Ruoming; He, Yanzhang; Qin, James; Han, Wei; Liang, Qiao; Zhang, Yu; Strohman, Trevor; Wu, Yonghui
Format	Conference Proceeding
Language	English
Published	IEEE 06.06.2021
Subjects	cascaded encoders; Conformer; latency; Measurement uncertainty; Prediction algorithms; Predictive models; RNN-T; Signal processing; Signal processing algorithms; Speech recognition; Transducers
ISSN	2379-190X
DOI	10.1109/ICASSP39728.2021.9413899

Abstract	End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the E2E model still tends to delay its predictions towards the end of the utterance and thus has much higher partial latency than a conventional ASR model. To address this issue, we look at encouraging the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving latency results in a quality degradation. To address this, we first explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR. Second, we explore running a 2nd-pass beam search to improve quality. To ensure that the 2nd pass completes quickly, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders [5]. Overall, the Conformer RNN-T with Cascaded Encoders offers a better quality and latency tradeoff for streaming ASR.
Author	Chiu, Chung-Cheng Liang, Qiao Chang, Shuo-Yiin Zhang, Yu Qin, James Sainath, Tara N. Yu, Jiahui Strohman, Trevor Han, Wei Li, Bo Pang, Ruoming He, Yanzhang Gulati, Anmol Narayanan, Arun Wu, Yonghui
Author_xml	– sequence: 1 givenname: Bo surname: Li fullname: Li, Bo email: boboli@google.com organization: Google LLC,USA – sequence: 2 givenname: Anmol surname: Gulati fullname: Gulati, Anmol email: anmolgulati@google.com organization: Google LLC,USA – sequence: 3 givenname: Jiahui surname: Yu fullname: Yu, Jiahui email: jiahuiyu@google.com organization: Google LLC,USA – sequence: 4 givenname: Tara N. surname: Sainath fullname: Sainath, Tara N. organization: Google LLC,USA – sequence: 5 givenname: Chung-Cheng surname: Chiu fullname: Chiu, Chung-Cheng organization: Google LLC,USA – sequence: 6 givenname: Arun surname: Narayanan fullname: Narayanan, Arun organization: Google LLC,USA – sequence: 7 givenname: Shuo-Yiin surname: Chang fullname: Chang, Shuo-Yiin organization: Google LLC,USA – sequence: 8 givenname: Ruoming surname: Pang fullname: Pang, Ruoming organization: Google LLC,USA – sequence: 9 givenname: Yanzhang surname: He fullname: He, Yanzhang organization: Google LLC,USA – sequence: 10 givenname: James surname: Qin fullname: Qin, James organization: Google LLC,USA – sequence: 11 givenname: Wei surname: Han fullname: Han, Wei organization: Google LLC,USA – sequence: 12 givenname: Qiao surname: Liang fullname: Liang, Qiao organization: Google LLC,USA – sequence: 13 givenname: Yu surname: Zhang fullname: Zhang, Yu organization: Google LLC,USA – sequence: 14 givenname: Trevor surname: Strohman fullname: Strohman, Trevor organization: Google LLC,USA – sequence: 15 givenname: Yonghui surname: Wu fullname: Wu, Yonghui organization: Google LLC,USA
ContentType	Conference Proceeding
DBID	6IE 6IH CBEJK RIE RIO
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP) 1998-present
Discipline	Engineering
EISBN	9781728176055 1728176050
EISSN	2379-190X
EndPage	5638
ExternalDocumentID	9413899
Genre	orig-research
GroupedDBID	23M 6IE 6IF 6IH 6IK 6IL 6IM 6IN AAJGR AAWTH ABLEC ACGFS ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP IPLJI M43 OCL RIE RIL RIO RNS
IEDL.DBID	RIE
IngestDate	Wed Aug 27 02:39:02 EDT 2025
IsPeerReviewed	false
IsScholarly	true
Language	English
LinkModel	DirectLink
PageCount	5
ParticipantIDs	ieee_primary_9413899
PublicationCentury	2000
PublicationDate	2021-June-6
PublicationDateYYYYMMDD	2021-06-06
PublicationDate_xml	– month: 06 year: 2021 text: 2021-June-6 day: 06
PublicationDecade	2020
PublicationTitle	Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998)
PublicationTitleAbbrev	ICASSP
PublicationYear	2021
Publisher	IEEE
Publisher_xml	– name: IEEE
SSID	ssj0008748
Score	2.5570612
SourceID	ieee
SourceType	Publisher
StartPage	5634
Title	A Better and Faster end-to-end Model for Streaming ASR
URI	https://ieeexplore.ieee.org/document/9413899