A Better and Faster end-to-end Model for Streaming ASR

Bibliographic Details
Published in Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 5634-5638
Main Authors Li, Bo; Gulati, Anmol; Yu, Jiahui; Sainath, Tara N.; Chiu, Chung-Cheng; Narayanan, Arun; Chang, Shuo-Yiin; Pang, Ruoming; He, Yanzhang; Qin, James; Han, Wei; Liang, Qiao; Zhang, Yu; Strohman, Trevor; Wu, Yonghui
Format Conference Proceeding
Language English
Published IEEE 06.06.2021
Subjects
ISSN 2379-190X
DOI 10.1109/ICASSP39728.2021.9413899

Abstract End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the E2E model still tends to delay its predictions towards the end and thus has much higher partial latency than a conventional ASR model. To address this issue, we look at encouraging the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving latency results in a quality degradation. To address this, we first explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR. Second, we explore running a 2nd-pass beam search to improve quality. To ensure the 2nd-pass completes quickly, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders [5]. Overall, the Conformer RNN-T with Cascaded Encoders offers a better quality and latency tradeoff for streaming ASR.
Author_xml – sequence: 1
  givenname: Bo
  surname: Li
  fullname: Li, Bo
  email: boboli@google.com
  organization: Google LLC,USA
– sequence: 2
  givenname: Anmol
  surname: Gulati
  fullname: Gulati, Anmol
  email: anmolgulati@google.com
  organization: Google LLC,USA
– sequence: 3
  givenname: Jiahui
  surname: Yu
  fullname: Yu, Jiahui
  email: jiahuiyu@google.com
  organization: Google LLC,USA
– sequence: 4
  givenname: Tara N.
  surname: Sainath
  fullname: Sainath, Tara N.
  organization: Google LLC,USA
– sequence: 5
  givenname: Chung-Cheng
  surname: Chiu
  fullname: Chiu, Chung-Cheng
  organization: Google LLC,USA
– sequence: 6
  givenname: Arun
  surname: Narayanan
  fullname: Narayanan, Arun
  organization: Google LLC,USA
– sequence: 7
  givenname: Shuo-Yiin
  surname: Chang
  fullname: Chang, Shuo-Yiin
  organization: Google LLC,USA
– sequence: 8
  givenname: Ruoming
  surname: Pang
  fullname: Pang, Ruoming
  organization: Google LLC,USA
– sequence: 9
  givenname: Yanzhang
  surname: He
  fullname: He, Yanzhang
  organization: Google LLC,USA
– sequence: 10
  givenname: James
  surname: Qin
  fullname: Qin, James
  organization: Google LLC,USA
– sequence: 11
  givenname: Wei
  surname: Han
  fullname: Han, Wei
  organization: Google LLC,USA
– sequence: 12
  givenname: Qiao
  surname: Liang
  fullname: Liang, Qiao
  organization: Google LLC,USA
– sequence: 13
  givenname: Yu
  surname: Zhang
  fullname: Zhang, Yu
  organization: Google LLC,USA
– sequence: 14
  givenname: Trevor
  surname: Strohman
  fullname: Strohman, Trevor
  organization: Google LLC,USA
– sequence: 15
  givenname: Yonghui
  surname: Wu
  fullname: Wu, Yonghui
  organization: Google LLC,USA
ContentType Conference Proceeding
DBID 6IE
6IH
CBEJK
RIE
RIO
DOI 10.1109/ICASSP39728.2021.9413899
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
EISBN 9781728176055
1728176050
EISSN 2379-190X
EndPage 5638
ExternalDocumentID 9413899
Genre orig-research
IEDL.DBID RIE
IngestDate Wed Aug 27 02:39:02 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 5
ParticipantIDs ieee_primary_9413899
PublicationCentury 2000
PublicationDate 2021-June-6
PublicationDateYYYYMMDD 2021-06-06
PublicationDate_xml – month: 06
  year: 2021
  text: 2021-June-6
  day: 06
PublicationDecade 2020
PublicationTitle Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998)
PublicationTitleAbbrev ICASSP
PublicationYear 2021
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0008748
Score 2.5570612
SourceID ieee
SourceType Publisher
StartPage 5634
SubjectTerms cascaded encoders
Conformer
latency
Measurement uncertainty
Prediction algorithms
Predictive models
RNN-T
Signal processing
Signal processing algorithms
Speech recognition
Transducers
Title A Better and Faster end-to-end Model for Streaming ASR
URI https://ieeexplore.ieee.org/document/9413899
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
link http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV3JTsMwELXanuDC0iJ2-cARp2nsxM6xVFQFqaiiVOqt8niREJBWKL3w9YzThUUcOMVKZCXxyH4ez5s3hFzJjo0z1zEMwOdMKJMyDRAO8iUIabW3lW7B8CEbTMT9NJ3WyPU2F8Y5V5HPXBSaVSzfzs0yHJW1cxHCanmd1KXKVrla21VXSaE2TJ04b9_1uuPxCME2CfytpBOt-_4oolJhSH-PDDdvX1FHXqJlCZH5-CXM-N_P2yetr2w9Otri0AGpueKQ7H4TGmySrEtvqrwdqgtL-zqoI1BXWFbOGV5oqIj2SnH_SkOUWr9hJ9odP7bIpH_71BuwdcUE9pzEvGS5Rg_BKeudF6Ck1tJbzyHFFS1FLDYxt6DROgYMjqDkkBjlM4_zlnPNc-BHpFHMC3dMqAR8KIwG5dB7Nil6kjzV2ikBuIfU8oQ0wwjMFitRjNn650__vn1GdoIVKo5Vdk4a5fvSXSCal3BZmfET4Z2eqQ
linkProvider IEEE
linkToHtml http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV1LTwIxEJ4gHtSLDzS-7cGjXWDbbrtHJBJQIEQg4Ub6TIy6GLNc_PW2y8NHPHjaZjfNbnfSfp3ON98AXPO6qSW2rrFSLsVUaIalUuEgnyvKjXSm0C3o9ZP2mN5P2KQEN-tcGGttQT6zUWgWsXwz0_NwVFZNaQirpRuwySilbJGttV53BadixdWppdVOszEcDjzcxoHBFdejZe8fZVQKFGntQm_1_gV55Dma5yrSH7-kGf_7gXtw-JWvhwZrJNqHks0OYOeb1GAFkga6LTJ3kMwMasmgj4BsZnA-w_6CQk20F-R3sCjEqeWr74Qaw8dDGLfuRs02XtZMwE9xjeQ4ld5HsMI466gSXErujCOK-TWNeTTWNWKU9PbRSnORcKJiLVzi_MwlRJJUkSMoZ7PMHgPiyj-kWiphvf-smfclCZPSCqr8LlLyE6iEPzB9W8hiTJeDP_379hVstUe97rTb6T-cwXawSMG4Ss6hnL_P7YXH9lxdFib9BGS-ofY
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+of+the+...+IEEE+International+Conference+on+Acoustics%2C+Speech+and+Signal+Processing+%281998%29&rft.atitle=A+Better+and+Faster+end-to-end+Model+for+Streaming+ASR&rft.au=Li%2C+Bo&rft.au=Gulati%2C+Anmol&rft.au=Yu%2C+Jiahui&rft.au=Sainath%2C+Tara+N.&rft.date=2021-06-06&rft.pub=IEEE&rft.eissn=2379-190X&rft.spage=5634&rft.epage=5638&rft_id=info:doi/10.1109%2FICASSP39728.2021.9413899&rft.externalDocID=9413899