A Better and Faster End-to-end Model for Streaming ASR
End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay the predictions towards the...
| Published in | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 5634 - 5638 |
|---|---|
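| Main Authors | Li, Bo; Gulati, Anmol; Yu, Jiahui; Sainath, Tara N.; Chiu, Chung-Cheng; Narayanan, Arun; Chang, Shuo-Yiin; Pang, Ruoming; He, Yanzhang; Qin, James; Han, Wei; Liang, Qiao; Zhang, Yu; Strohman, Trevor; Wu, Yonghui |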
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 06.06.2021 |
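| Subjects | cascaded encoders; Conformer; latency; Measurement uncertainty; Prediction algorithms; Predictive models; RNN-T; Signal processing; Signal processing algorithms; Speech recognition; Transducers |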
| ISSN | 2379-190X |
| DOI | 10.1109/ICASSP39728.2021.9413899 |
| Abstract | End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, the model still tends to delay the predictions towards the end and thus has much higher partial latency compared to a conventional ASR model. To address this issue, we look at encouraging the E2E model to emit words early, through an algorithm called FastEmit [3]. Naturally, improving on latency results in a quality degradation. To address this, we explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR. Secondly, we also explore running a 2nd-pass beam search to improve quality. In order to ensure the 2nd-pass completes quickly, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders [5]. Overall, the Conformer RNN-T with Cascaded Encoders offers a better quality and latency tradeoff for streaming ASR. |
|---|---|
| Author | Chiu, Chung-Cheng; Liang, Qiao; Chang, Shuo-Yiin; Zhang, Yu; Qin, James; Sainath, Tara N.; Yu, Jiahui; Strohman, Trevor; Han, Wei; Li, Bo; Pang, Ruoming; He, Yanzhang; Gulati, Anmol; Narayanan, Arun; Wu, Yonghui |
| Author_xml | 1. Li, Bo (boboli@google.com), Google LLC, USA; 2. Gulati, Anmol (anmolgulati@google.com), Google LLC, USA; 3. Yu, Jiahui (jiahuiyu@google.com), Google LLC, USA; 4. Sainath, Tara N., Google LLC, USA; 5. Chiu, Chung-Cheng, Google LLC, USA; 6. Narayanan, Arun, Google LLC, USA; 7. Chang, Shuo-Yiin, Google LLC, USA; 8. Pang, Ruoming, Google LLC, USA; 9. He, Yanzhang, Google LLC, USA; 10. Qin, James, Google LLC, USA; 11. Han, Wei, Google LLC, USA; 12. Liang, Qiao, Google LLC, USA; 13. Zhang, Yu, Google LLC, USA; 14. Strohman, Trevor, Google LLC, USA; 15. Wu, Yonghui, Google LLC, USA |
| ContentType | Conference Proceeding |
| DOI | 10.1109/ICASSP39728.2021.9413899 |
| Discipline | Engineering |
| EISBN | 9781728176055 1728176050 |
| EISSN | 2379-190X |
| EndPage | 5638 |
| ExternalDocumentID | 9413899 |
| Genre | orig-research |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 5 |
| PublicationCentury | 2000 |
| PublicationDate | 2021-June-6 |
| PublicationDateYYYYMMDD | 2021-06-06 |
| PublicationDate_xml | – month: 06 year: 2021 text: 2021-June-6 day: 06 |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) |
| PublicationTitleAbbrev | ICASSP |
| PublicationYear | 2021 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 5634 |
| SubjectTerms | cascaded encoders; Conformer; latency; Measurement uncertainty; Prediction algorithms; Predictive models; RNN-T; Signal processing; Signal processing algorithms; Speech recognition; Transducers |
| Title | A Better and Faster end-to-end Model for Streaming ASR |
| URI | https://ieeexplore.ieee.org/document/9413899 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| link | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV3JTsMwELXanuDC0iJ2-cARp2nsxM6xVFQFqaiiVOqt8niREJBWKL3w9YzThUUcOMVKZCXxyH4ez5s3hFzJjo0z1zEMwOdMKJMyDRAO8iUIabW3lW7B8CEbTMT9NJ3WyPU2F8Y5V5HPXBSaVSzfzs0yHJW1cxHCanmd1KXKVrla21VXSaE2TJ04b9_1uuPxCME2CfytpBOt-_4oolJhSH-PDDdvX1FHXqJlCZH5-CXM-N_P2yetr2w9Otri0AGpueKQ7H4TGmySrEtvqrwdqgtL-zqoI1BXWFbOGV5oqIj2SnH_SkOUWr9hJ9odP7bIpH_71BuwdcUE9pzEvGS5Rg_BKeudF6Ck1tJbzyHFFS1FLDYxt6DROgYMjqDkkBjlM4_zlnPNc-BHpFHMC3dMqAR8KIwG5dB7Nil6kjzV2ikBuIfU8oQ0wwjMFitRjNn650__vn1GdoIVKo5Vdk4a5fvSXSCal3BZmfET4Z2eqQ |
| linkProvider | IEEE |
| linkToHtml | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV1LTwIxEJ4gHtSLDzS-7cGjXWDbbrtHJBJQIEQg4Ub6TIy6GLNc_PW2y8NHPHjaZjfNbnfSfp3ON98AXPO6qSW2rrFSLsVUaIalUuEgnyvKjXSm0C3o9ZP2mN5P2KQEN-tcGGttQT6zUWgWsXwz0_NwVFZNaQirpRuwySilbJGttV53BadixdWppdVOszEcDjzcxoHBFdejZe8fZVQKFGntQm_1_gV55Dma5yrSH7-kGf_7gXtw-JWvhwZrJNqHks0OYOeb1GAFkga6LTJ3kMwMasmgj4BsZnA-w_6CQk20F-R3sCjEqeWr74Qaw8dDGLfuRs02XtZMwE9xjeQ4ld5HsMI466gSXErujCOK-TWNeTTWNWKU9PbRSnORcKJiLVzi_MwlRJJUkSMoZ7PMHgPiyj-kWiphvf-smfclCZPSCqr8LlLyE6iEPzB9W8hiTJeDP_379hVstUe97rTb6T-cwXawSMG4Ss6hnL_P7YXH9lxdFib9BGS-ofY |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+of+the+...+IEEE+International+Conference+on+Acoustics%2C+Speech+and+Signal+Processing+%281998%29&rft.atitle=A+Better+and+Faster+end-to-end+Model+for+Streaming+ASR&rft.au=Li%2C+Bo&rft.au=Gulati%2C+Anmol&rft.au=Yu%2C+Jiahui&rft.au=Sainath%2C+Tara+N.&rft.date=2021-06-06&rft.pub=IEEE&rft.eissn=2379-190X&rft.spage=5634&rft.epage=5638&rft_id=info:doi/10.1109%2FICASSP39728.2021.9413899&rft.externalDocID=9413899 |
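
The abstract reports quality as word error rate (WER). As a rough illustration of that metric only, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example; the record does not describe the scoring tool or text normalization the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which is an assumption, not the paper's setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("turn on the light", "turn on light"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.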