A Better and Faster End-to-End Model for Streaming ASR

Bibliographic Details
Published in	Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 5634 - 5638
Main Authors	Li, Bo; Gulati, Anmol; Yu, Jiahui; Sainath, Tara N.; Chiu, Chung-Cheng; Narayanan, Arun; Chang, Shuo-Yiin; Pang, Ruoming; He, Yanzhang; Qin, James; Han, Wei; Liang, Qiao; Zhang, Yu; Strohman, Trevor; Wu, Yonghui
Format	Conference Proceeding
Language	English
Published	IEEE 06.06.2021
Subjects	cascaded encoders; Conformer; latency; Measurement uncertainty; Prediction algorithms; Predictive models; RNN-T; Signal processing; Signal processing algorithms; Speech recognition; Transducers
ISSN	2379-190X
DOI	10.1109/ICASSP39728.2021.9413899

Summary:	End-to-end (E2E) models have been shown to outperform state-of-the-art conventional models for streaming speech recognition [1] across many dimensions, including quality (as measured by word error rate (WER)) and endpointer latency [2]. However, an E2E model still tends to delay its predictions towards the end and thus has much higher partial latency than a conventional ASR model. To address this issue, we encourage the E2E model to emit words early through an algorithm called FastEmit [3]. Naturally, improving latency degrades quality. To recover quality, we first explore replacing the LSTM layers in the encoder of our E2E model with Conformer layers [4], which have shown good improvements for ASR. Second, we explore running a 2nd-pass beam search to improve quality further. To ensure that the 2nd pass completes quickly, we explore non-causal Conformer layers that feed into the same 1st-pass RNN-T decoder, an algorithm called Cascaded Encoders [5]. Overall, the Conformer RNN-T with Cascaded Encoders offers a better quality-latency tradeoff for streaming ASR.