Improved deep speaker feature learning for text-dependent speaker recognition

Bibliographic Details
Published in	2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 426-429
Main Authors	Lantian Li, Yiye Lin, Zhiyong Zhang, Dong Wang
Format	Conference Proceeding
Language	English
Published	Asia-Pacific Signal and Information Processing Association 01.12.2015
Subjects	d-vector; Data models; Machine learning; Mel frequency cepstral coefficient; Speaker recognition; time dynamic warping; Training; Training data
DOI	10.1109/APSIPA.2015.7415306

Summary:	A deep learning approach has recently been proposed to derive speaker identities (d-vectors) with a deep neural network (DNN). This approach has been applied to text-dependent speaker recognition tasks and shows reasonable performance gains when combined with the conventional i-vector approach. Although promising, the existing d-vector implementation still cannot compete with the i-vector baseline. This paper presents two improvements to the deep learning approach: a phone-dependent DNN structure to normalize phone variation, and a new scoring approach based on dynamic time warping (DTW). Experiments on a text-dependent speaker recognition task demonstrate that the proposed methods provide considerable performance improvement over the existing d-vector implementation.
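
The summary mentions a scoring approach based on dynamic time warping (DTW) but does not give its details. The following is only a minimal sketch of generic DTW scoring between two sequences of frame-level speaker features, not the paper's method. The function names, the use of cosine distance as the local cost, and the normalization by total sequence length are all assumptions made for illustration.

```python
import numpy as np


def cosine_distance_matrix(a, b):
    """Pairwise cosine distance between rows of a (n, d) and b (m, d).

    Assumption: cosine distance is used as the local frame-to-frame cost.
    """
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_unit @ b_unit.T


def dtw_distance(a, b):
    """Length-normalized DTW distance between two feature sequences.

    a, b: arrays of shape (n, d) and (m, d), one feature vector per frame.
    Lower values mean the sequences align more closely.
    """
    cost = cosine_distance_matrix(a, b)
    n, m = cost.shape
    # acc[i, j] holds the cheapest cumulative cost of aligning
    # the first i frames of a with the first j frames of b.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    # Hypothetical normalization so that scores from utterances of
    # different lengths are roughly comparable.
    return acc[n, m] / (n + m)
```

In a text-dependent setting, a score of this kind would compare an enrollment utterance and a test utterance of the same phrase frame by frame, so that differences in speaking rate are absorbed by the alignment instead of being averaged away.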