Improved deep speaker feature learning for text-dependent speaker recognition

Bibliographic Details
Published in	2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) pp. 426 - 429
Main Authors	Lantian Li, Yiye Lin, Zhiyong Zhang, Dong Wang
Format	Conference Proceeding
Language	English
Published	Asia-Pacific Signal and Information Processing Association 01.12.2015
Subjects	d-vector; Data models; Machine learning; Mel frequency cepstral coefficient; Speaker recognition; time dynamic warping; Training; Training data
DOI	10.1109/APSIPA.2015.7415306

Abstract	A deep learning approach has been proposed recently to derive speaker identities (d-vectors) using a deep neural network (DNN). This approach has been applied to text-dependent speaker recognition tasks and shows reasonable performance gains when combined with the conventional i-vector approach. Although promising, the existing d-vector implementation still cannot compete with the i-vector baseline. This paper presents two improvements to the deep learning approach: a phone-dependent DNN structure to normalize phone variation, and a new scoring approach based on dynamic time warping (DTW). Experiments on a text-dependent speaker recognition task demonstrated that the proposed methods provide considerable performance improvement over the existing d-vector implementation.
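The abstract names DTW-based scoring but this record gives no details of it. As a rough illustration only, the sketch below assumes each utterance is represented as a sequence of frame-level d-vectors and uses cosine distance as the local cost; both choices, and the function names `cosine_distance` and `dtw_distance`, are assumptions for illustration and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat a zero vector as orthogonal to everything
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(enroll, test):
    """Length-normalized DTW cost between two sequences of frame vectors.

    Classic dynamic-programming DTW with steps (i-1, j), (i, j-1), (i-1, j-1).
    A lower value means the two sequences align more closely.
    """
    n, m = len(enroll), len(test)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning enroll[:i] with test[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = cosine_distance(enroll[i - 1], test[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    # Normalize by the combined length so long utterances are not penalized.
    return cost[n][m] / (n + m)
```

In a verification setting, a test utterance would be accepted when this distance falls below a tuned threshold; unlike averaging the frame vectors into one d-vector, the alignment keeps the temporal order of the fixed pass-phrase.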
Author details	Lantian Li (lilt@cslt.riit.tsinghua.edu.cn); Yiye Lin (lyy@cslt.riit.tsinghua.edu.cn); Zhiyong Zhang (zhangzy@cslt.riit.tsinghua.edu.cn); Dong Wang (wangdong99@mails.tsinghua.edu.cn). Affiliation for all authors: Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
EISBN	9881476801 9789881476807
URI	https://ieeexplore.ieee.org/document/7415306