Improved deep speaker feature learning for text-dependent speaker recognition

Bibliographic Details
Published in 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 426-429
Main Authors Lantian Li, Yiye Lin, Zhiyong Zhang, Dong Wang
Format Conference Proceeding
Language English
Published Asia-Pacific Signal and Information Processing Association 01.12.2015
DOI 10.1109/APSIPA.2015.7415306

Abstract A deep learning approach has been proposed recently to derive speaker identities (d-vector) by a deep neural network (DNN). This approach has been applied to text-dependent speaker recognition tasks and shows reasonable performance gains when combined with the conventional i-vector approach. Although promising, the existing d-vector implementation still cannot compete with the i-vector baseline. This paper presents two improvements for the deep learning approach: a phone-dependent DNN structure to normalize phone variation, and a new scoring approach based on dynamic time warping (DTW). Experiments on a text-dependent speaker recognition task demonstrated that the proposed methods can provide considerable performance improvement over the existing d-vector implementation.
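
The abstract names DTW-based scoring but does not describe its details in this record. As a rough illustration only, the sketch below shows a generic DTW alignment cost between two sequences of frame-level feature vectors. The function names (`cosine_distance`, `dtw_distance`), the choice of cosine distance as the local cost, and the unconstrained step pattern are all assumptions made for this example, not the authors' configuration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors.

    Cosine distance is an assumed local cost; the record does not say
    which frame-level distance the paper uses.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(enroll, test, dist=cosine_distance):
    """Accumulated cost of the best monotonic alignment of two sequences.

    `enroll` and `test` are lists of feature vectors (hypothetical
    stand-ins for frame-level DNN features). A lower cost means the
    two utterances align more closely.
    """
    n, m = len(enroll), len(test)
    inf = float("inf")
    # cost[i][j]: best accumulated cost aligning enroll[:i] with test[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(enroll[i - 1], test[j - 1])
            # Allowed steps: advance in enroll, advance in test, or both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because the alignment can repeat frames, a test sequence that is merely a time-stretched copy of the enrollment sequence keeps a low cost, which is the property that makes DTW attractive when both utterances contain the same phrase. A practical system would typically also normalize the cost by path length and convert it to a similarity score; both steps are omitted here.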
Author Affiliations
Lantian Li (lilt@cslt.riit.tsinghua.edu.cn), Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
Yiye Lin (lyy@cslt.riit.tsinghua.edu.cn), Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
Zhiyong Zhang (zhangzy@cslt.riit.tsinghua.edu.cn), Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
Dong Wang (wangdong99@mails.tsinghua.edu.cn), Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
EISBN 9881476801, 9789881476807
Subjects d-vector; Data models; Machine learning; Mel frequency cepstral coefficient; Speaker recognition; time dynamic warping; Training; Training data
URI https://ieeexplore.ieee.org/document/7415306