Improved deep speaker feature learning for text-dependent speaker recognition
| Published in | 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) pp. 426 - 429 |
|---|---|
| Main Authors | Lantian Li, Yiye Lin, Zhiyong Zhang, Dong Wang |
| Format | Conference Proceeding |
| Language | English |
| Published | Asia-Pacific Signal and Information Processing Association, 01.12.2015 |
| DOI | 10.1109/APSIPA.2015.7415306 |
| Abstract | A deep learning approach has recently been proposed to derive speaker identities (d-vectors) with a deep neural network (DNN). This approach has been applied to text-dependent speaker recognition tasks and shows reasonable performance gains when combined with the conventional i-vector approach. Although promising, the existing d-vector implementation still cannot compete with the i-vector baseline. This paper presents two improvements to the deep learning approach: a phone-dependent DNN structure to normalize phone variation, and a new scoring approach based on dynamic time warping (DTW). Experiments on a text-dependent speaker recognition task demonstrate that the proposed methods provide considerable performance improvement over the existing d-vector implementation. |
|---|---|
| Author | Zhiyong Zhang; Dong Wang; Lantian Li; Yiye Lin |
| Author_xml | 1. Lantian Li (lilt@cslt.riit.tsinghua.edu.cn), Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China; 2. Yiye Lin (lyy@cslt.riit.tsinghua.edu.cn), Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China; 3. Zhiyong Zhang (zhangzy@cslt.riit.tsinghua.edu.cn), Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China; 4. Dong Wang (wangdong99@mails.tsinghua.edu.cn), Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China |
| ContentType | Conference Proceeding |
| EISBN | 9881476801 9789881476807 |
| EndPage | 429 |
| Genre | orig-research |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| PageCount | 4 |
| PublicationCentury | 2000 |
| PublicationDate | 2015-Dec. |
| PublicationDateYYYYMMDD | 2015-12-01 |
| PublicationDecade | 2010 |
| PublicationTitle | 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) |
| PublicationTitleAbbrev | APSIPA |
| PublicationYear | 2015 |
| Publisher | Asia-Pacific Signal and Information Processing Association |
| StartPage | 426 |
| SubjectTerms | d-vector; Data models; Machine learning; Mel frequency cepstral coefficient; Speaker recognition; time dynamic warping; Training; Training data |
| Title | Improved deep speaker feature learning for text-dependent speaker recognition |
| URI | https://ieeexplore.ieee.org/document/7415306 |
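
The abstract mentions a scoring approach based on dynamic time warping (DTW). As a rough illustration of the general technique only, the sketch below aligns two sequences of frame-level feature vectors with textbook DTW and returns the accumulated alignment cost. This record does not describe the paper's actual features, local distance, or score normalization, so the cosine distance, the unnormalized cost, and the names `cosine_distance` and `dtw_distance` are all assumptions made for this example.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors (assumed local distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat a zero vector as maximally dissimilar to avoid division by zero.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(seq_a, seq_b, dist=cosine_distance):
    """Accumulated cost of the best DTW alignment between two vector sequences.

    Hypothetical helper for illustration; not the scoring function of the paper.
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")

    # cost[i][j] holds the best accumulated cost of aligning the first i frames
    # of seq_a with the first j frames of seq_b.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in seq_a only
                cost[i][j - 1],      # advance in seq_b only
                cost[i - 1][j - 1],  # advance in both sequences
            )
    return cost[n][m]
```

In a text-dependent setting, `seq_a` and `seq_b` could be per-frame speaker features from an enrollment and a test utterance of the same phrase, with a lower accumulated cost suggesting a closer match. In practice the cost is often normalized by the alignment path length, which this sketch omits.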