Voice conversion from non-parallel corpora using variational auto-encoder

Bibliographic Details
Published in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) pp. 1-6
Main Authors Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, Hsin-Min Wang
Format Conference Proceeding
Language English
Published Asia Pacific Signal and Information Processing Association 01.12.2016
DOI 10.1109/APSIPA.2016.7820786

Abstract We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora. Many SC frameworks require parallel corpora, phonetic alignments, or explicit frame-wise correspondence for learning conversion functions or for synthesizing a target spectrum with the aid of alignments. However, these requirements gravely limit the scope of practical applications of SC because parallel corpora are scarce or even unavailable. We propose an SC framework based on a variational auto-encoder, which enables us to exploit non-parallel corpora. The framework comprises an encoder that learns speaker-independent phonetic representations and a decoder that learns to reconstruct the designated speaker. It removes the requirement of parallel corpora or phonetic alignments to train a spectral conversion system. We report objective and subjective evaluations to validate our proposed method and compare it to SC methods that have access to aligned corpora.
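The encoder/decoder split described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the network architecture, feature type, dimensions, or loss details, so the linear layers, the one-hot speaker code, the squared-error reconstruction term, and all names (encode, decode, convert, vae_loss) are hypothetical stand-ins, and the weights are random rather than trained.

```python
import math
import random

# Hypothetical sizes chosen only for illustration (not taken from the paper).
FRAME_DIM, LATENT_DIM, NUM_SPEAKERS = 8, 3, 2

rng = random.Random(0)


def random_matrix(rows, cols):
    """Small random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


W_MEAN = random_matrix(LATENT_DIM, FRAME_DIM)
W_LOGVAR = random_matrix(LATENT_DIM, FRAME_DIM)
W_DEC = random_matrix(FRAME_DIM, LATENT_DIM + NUM_SPEAKERS)


def one_hot(speaker_id):
    """Speaker identity as a one-hot vector (an assumed encoding)."""
    return [1.0 if i == speaker_id else 0.0 for i in range(NUM_SPEAKERS)]


def encode(frame):
    """Encoder: spectral frame -> mean and log-variance of the latent code.

    It receives no speaker label, so the latent code is meant to carry
    speaker-independent (phonetic) content.
    """
    return matvec(W_MEAN, frame), matvec(W_LOGVAR, frame)


def sample_latent(mean, logvar):
    """Reparameterization: z = mean + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, logvar)]


def decode(latent, speaker_id):
    """Decoder: latent code plus speaker code -> reconstructed frame."""
    return matvec(W_DEC, latent + one_hot(speaker_id))


def vae_loss(frame, speaker_id):
    """Per-frame training loss: reconstruction error plus KL to N(0, I).

    Each frame is reconstructed for its own speaker, so no parallel
    utterances or alignments are needed. Squared error is an assumption.
    """
    mean, logvar = encode(frame)
    recon = decode(sample_latent(mean, logvar), speaker_id)
    recon_err = sum((x - y) ** 2 for x, y in zip(frame, recon))
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mean, logvar))
    return recon_err + kl


def convert(frame, target_speaker):
    """Conversion: encode a source frame, decode with the target speaker code."""
    mean, _ = encode(frame)
    return decode(mean, target_speaker)
```

In this sketch, conversion differs from reconstruction only in the speaker code handed to the decoder: convert(frame, 1) reuses the latent code of a source frame and swaps in speaker 1's identity. Using the latent mean rather than a sample at conversion time is a design choice of the sketch, not something the abstract states.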
Author Chin-Cheng Hsu
Hsin-Te Hwang
Hsin-Min Wang
Yu Tsao
Yi-Chiao Wu
Author_xml – sequence: 1
  surname: Hsu
  fullname: Chin-Cheng Hsu
  email: jeremycchsu@iis.sinica.edu.tw
  organization: Inst. of Inf. Sci., Taipei, Taiwan
– sequence: 2
  surname: Hwang
  fullname: Hsin-Te Hwang
  email: hwanght@iis.sinica.edu.tw
  organization: Inst. of Inf. Sci., Taipei, Taiwan
– sequence: 3
  surname: Wu
  fullname: Yi-Chiao Wu
  email: tedwu@iis.sinica.edu.tw
  organization: Inst. of Inf. Sci., Taipei, Taiwan
– sequence: 4
  surname: Tsao
  fullname: Yu Tsao
  email: yu.tsao@citi.sinica.edu.tw
  organization: Res. Center for Inf. Technol. Innovation, Taipei, Taiwan
– sequence: 5
  surname: Wang
  fullname: Hsin-Min Wang
  email: whm@iis.sinica.edu.tw
  organization: Inst. of Inf. Sci., Taipei, Taiwan
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/APSIPA.2016.7820786
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Xplore Digital Library
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 9881476828
9789881476821
EndPage 6
ExternalDocumentID 7820786
Genre orig-research
GroupedDBID 6IE
6IL
CBEJK
RIE
RIL
ID FETCH-LOGICAL-i241t-babe301838f5d5aa99ab1dd61a319c8b0e9e6069f21bc3b39073382768a9bb13
IEDL.DBID RIE
IngestDate Thu Jun 29 18:38:22 EDT 2023
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i241t-babe301838f5d5aa99ab1dd61a319c8b0e9e6069f21bc3b39073382768a9bb13
PageCount 6
ParticipantIDs ieee_primary_7820786
PublicationCentury 2000
PublicationDate 2016-12
PublicationDateYYYYMMDD 2016-12-01
PublicationDate_xml – month: 12
  year: 2016
  text: 2016-12
PublicationDecade 2010
PublicationTitle 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)
PublicationTitleAbbrev APSIPA
PublicationYear 2016
Publisher Asia Pacific Signal and Information Processing Association
Publisher_xml – name: Asia Pacific Signal and Information Processing Association
Score 2.19643
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Adaptation models
Artificial neural networks
Decoding
Speech
Speech recognition
Training
Title Voice conversion from non-parallel corpora using variational auto-encoder
URI https://ieeexplore.ieee.org/document/7820786