Voice conversion from non-parallel corpora using variational auto-encoder

Bibliographic Details
Published in	2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA) pp. 1 - 6
Main Authors	Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, Hsin-Min Wang
Format	Conference Proceeding
Language	English
Published	Asia Pacific Signal and Information Processing Association 01.12.2016
Subjects	Adaptation models; Artificial neural networks; Decoding; Speech; Speech recognition; Training
DOI	10.1109/APSIPA.2016.7820786

Abstract	We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora. Many SC frameworks require parallel corpora, phonetic alignments, or explicit frame-wise correspondence for learning conversion functions or for synthesizing a target spectrum with the aid of alignments. However, these requirements severely limit the scope of practical applications of SC due to the scarcity or even unavailability of parallel corpora. We propose an SC framework based on a variational auto-encoder, which enables us to exploit non-parallel corpora. The framework comprises an encoder that learns speaker-independent phonetic representations and a decoder that learns to reconstruct the speech of a designated speaker. It removes the requirement of parallel corpora or phonetic alignments to train a spectral conversion system. We report objective and subjective evaluations to validate our proposed method and compare it to SC methods that have access to aligned corpora.
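The encoder/decoder split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the single linear layers, the toy dimensions, the one-hot speaker code, the unit-variance Gaussian reconstruction term, and the use of the posterior mean at conversion time are all assumptions made for illustration, and every name below is hypothetical. The point it tries to show is that training only needs (frame, speaker) pairs, while conversion swaps the speaker code fed to the decoder.

```python
import math
import random

random.seed(0)

# Toy sizes chosen for illustration only (assumption, not from the paper).
FEAT_DIM, LATENT_DIM, N_SPEAKERS = 4, 2, 2

def rand_matrix(rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Untrained toy weights; a real system would learn them from non-parallel data.
W_MU = rand_matrix(LATENT_DIM, FEAT_DIM)
W_LOGVAR = rand_matrix(LATENT_DIM, FEAT_DIM)
W_DEC = rand_matrix(FEAT_DIM, LATENT_DIM + N_SPEAKERS)

def encode(frame):
    """Map a spectral frame to the mean and log-variance of q(z | x).
    The encoder never sees the speaker identity."""
    return matvec(W_MU, frame), matvec(W_LOGVAR, frame)

def sample_latent(mu, logvar):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def decode(z, speaker_id):
    """Reconstruct a frame from the latent code plus a one-hot speaker code."""
    one_hot = [1.0 if i == speaker_id else 0.0 for i in range(N_SPEAKERS)]
    return matvec(W_DEC, z + one_hot)

def vae_loss(frame, speaker_id):
    """Per-frame negative ELBO (up to an additive constant):
    squared reconstruction error plus KL(q(z|x) || N(0, I))."""
    mu, logvar = encode(frame)
    recon = decode(sample_latent(mu, logvar), speaker_id)
    recon_err = 0.5 * sum((x - r) ** 2 for x, r in zip(frame, recon))
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mu, logvar))
    return recon_err + kl

def convert(frame, target_speaker_id):
    """Encode a source frame, then decode it with the target speaker code."""
    mu, _ = encode(frame)  # assumption: use the posterior mean at conversion time
    return decode(mu, target_speaker_id)
```

In this sketch, `vae_loss(frame, speaker_id)` is what a training loop would minimize over frames from all speakers, with no alignment between them; `convert(frame, target_speaker_id)` is then applied frame by frame to a source utterance.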
Author_xml	1. Chin-Cheng Hsu; email: jeremycchsu@iis.sinica.edu.tw; organization: Inst. of Inf. Sci., Taipei, Taiwan
	2. Hsin-Te Hwang; email: hwanght@iis.sinica.edu.tw; organization: Inst. of Inf. Sci., Taipei, Taiwan
	3. Yi-Chiao Wu; email: tedwu@iis.sinica.edu.tw; organization: Inst. of Inf. Sci., Taipei, Taiwan
	4. Yu Tsao; email: yu.tsao@citi.sinica.edu.tw; organization: Res. Center for Inf. Technol. Innovation, Taipei, Taiwan
	5. Hsin-Min Wang; email: whm@iis.sinica.edu.tw; organization: Inst. of Inf. Sci., Taipei, Taiwan
EISBN	9881476828 9789881476821
Genre	orig-research
URI	https://ieeexplore.ieee.org/document/7820786