A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units

Bibliographic Details
Published in	IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 5, pp. 1278-1288
Main Authors	Tiomkin, S.; Malah, D.; Shechtman, S.; Kons, Z.
Format	Journal Article
Language	English
Published	Piscataway, NJ: IEEE (Institute of Electrical and Electronics Engineers), 1 July 2011
Subjects	Algorithm; Applied sciences; Concatenation; Concatenative text-to-speech (CTTS); Database; Discontinuity; Dynamic path; Dynamics; Exact sciences and technology; Footprints; Heuristic algorithms; Hidden Markov models; Hybrid power systems; Hybrid system; Hybrid TTS; Information, signal and communications theory; Linguistic analysis; Natural languages; Segmentation; Segments; Signal processing; Sound quality; Speech; Speech processing; Speech recognition; Speech synthesis; Statistical TTS; Synthesis; System synthesis; Telecommunications and information theory; TTS synthesis
ISSN	1558-7916; EISSN 1558-7924
DOI	10.1109/TASL.2010.2089679

Abstract	Concatenative synthesis and statistical synthesis are the two main approaches to text-to-speech (TTS) synthesis. Concatenative TTS (CTTS) stores natural speech feature segments, selected from a recorded speech database. Consequently, CTTS systems enable speech synthesis with natural quality. However, as the footprint of the stored data is reduced, desired segments are not always available in the stored data, and audible discontinuities may result. On the other hand, statistical TTS (STTS) systems, in spite of having a smaller footprint than CTTS, synthesize speech that is free of such discontinuities. Yet, in general, STTS produces lower-quality speech than CTTS in terms of naturalness, as it often sounds muffled. The muffling effect is due to over-smoothing of model-generated speech features. To gain the advantages of both approaches, we propose in this work to combine CTTS and STTS into a hybrid TTS (HTTS) system. Each utterance representation in HTTS is constructed from natural segments and model-generated segments in an interwoven fashion via a hybrid dynamic path algorithm. Reported listening tests demonstrate the validity of the proposed approach.
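Illustration	The abstract does not spell out the hybrid dynamic path algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a generic two-state dynamic program in which each segment is rendered either from a natural unit ("N") or from a model-generated unit ("M"). The function name hybrid_path, the per-segment cost lists, and the switch_cost penalty for changing unit type between neighbouring segments are all hypothetical.

```python
from typing import List, Sequence, Tuple


def hybrid_path(
    natural_cost: Sequence[float],
    model_cost: Sequence[float],
    switch_cost: float,
) -> Tuple[float, List[str]]:
    """Choose 'N' (natural) or 'M' (model-generated) for every segment.

    Hypothetical cost model: each segment has one cost per unit type, and
    switching type between adjacent segments adds `switch_cost`.
    Returns the minimal total cost and the corresponding label sequence.
    """
    n = len(natural_cost)
    if n != len(model_cost):
        raise ValueError("cost lists must have the same length")
    if n == 0:
        return 0.0, []

    # Best cumulative cost of a path ending in each state at the current segment.
    cost_n, cost_m = float(natural_cost[0]), float(model_cost[0])
    # back[k - 1] holds the best predecessor state for (N at k, M at k).
    back: List[Tuple[str, str]] = []

    for k in range(1, n):
        # Best way to arrive at a natural unit for segment k.
        if cost_n <= cost_m + switch_cost:
            new_n, pred_n = cost_n, "N"
        else:
            new_n, pred_n = cost_m + switch_cost, "M"
        # Best way to arrive at a model-generated unit for segment k.
        if cost_m <= cost_n + switch_cost:
            new_m, pred_m = cost_m, "M"
        else:
            new_m, pred_m = cost_n + switch_cost, "N"
        back.append((pred_n, pred_m))
        cost_n = new_n + natural_cost[k]
        cost_m = new_m + model_cost[k]

    # Trace the cheapest path backwards from the last segment.
    state = "N" if cost_n <= cost_m else "M"
    total = min(cost_n, cost_m)
    path = [state]
    for pred_n, pred_m in reversed(back):
        state = pred_n if state == "N" else pred_m
        path.append(state)
    path.reverse()
    return total, path
```

With these assumed costs, hybrid_path([1, 5, 1], [2, 2, 2], 0.5) returns (5.0, ['N', 'M', 'N']): the middle segment, for which the natural unit fits poorly, is taken from the model, while its neighbours stay natural. Raising switch_cost to 10 makes the all-model path cheapest, which mirrors the trade-off between naturalness and discontinuities described in the abstract.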
Author Details	1. Tiomkin, S. (stasti@gmail.com), Dept. of Electr. Eng., Technion - Israel Inst. of Technol., Haifa, Israel; 2. Malah, D. (malah@ee.technion.ac.il), Dept. of Electr. Eng., Technion - Israel Inst. of Technol., Haifa, Israel; 3. Shechtman, S. (slava@il.ibm.com), Res. Lab., Speech Technol. Group, IBM, Haifa, Israel; 4. Kons, Z. (zvi@il.ibm.com), Res. Lab., Speech Technol. Group, IBM, Haifa, Israel
BackLink	http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=24286310 (record in Pascal Francis)
CODEN	ITASD8
CitedBy_id	crossref_primary_10_1007_s10772_015_9271_y crossref_primary_10_1080_02564602_2018_1432422 crossref_primary_10_1007_s10462_022_10315_0 crossref_primary_10_1007_s10772_017_9463_8 crossref_primary_10_1109_TASLP_2022_3171971 crossref_primary_10_1007_s10772_014_9263_3 crossref_primary_10_1145_2382434_2382435 crossref_primary_10_1007_s00530_020_00659_4 crossref_primary_10_1016_j_csl_2014_11_001 crossref_primary_10_1109_TASLP_2016_2537982 crossref_primary_10_1109_TASLP_2016_2598307 crossref_primary_10_1186_s13636_016_0082_0 crossref_primary_10_1016_j_csl_2016_08_004 crossref_primary_10_1016_j_compag_2020_105908
Cites_doi	10.1109/ICASSP.2006.1660161 10.1109/ICASSP.2007.367302 10.1016/S0885-2308(02)00031-1 10.1109/ICASSP.1996.541110 10.1109/5.18626 10.1016/j.specom.2009.04.004 10.1109/ICASSP.1986.1168882 10.1093/ietisy/e89-d.11.2775 10.1109/TASL.2010.2040795
ContentType	Journal Article
Copyright	2015 INIST-CNRS
Discipline	Engineering; Applied Sciences
EISSN	1558-7924
EndPage	1288
Genre	orig-research
ISSN	1558-7916
IsPeerReviewed	true
IsScholarly	true
Issue	5
Keywords	statistical TTS; Segmentation; Sound quality; dynamic path; hybrid TTS; Speech synthesis; Algorithm; Hybrid system; Concatenative text-to-speech (CTTS); Linguistic analysis; Concatenation; TTS synthesis; Database; System synthesis
Language	English
License	https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html CC BY 4.0
PageCount	11
PublicationCentury	2000
PublicationDate	2011-07-01
PublicationPlace	Piscataway, NJ
PublicationTitle	IEEE Transactions on Audio, Speech, and Language Processing
PublicationTitleAbbrev	TASL
PublicationYear	2011
Publisher	IEEE Institute of Electrical and Electronics Engineers
StartPage	1278
Title	A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units
URI	https://ieeexplore.ieee.org/document/5609194 https://www.proquest.com/docview/1671420945
Volume	19