A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units
          | Published in | IEEE transactions on audio, speech, and language processing Vol. 19; no. 5; pp. 1278 - 1288 | 
|---|---|
| Main Authors | Tiomkin, S; Malah, D; Shechtman, S; Kons, Z | 
| Format | Journal Article | 
| Language | English | 
| Published | Piscataway, NJ; IEEE; 01.07.2011; Institute of Electrical and Electronics Engineers | 
| ISSN | 1558-7916 1558-7924  | 
| DOI | 10.1109/TASL.2010.2089679 | 
| Abstract | Concatenative synthesis and statistical synthesis are the two main approaches to text-to-speech (TTS) synthesis. Concatenative TTS (CTTS) stores natural speech feature segments, selected from a recorded speech database. Consequently, CTTS systems enable speech synthesis with natural quality. However, as the footprint of the stored data is reduced, desired segments are not always available in the stored data, and audible discontinuities may result. On the other hand, statistical TTS (STTS) systems, in spite of having a smaller footprint than CTTS, synthesize speech that is free of such discontinuities. Yet, in general, STTS produces lower quality speech than CTTS, in terms of naturalness, as it often sounds muffled. The muffling effect is due to over-smoothing of model-generated speech features. In order to gain from the advantages of each of the two approaches, we propose in this work to combine CTTS and STTS into a hybrid TTS (HTTS) system. Each utterance representation in HTTS is constructed from natural segments and model-generated segments in an interweaved fashion via a hybrid dynamic path algorithm. Reported listening tests demonstrate the validity of the proposed approach. | 
    
|---|---|
| Author | Malah, D Shechtman, S Kons, Z Tiomkin, S  | 
    
| Author_xml | 1. Tiomkin, S (stasti@gmail.com), Dept. of Electr. Eng., Technion - Israel Inst. of Technol., Haifa, Israel; 2. Malah, D (malah@ee.technion.ac.il), Dept. of Electr. Eng., Technion - Israel Inst. of Technol., Haifa, Israel; 3. Shechtman, S (slava@il.ibm.com), Res. Lab., Speech Technol. Group, IBM, Haifa, Israel; 4. Kons, Z (zvi@il.ibm.com), Res. Lab., Speech Technol. Group, IBM, Haifa, Israel | 
    
| CODEN | ITASD8 | 
    
| Copyright | 2015 INIST-CNRS | 
    
| Discipline | Engineering Applied Sciences  | 
    
| IsPeerReviewed | true | 
    
| IsScholarly | true | 
    
| Keywords | statistical TTS; Segmentation; Sound quality; dynamic path; hybrid TTS; Speech synthesis; Algorithm; Hybrid system; Concatenative text-to-speech (CTTS); Linguistic analysis; Concatenation; TTS synthesis; Database; System synthesis | 
    
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html CC BY 4.0  | 
    
| SubjectTerms | Applied sciences; Concatenative text-to-speech (CTTS); Discontinuity; dynamic path; Dynamics; Exact sciences and technology; Footprints; Heuristic algorithms; Hidden Markov models; Hybrid power systems; hybrid TTS; Information, signal and communications theory; Natural languages; Segments; Signal processing; Speech; Speech processing; Speech recognition; statistical TTS; Synthesis; Telecommunications and information theory; TTS synthesis | 
    