A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units

Bibliographic Details
Published in IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, no. 5, pp. 1278-1288
Main Authors Tiomkin, S; Malah, D; Shechtman, S; Kons, Z
Format Journal Article
Language English
Published Piscataway, NJ: IEEE (Institute of Electrical and Electronics Engineers), 01.07.2011
ISSN 1558-7916
EISSN 1558-7924
DOI 10.1109/TASL.2010.2089679

Abstract Concatenative synthesis and statistical synthesis are the two main approaches to text-to-speech (TTS) synthesis. Concatenative TTS (CTTS) stores segments of natural speech features, selected from a recorded speech database. Consequently, CTTS systems can synthesize speech of natural quality. However, as the footprint of the stored data is reduced, the desired segments are not always available in the stored data, and audible discontinuities may result. Statistical TTS (STTS) systems, on the other hand, despite having a smaller footprint than CTTS, synthesize speech that is free of such discontinuities. Yet, in general, STTS produces less natural speech than CTTS, as it often sounds muffled. The muffling effect is due to over-smoothing of the model-generated speech features. To gain the advantages of both approaches, we propose in this work to combine CTTS and STTS into a hybrid TTS (HTTS) system. Each utterance representation in HTTS is constructed from natural segments and model-generated segments in an interwoven fashion via a hybrid dynamic path algorithm. Reported listening tests demonstrate the validity of the proposed approach.
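The abstract does not spell out the hybrid dynamic path algorithm, so the following is only a minimal sketch of the general idea, assuming a Viterbi-style search over a lattice in which every slot offers both natural and model-generated candidates. The names `Candidate`, `join_cost`, and `hybrid_path`, the scalar edge features, and all cost values are hypothetical illustrations and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Candidate:
    kind: str           # "natural" or "model" (hypothetical labels)
    start: float        # feature value at the segment's left edge
    end: float          # feature value at the segment's right edge
    target_cost: float  # assumed penalty, e.g. higher for over-smoothed model units


def join_cost(left: Candidate, right: Candidate) -> float:
    """Assumed discontinuity measure: feature mismatch at the boundary."""
    return abs(left.end - right.start)


def hybrid_path(lattice: Sequence[Sequence[Candidate]]) -> List[Candidate]:
    """Return the minimum-cost sequence, one candidate per slot."""
    # best[t][j]: cheapest total cost of a path ending at candidate j of slot t
    best = [[c.target_cost for c in lattice[0]]]
    back = [[-1] * len(lattice[0])]
    for t in range(1, len(lattice)):
        costs, ptrs = [], []
        for cand in lattice[t]:
            options = [best[t - 1][i] + join_cost(prev, cand)
                       for i, prev in enumerate(lattice[t - 1])]
            i_best = min(range(len(options)), key=options.__getitem__)
            costs.append(options[i_best] + cand.target_cost)
            ptrs.append(i_best)
        best.append(costs)
        back.append(ptrs)
    # Trace back from the cheapest final candidate
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for t in range(len(lattice) - 1, -1, -1):
        path.append(lattice[t][j])
        j = back[t][j]
    return path[::-1]


# Toy lattice: the natural unit in the middle slot joins badly (1.0 vs 3.0),
# so the search is expected to fall back to the model-generated unit there.
demo = [
    [Candidate("natural", 0.0, 1.0, 0.1), Candidate("model", 0.0, 1.0, 0.5)],
    [Candidate("natural", 3.0, 2.0, 0.1), Candidate("model", 1.0, 2.0, 0.5)],
    [Candidate("natural", 2.0, 3.0, 0.1), Candidate("model", 2.0, 3.0, 0.5)],
]
print([c.kind for c in hybrid_path(demo)])
```

With these made-up costs the selected path alternates natural, model, natural, which illustrates how natural and model-generated segments might be interwoven when a natural segment would otherwise cause an audible discontinuity.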
Author Tiomkin, S (stasti@gmail.com), Dept. of Electr. Eng., Technion - Israel Inst. of Technol., Haifa, Israel
Malah, D (malah@ee.technion.ac.il), Dept. of Electr. Eng., Technion - Israel Inst. of Technol., Haifa, Israel
Shechtman, S (slava@il.ibm.com), Res. Lab., Speech Technol. Group, IBM, Haifa, Israel
Kons, Z (zvi@il.ibm.com), Res. Lab., Speech Technol. Group, IBM, Haifa, Israel
CODEN ITASD8
Copyright 2015 INIST-CNRS
Discipline Engineering
Applied Sciences
EISSN 1558-7924
Genre orig-research
IsPeerReviewed true
IsScholarly true
Keywords statistical TTS
Segmentation
Sound quality
dynamic path
hybrid TTS
Speech synthesis
Algorithm
Hybrid system
Concatenative text-to-speech (CTTS)
Linguistic analysis
Concatenation
TTS synthesis
Database
System synthesis
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
CC BY 4.0
PageCount 11
SubjectTerms Applied sciences
Concatenative text-to-speech (CTTS)
Discontinuity
dynamic path
Dynamics
Exact sciences and technology
Footprints
Heuristic algorithms
Hidden Markov models
Hybrid power systems
hybrid TTS
Information, signal and communications theory
Natural languages
Segments
Signal processing
Speech
Speech processing
Speech recognition
statistical TTS
Synthesis
Telecommunications and information theory
TTS synthesis
URI https://ieeexplore.ieee.org/document/5609194
https://www.proquest.com/docview/1671420945