Nonparallel Spoken-Text-Style Transfer for Linguistic Expression Control in Speech Generation

Bibliographic Details
Published in: IEEE Transactions on Audio, Speech and Language Processing, Vol. 33, pp. 333-346
Main Authors: Yoshioka, Daiki; Yasuda, Yusuke; Toda, Tomoki
Format: Journal Article
Language: English
Published: IEEE, 2025
ISSN: 2998-4173
DOI: 10.1109/TASLPRO.2024.3522757

Summary: Text style transfer is the task of converting the style of a text while preserving its content. Under the nonparallel training condition, however, the task remains challenging, even for general-purpose large language models (LLMs). This study aims to improve the performance of text style transfer using task-specific methods in a labeled nonparallel condition. We target the transfer of spoken styles in the text domain to enable more flexible style control of synthetic speech when combined with text-to-speech (TTS) synthesis. We propose a method for preserving "content words" to improve content preservation, and incorporate it into a conditional variational autoencoder that captures style information from the labeled nonparallel corpus. Introducing positional embedding and cycle learning yielded further improvements in both content preservation and style control performance. We conducted bi-directional transfer experiments on Japanese texts, focusing on "disfluency removal/insertion" and "standard/Kansai dialect conversion" as target styles. Our experimental evaluations reveal that: 1) the proposed method improves content preservation and style control performance; 2) it achieves higher content preservation and computational efficiency than large language models while offering comparable style control performance; 3) it has a positive perceptual impact on generated speech in terms of style reproducibility when applied as a preprocessing step for TTS.
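
The record does not include implementation details, but the abstract outlines a concrete recipe: a style-conditional variational autoencoder trained on labeled nonparallel text, with a content-word preservation term and cycle learning. As a rough, non-authoritative sketch of that recipe (not the authors' implementation; every module name, dimension, and loss weight below is an assumption, and positional embedding is omitted for brevity), a PyTorch version might look like this:

# Minimal sketch of a style-conditional VAE objective with content-word
# preservation and cycle learning, loosely following the abstract.
# All architecture choices, names, and weights are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleCVAE(nn.Module):
    def __init__(self, vocab_size, n_styles, d_model=256, d_latent=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.style_embed = nn.Embedding(n_styles, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.to_mu = nn.Linear(d_model, d_latent)
        self.to_logvar = nn.Linear(d_model, d_latent)
        self.latent_to_h = nn.Linear(d_latent, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def encode(self, tokens):
        _, h = self.encoder(self.embed(tokens))   # h: (1, B, d_model)
        h = h.squeeze(0)
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, tokens, z, style_id):
        # Teacher-forced decoding conditioned on latent z and target style.
        h0 = (self.latent_to_h(z) + self.style_embed(style_id)).unsqueeze(0)
        out, _ = self.decoder(self.embed(tokens), h0)
        return self.out(out)                      # (B, T, vocab)

def cvae_loss(model, tokens, style_id, content_mask,
              kl_weight=0.1, content_weight=2.0):
    # ELBO with an extra penalty that up-weights reconstruction of
    # positions flagged as content words (content_mask == 1), a
    # stand-in for the paper's content-word preservation idea.
    mu, logvar = model.encode(tokens)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
    logits = model.decode(tokens[:, :-1], z, style_id)
    targets = tokens[:, 1:]
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    weights = 1.0 + content_weight * content_mask[:, 1:].float()
    recon = (nll * weights).mean()
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl

def cycle_loss(model, tokens, src_style, tgt_style):
    # Simplified cycle learning: transfer to the target style, re-encode
    # the greedy output, transfer back, and ask the round trip to
    # reconstruct the source. Gradients stop at the hard argmax here.
    mu, _ = model.encode(tokens)
    fwd_logits = model.decode(tokens[:, :-1], mu, tgt_style)
    pseudo = fwd_logits.argmax(-1)
    mu2, _ = model.encode(pseudo)
    back_logits = model.decode(tokens[:, :-1], mu2, src_style)
    return F.cross_entropy(back_logits.transpose(1, 2), tokens[:, 1:])

# Hypothetical usage on dummy data:
# model = StyleCVAE(vocab_size=8000, n_styles=2)
# tokens = torch.randint(0, 8000, (4, 16))
# mask = torch.randint(0, 2, (4, 16))
# style = torch.zeros(4, dtype=torch.long)
# loss = cvae_loss(model, tokens, style, mask) \
#        + 0.5 * cycle_loss(model, tokens, style, 1 - style)

Weighting the reconstruction loss on content-word positions is only one way to realize content-word preservation; the paper may instead copy or constrain those tokens more directly. Likewise, the cycle term above stops gradients at the hard argmax, whereas practical systems often keep the round trip differentiable with techniques such as Gumbel-softmax sampling.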