Nonparallel Spoken-Text-Style Transfer for Linguistic Expression Control in Speech Generation

Bibliographic Details
Published in: IEEE Transactions on Audio, Speech and Language Processing, Vol. 33, pp. 333-346
Main Authors: Yoshioka, Daiki; Yasuda, Yusuke; Toda, Tomoki
Format: Journal Article
Language: English
Published: IEEE, 2025
ISSN: 2998-4173
DOI: 10.1109/TASLPRO.2024.3522757

Summary: Text style transfer is the task of converting the style of a text while preserving its content. Under the nonparallel training condition, however, the task remains challenging, even for general-purpose large language models (LLMs). This study aims to improve the performance of text style transfer using task-specific methods in a labeled nonparallel condition. We target the transfer of spoken styles in the text domain to enable more flexible style control of synthetic speech when combined with text-to-speech (TTS) synthesis. We propose a method for preserving "content words" to improve content preservation, and incorporate it into a conditional variational autoencoder that captures style information from the labeled nonparallel corpus. Introducing positional embedding and cycle learning yielded further improvements in both content preservation and style control performance. We conducted bi-directional transfer experiments on Japanese texts, focusing on "disfluency removal/insertion" and "standard/Kansai dialect conversion" as target styles. Our experimental evaluations reveal that: 1) the proposed method improves content preservation and style control performance; 2) it achieves higher content preservation and computational efficiency than large language models while offering comparable style control performance; 3) it has a positive perceptual impact on generated speech in terms of style reproducibility when applied as a preprocessing step for TTS.
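
The record does not include implementation details, but the abstract outlines a concrete recipe: a style-conditional variational autoencoder trained on labeled nonparallel text, with a content-word preservation term and cycle learning. As a rough, non-authoritative sketch of that recipe (not the authors' implementation; every module name, dimension, and loss weight below is an assumption, and positional embedding is omitted for brevity), a PyTorch version might look like this:

# Minimal sketch of a style-conditional VAE objective with content-word
# preservation and cycle learning, loosely following the abstract.
# All architecture choices, names, and weights are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleCVAE(nn.Module):
    def __init__(self, vocab_size, n_styles, d_model=256, d_latent=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.style_embed = nn.Embedding(n_styles, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.to_mu = nn.Linear(d_model, d_latent)
        self.to_logvar = nn.Linear(d_model, d_latent)
        self.latent_to_h = nn.Linear(d_latent, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def encode(self, tokens):
        _, h = self.encoder(self.embed(tokens))   # h: (1, B, d_model)
        h = h.squeeze(0)
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, tokens, z, style_id):
        # Teacher-forced decoding conditioned on latent z and target style.
        h0 = (self.latent_to_h(z) + self.style_embed(style_id)).unsqueeze(0)
        out, _ = self.decoder(self.embed(tokens), h0)
        return self.out(out)                      # (B, T, vocab)

def cvae_loss(model, tokens, style_id, content_mask,
              kl_weight=0.1, content_weight=2.0):
    # ELBO with an extra penalty that up-weights reconstruction of
    # positions flagged as content words (content_mask == 1), a
    # stand-in for the paper's content-word preservation idea.
    mu, logvar = model.encode(tokens)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
    logits = model.decode(tokens[:, :-1], z, style_id)
    targets = tokens[:, 1:]
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    weights = 1.0 + content_weight * content_mask[:, 1:].float()
    recon = (nll * weights).mean()
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl

def cycle_loss(model, tokens, src_style, tgt_style):
    # Simplified cycle learning: transfer to the target style, re-encode
    # the greedy output, transfer back, and ask the round trip to
    # reconstruct the source. Gradients stop at the hard argmax here.
    mu, _ = model.encode(tokens)
    fwd_logits = model.decode(tokens[:, :-1], mu, tgt_style)
    pseudo = fwd_logits.argmax(-1)
    mu2, _ = model.encode(pseudo)
    back_logits = model.decode(tokens[:, :-1], mu2, src_style)
    return F.cross_entropy(back_logits.transpose(1, 2), tokens[:, 1:])

# Hypothetical usage on dummy data:
# model = StyleCVAE(vocab_size=8000, n_styles=2)
# tokens = torch.randint(0, 8000, (4, 16))
# mask = torch.randint(0, 2, (4, 16))
# style = torch.zeros(4, dtype=torch.long)
# loss = cvae_loss(model, tokens, style, mask) \
#        + 0.5 * cycle_loss(model, tokens, style, 1 - style)

Weighting the reconstruction loss on content-word positions is only one way to realize content-word preservation; the paper may instead copy or constrain those tokens more directly. Likewise, the cycle term above stops gradients at the hard argmax, whereas practical systems often keep the round trip differentiable with techniques such as Gumbel-softmax sampling.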