A Study on Article-to-Broadcast Text Style Transfer (기사체-방송체 텍스트 스타일 변환 연구)

Bibliographic Details
Published in	디지털콘텐츠학회논문지 (Journal of Digital Contents Society) Vol. 24; no. 2; pp. 267-272
Main Authors	김경민(Kyung Min Kim), 임상훈(Sang Hun Im), 김기백(Gi Baeg Kim), 오흥선(Heung-Seon Oh)
Format	Journal Article
Language	Korean
Published	한국디지털콘텐츠학회 01.02.2023
Subjects	Artificial Intelligence (인공지능); Computer Science (컴퓨터학); Dataset (데이터셋); Deep Learning (딥러닝); Natural Language Processing (자연어처리); Text Style Transfer (텍스트 스타일 변환)
ISSN	1598-2009 2287-738X
DOI	10.9728/dcs.2023.24.2.267

Summary:	Converting news text to speech requires transferring article style into broadcast style. This transfer is sensitive to content corruption, and it can be carried out by handling the parentheses and sentence-final endings of article-style text. A style-token-based text style transfer model modifies only part of a sentence, so it minimizes content corruption and suits article-to-broadcast transfer. However, existing studies use non-parallel data, which makes style tokens difficult to learn. With parallel data, the parts that differ between two sentences sharing the same content can be identified clearly as style tokens, but such data is comparatively expensive to build. Prompting can reduce the amount of training data required, but existing prompting methods cannot preserve the content of the style tokens. In this paper, we construct 2,000 parallel article-broadcast examples and propose Content Marker Prompting, which preserves the content of style tokens. The method achieves an EM of 0.9778 on the article-broadcast dataset (the Korean-language abstract in this record gives 0.9978).
KCI Citation Count:	2
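
The summary states that parallel data lets the differing parts of two same-content sentences be identified as style tokens, and it reports results as EM. The following is a minimal sketch of both ideas, not the authors' implementation. It assumes that EM denotes exact match, that sentences are split on whitespace, and that the function names `differing_spans` and `exact_match` are hypothetical; the record itself does not describe how the paper extracts style tokens or computes EM.

```python
from difflib import SequenceMatcher


def differing_spans(article, broadcast):
    """Return (article_span, broadcast_span) pairs where the two sentences differ.

    Assumption: whitespace tokenization. The paper's own tokenization and
    style-token extraction are not described in this record.
    """
    a, b = article.split(), broadcast.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Anything that is not shared between the pair is a style-token candidate.
            spans.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return spans


def exact_match(predictions, references):
    """Fraction of predictions identical to their reference (assumed meaning of EM)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```

For an invented pair such as `"정부는 대책을 발표했다."` and `"정부는 대책을 발표했습니다."`, `differing_spans` isolates only the sentence-final ending, which is the kind of localized change the summary attributes to style tokens.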