Development and Enhancement of a Morpheme Analysis Corpus Program for Historical Data (역사 자료 형태소 분석 말뭉치 프로그램 개발 및 고도화)

Bibliographic Details
Published in	언어와 정보 사회 (Language and Information Society), Vol. 54, pp. 191-219
Main Authors	장요한 (Jang Yohan); 옥철영 (Ock Choelyoung); 신승용 (Shin Seungyong); 박시온 (Park Sion)
Format	Journal Article
Language	Korean
Published	Language and Information Institute, Sogang University (서강대학교 언어정보연구소), 31.03.2025
Subjects	corpus; historical data; linguistics; morpheme; morpheme analysis program; tagging; UTagger-훈민정음; UTagger-훈민정음 TCM
ISSN	1598-1886 2713-6817
DOI	10.29211/soli.2025.54..007

More Information
Summary:	This paper explains in detail how the historical data morpheme analysis program ‘UTagger-Hunminjeongeum’ was developed and enhanced. We introduce the morpheme analysis algorithm of ‘UTagger-Hunminjeongeum (ver. 0.9)’ and outline the steps taken to improve it. We also present the structure of the tagging tool ‘UTagger-Hunminjeongeum TCM’, which was developed independently to reduce manual errors and save time; the tool is used to create a small-scale morpheme-analyzed corpus for training ‘UTagger-Hunminjeongeum’. The paper further discusses enhancements made after the training phase, such as converting Chinese character tagging into Hangul and tagging tone marks (bangjeom). The program achieves an accuracy of nearly 90% on trained data and over 80% on untrained data, for an overall accuracy of 85% to 90%. With continued development and more diverse data, the program is expected to become a versatile and highly accurate morphological analysis tool.
Bibliography:	Language and Information Institute, Sogang University