Similar Terms Grouping Yields Faster Terminological Saturation

Bibliographic Details
Published in	Information and Communication Technologies in Education, Research, and Industrial Applications, Vol. 1007, pp. 43-70
Main Authors	Kosa, Victoria; Chaves-Fraga, David; Keberle, Nataliya; Birukou, Aliaksandr
Format	Book Chapter
Language	English
Published	Switzerland: Springer International Publishing AG, 2019
Series	Communications in Computer and Information Science
Subjects	Automated term extraction; Bag of terms; OntoElect; String similarity measure; Terminological difference; Terminological saturation
ISBN	303013928X 9783030139285
ISSN	1865-0929 1865-0937
DOI	10.1007/978-3-030-13929-2_3

Summary:	This paper reports on a refinement of the algorithm for measuring terminological difference (THD) between text datasets. The baseline THD algorithm, developed in the OntoElect project, used exact string matches for term comparison. In this work, it has been refined with appropriately selected string similarity measures (SSMs) that group terms which look similar as text strings and presumably have similar meanings. To determine rational term similarity thresholds for the chosen SSMs, the measures were implemented as software functions and evaluated on a test set of English term pairs developed for this purpose. The refined algorithm implementation was then evaluated against the baseline THD algorithm, using bags of terms extracted from three document collections of scientific papers belonging to different subject domains. The experiment revealed that, compared to the baseline, the refined THD algorithm reached terminological saturation more quickly and on more compact sets of source documents, though at the expense of noticeably higher computation time.
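
The grouping idea in the summary can be illustrated with a minimal Python sketch. It is not the OntoElect implementation: the similarity function (the standard library's `difflib.SequenceMatcher`), the threshold of 0.85, the greedy grouping strategy, and the names `similarity` and `group_similar_terms` are all assumptions made for illustration, since this record does not name the SSMs or thresholds selected in the paper.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (an assumption, not the paper's SSM)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar_terms(bag, threshold=0.85):
    """Greedily merge terms whose similarity to a group's representative
    meets the threshold; the scores of merged terms are summed."""
    groups = []  # each group: [representative, total_score, members]
    # Visit higher-scored terms first so that they become representatives.
    for term, score in sorted(bag.items(), key=lambda kv: -kv[1]):
        for group in groups:
            if similarity(term, group[0]) >= threshold:
                group[1] += score
                group[2].append(term)
                break
        else:
            # No sufficiently similar group exists: start a new one.
            groups.append([term, score, [term]])
    return groups


# Hypothetical bag of terms with made-up significance scores.
bag = {"ontology learning": 5.0, "ontology learnings": 2.0, "term extraction": 3.0}
grouped = group_similar_terms(bag)
```

In this sketch, setting `threshold=1.0` approximates the exact-match behaviour attributed to the baseline (up to letter case), while lower thresholds merge near-duplicate terms into one group. The pairwise comparisons also hint at why a similarity-based variant would cost more computation than exact matching.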