A multi-aspect comparison study of supervised word sense disambiguation

Bibliographic Details
Published in	Journal of the American Medical Informatics Association : JAMIA Vol. 11; no. 4; pp. 320 - 331
Main Authors	Liu, Hongfang; Teller, Virginia; Friedman, Carol
Format	Journal Article
Language	English
Published	England Elsevier Inc 01.07.2004 Oxford University Press American Medical Informatics Association
Subjects	Abbreviations as Topic; Algorithms; Artificial Intelligence; Bayes Theorem; Natural Language Processing; Original Investigations; Vocabulary, Controlled
ISSN	1067-5027, 1527-974X
DOI	10.1197/jamia.M1533

Summary:	The aim of this study was to investigate relations among different aspects of supervised word sense disambiguation (WSD; supervised machine learning for disambiguating the sense of a term in a context) and to compare supervised WSD in the biomedical domain with that in the general English domain. The study involves three data sets (a biomedical abbreviation data set, a general biomedical term data set, and a general English data set). The authors implemented three machine-learning algorithms, including (1) naïve Bayes (NBL) and decision lists (TDLL), (2) their adaptation of decision lists (ODLL), and (3) their mixed supervised learning (MSL). There were six feature representations (various combinations of collocations, bag of words, oriented bag of words, etc.) and five window sizes (2, 4, 6, 8, and 10). Supervised WSD is suitable only when there are enough sense-tagged instances, with at least a few dozen instances for each sense. Collocations combined with neighboring words are appropriate selections for the context. For terms with unrelated biomedical senses, a large window size such as the whole paragraph should be used, while for general English words a moderate window size between 4 and 10 should be used. The authors' implementation of decision list classifiers performed better than traditional decision list classifiers on the abbreviation data set, but the opposite held for the other two sets. The authors' mixed supervised learning was stable and generally better than the other classifiers on all sets. The study found that the different aspects of supervised WSD depend on each other. The experimental method presented in the study can be used to select the best supervised WSD classifier for each ambiguous term.
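As a rough illustration of how two of the aspects named in the summary interact (the window size and a bag-of-words feature representation feeding a naive Bayes classifier), here is a minimal Python sketch. It is not the authors' implementation: the function and class names, the add-one smoothing, the toy sentences for the ambiguous word "cold", and the reading of "oriented bag of words" as left/right-tagged context words are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def window_features(tokens, target_index, window_size, oriented=False):
    """Collect the words within `window_size` positions of the target word.

    With oriented=True each word is tagged with the side it occurs on
    (an assumed reading of "oriented bag of words").
    """
    start = max(0, target_index - window_size)
    end = min(len(tokens), target_index + window_size + 1)
    features = []
    for i in range(start, end):
        if i == target_index:
            continue  # the ambiguous word itself is not a context feature
        word = tokens[i].lower()
        if oriented:
            word = ("L_" if i < target_index else "R_") + word
        features.append(word)
    return features


class NaiveBayesWSD:
    """Multinomial naive Bayes over context words, with add-one smoothing."""

    def __init__(self):
        self.sense_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, instances):
        # instances: iterable of (feature_list, sense_label) pairs
        for features, sense in instances:
            self.sense_counts[sense] += 1
            self.feature_counts[sense].update(features)
            self.vocabulary.update(features)

    def predict(self, features):
        total = sum(self.sense_counts.values())
        vocab_size = len(self.vocabulary)
        best_sense, best_score = None, float("-inf")
        for sense, count in self.sense_counts.items():
            score = math.log(count / total)  # log prior of the sense
            denom = sum(self.feature_counts[sense].values()) + vocab_size
            for f in features:
                score += math.log((self.feature_counts[sense][f] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense


def make_instance(sentence, target, window_size, oriented=False):
    """Tokenize on whitespace and build features around the first `target`."""
    tokens = sentence.split()
    return window_features(tokens, tokens.index(target), window_size, oriented)


# Hypothetical toy data: two senses of "cold".
TRAINING = [
    ("the patient caught a cold and has a cough", "illness"),
    ("she has a bad cold with fever", "illness"),
    ("the sample was stored in cold water", "temperature"),
    ("apply a cold compress to the skin", "temperature"),
]

WINDOW_SIZE = 4  # one of the window sizes listed in the summary

model = NaiveBayesWSD()
model.train((make_instance(s, "cold", WINDOW_SIZE), sense) for s, sense in TRAINING)
test_features = make_instance("he developed a cold with cough and fever", "cold", WINDOW_SIZE)
print(model.predict(test_features))
```

Changing `WINDOW_SIZE` or passing `oriented=True` to `make_instance` changes which context words the classifier sees, which is the kind of dependency between aspects that the study examines. A real evaluation would need the few dozen sense-tagged instances per sense that the summary calls for, not four toy sentences.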
Bibliography:	Supported in part by NLM grant LM06274 and NSF grant NSF 0312250.