Mapping biological entities using the longest approximately common prefix method

Bibliographic Details
Published in	BMC Bioinformatics, Vol. 15, no. 1, p. 187
Main Authors	Rudniy, Alex; Song, Min; Geller, James
Format	Journal Article
Language	English
Published	London: BioMed Central Ltd, 14.06.2014
Subjects	Algorithms; Approximate string matching; Bioinformatics; Biomedical and Life Sciences; Computational Biology - methods; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Computer Graphics; Humans; Knowledge-based analysis; Language; Life Sciences; Medical; Methodology; Methodology Article; Microarrays; Natural Language Processing; Searching; Similarity; Strings; Tasks; Terminology as Topic; Unified Medical Language System; Unified Modeling Language; Common Prefix; Unify Medical Language System; Average Precision; String Similarity; Histogram Intersection
ISSN	1471-2105
DOI	10.1186/1471-2105-15-187

Summary:	Background: The significant growth in the volume of electronic biomedical data in recent decades has pointed to the need for approximate string matching algorithms that can expedite tasks such as named entity recognition, duplicate detection, terminology integration, and spelling correction. The task of source integration in the Unified Medical Language System (UMLS) requires considerable expert effort despite the presence of various computational tools. This problem warrants the search for a new method for approximate string matching and its UMLS-based evaluation. Results: This paper introduces the Longest Approximately Common Prefix (LACP) method as an algorithm for approximate string matching that runs in linear time. We compare the LACP method for performance, precision and speed to nine other well-known string matching algorithms. As test data, we use two multiple-source samples from the Unified Medical Language System (UMLS) and two SNOMED Clinical Terms-based samples. In addition, we present a spell checker based on the LACP method. Conclusions: The Longest Approximately Common Prefix method completes its string similarity evaluations in less time than all nine string similarity methods used for comparison. The Longest Approximately Common Prefix method outperforms these nine approximate string matching methods in its maximum F1 measure when evaluated on three of the four datasets, and in its average precision on two of the four datasets.
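
The summary reports results as maximum F1 and average precision. As a rough illustration of how such figures are commonly computed for any string similarity method, the sketch below uses the standard textbook definitions over a list of similarity scores and gold match labels. It does not implement LACP itself, and the paper's exact evaluation protocol is not described in this record; the function names and the sample scores are hypothetical.

```python
from typing import Sequence


def max_f1(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Best F1 over all thresholds, predicting 'match' when score >= threshold.

    Assumption: standard definition; the paper may sweep thresholds differently.
    """
    total_pos = sum(1 for y in labels if y)
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        if tp == 0:
            continue  # F1 is 0 when nothing correct is retrieved
        precision = tp / (tp + fp)
        recall = tp / total_pos
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def average_precision(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Mean of precision@rank taken at the rank of each true match.

    Pairs are ranked by descending score; tied scores keep their input order.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, is_match) in enumerate(ranked, start=1):
        if is_match:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical similarity scores for five candidate term pairs,
# with gold labels saying whether each pair is a true match.
example_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
example_labels = [True, False, True, True, False]

print(round(max_f1(example_scores, example_labels), 3))
print(round(average_precision(example_scores, example_labels), 3))
```

Both measures depend only on how a method ranks candidate pairs, which is what allows methods with differently scaled similarity scores to be compared on the same dataset.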