On the string matching with k mismatches

Bibliographic Details
Published in Theoretical Computer Science Vol. 726; pp. 5-29
Main Authors Chen, Yangjun; Wu, Yujia
Format Journal Article
Language English
Published Elsevier B.V. 23.05.2018
ISSN 0304-3975
1879-2294
DOI 10.1016/j.tcs.2018.02.001

Summary: In this paper, we discuss an efficient and effective index mechanism for string matching with k mismatches, by which we find all the substrings of a target string s that differ from a pattern string r in at most k positions. The main idea is to use the Burrows–Wheeler transformation of s, denoted BWT(s), as an index against which r is searched. During the search, precomputed mismatch information about r is used to speed up the navigation of BWT(s). In this way, the time complexity can be reduced to O(kn′ + n + m log m), where m = |r|, n = |s|, and n′ is the number of leaf nodes of a tree structure, called a mismatching tree, produced during a search of BWT(s). In the case of m ≥ 2(k + 1), the average value of n′ is bounded by O((1 + 1/|Σ|)^(k+1)), where Σ is the alphabet from which the symbols of the target and pattern strings are taken. Extensive experiments have been conducted, which show that our method for this problem is promising.

Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Non-numerical Algorithms and Problems: Pattern matching; computation on discrete structures

General Terms: Databases, Algorithms, Performance

• A new indexing method based on the Burrows–Wheeler transformation is proposed for solving string matching with k mismatches.
• The mismatching information of a pattern string is integrated into the search of a BWT array.
• Some new concepts are introduced, such as search trees, mismatching paths, and mismatching trees.
• A solid mathematical analysis of the average time complexity of the method is provided.
• Extensive experiments have been conducted to compare our method with the most closely related approaches in the literature, and they clearly show the advantage of our method.
• An elaborate example is designed to explain all the technically difficult points.
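
To make the setting in the summary concrete, the following is a minimal Python sketch of k-mismatch search over a BWT index. It is not the authors' algorithm: it is the plain backtracking backward search that explores every branch of the search tree, without the precomputed mismatch information of r or the mismatching tree described in the summary. The function names `build_index` and `k_mismatch_search` are hypothetical, the suffix array is built naively, and the sentinel `$` is assumed not to occur in s.

```python
from collections import Counter


def build_index(s):
    """Return (suffix array, C table, Occ table) for s + '$'."""
    t = s + "$"  # sentinel, assumed absent from s and smaller than every symbol of s
    # Naive suffix array construction; fine for a sketch, too slow for real data.
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    bwt = [t[i - 1] for i in sa]  # character preceding each sorted suffix
    alphabet = sorted(set(t))
    # C[c] = number of characters in t strictly smaller than c
    counts = Counter(t)
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(t) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def k_mismatch_search(s, r, k):
    """Start positions in s of substrings within Hamming distance k of r."""
    sa, C, occ = build_index(s)
    symbols = [c for c in C if c != "$"]
    hits = set()

    def extend(j, lo, hi, budget):
        # [lo, hi) is the suffix-array interval of suffixes that start with
        # a string matching r[j+1:] up to the mismatches already spent.
        if j < 0:
            hits.update(sa[lo:hi])
            return
        for c in symbols:
            cost = 0 if c == r[j] else 1
            if cost > budget:
                continue
            # Standard backward-search step: prepend c to the matched string.
            new_lo = C[c] + occ[c][lo]
            new_hi = C[c] + occ[c][hi]
            if new_lo < new_hi:
                extend(j - 1, new_lo, new_hi, budget - cost)

    extend(len(r) - 1, 0, len(s) + 1, k)
    return sorted(hits)


if __name__ == "__main__":
    print(k_mismatch_search("acagaca", "aca", 1))
```

The number of branches this baseline explores grows quickly with k, which is the cost that the paper's use of mismatch information about r is reported to reduce.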