An Ultra-Fast and Parallelizable Algorithm for Finding [Formula Omitted]-Mismatch Shortest Unique Substrings

Bibliographic Details
Published in IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 18, no. 1, p. 138
Main Authors Allen, Daniel R; Thankachan, Sharma V; Xu, Bojian
Format Journal Article
Language English
Published New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2021
ISSN 1545-5963, 1557-9964
DOI 10.1109/TCBB.2020.2968531

Summary: This paper revisits the [Formula Omitted]-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented for the [Formula Omitted]-mismatch average common substring problem can be adapted and combined with parts of the existing solution. The result is a new algorithm with an expected time complexity of [Formula Omitted] and a practical space complexity of [Formula Omitted], where [Formula Omitted] is the string length. When [Formula Omitted], which is the hard case, the new proposal significantly improves on the any-case [Formula Omitted] time complexity of the prior best method for [Formula Omitted]-mismatch shortest unique substring finding. An experimental study shows that the new algorithm is practical to implement and substantially faster than the prior best solution's implementation when [Formula Omitted] is small relative to [Formula Omitted]. For example, the method processes a 200 KB sample DNA sequence with [Formula Omitted] in just 0.18 seconds, compared to 174.37 seconds with the prior best solution. Further, significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, which yields a further significant practical performance improvement. For example, when using 8 cores, both parallel implementations achieved processing times of less than [Formula Omitted] of the serial implementation's time cost when processing a 10 MB sample DNA sequence with [Formula Omitted]. In an age where instances with thousands of gigabytes of RAM are readily available through Cloud infrastructure providers, the trade-off of additional memory usage for significantly improved processing times is likely to be desirable to, and needed by, many users. For example, the best prior solution may take years to process a 200 MB DNA sample for any [Formula Omitted], while the new proposal, using 24 cores, can process a sample of this size with [Formula Omitted] in 206.376 seconds with a peak memory usage of 46 GB, which is both easily available and affordable on the Cloud. This efficient and practical algorithm for [Formula Omitted]-mismatch shortest unique substring finding is expected to prove useful to those applying the measure to long sequences in fields such as computational biology. The paper also gives a theoretical bound showing that the [Formula Omitted]-mismatch shortest unique substring finding problem can be solved using [Formula Omitted] time and [Formula Omitted] space, asymptotically much better than the implemented algorithm, which serves as a new discovery of interest.
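
To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm from the paper, whose formulas are omitted from this record. It assumes the commonly used definition: a substring is k-mismatch unique if no other substring of the same length, starting at a different position, is within Hamming distance k of it, and the task is to find a shortest such substring covering a given position. The function names `hamming`, `is_k_unique`, and `k_mismatch_sus` are hypothetical, chosen for illustration only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_k_unique(s, i, j, k):
    """Return True if s[i:j] has no other occurrence within Hamming distance k.

    Assumed definition: every same-length substring starting at a
    different position must differ from s[i:j] in more than k places.
    """
    m = j - i
    target = s[i:j]
    for i2 in range(len(s) - m + 1):
        if i2 != i and hamming(target, s[i2:i2 + m]) <= k:
            return False
    return True


def k_mismatch_sus(s, p, k):
    """Leftmost shortest k-mismatch unique substring covering position p.

    Positions are 0-based; the result is a half-open interval (i, j).
    Brute force for illustration only: lengths are tried in increasing
    order, so the first hit is a shortest answer. The whole string is
    always unique, so a result exists for any valid p.
    """
    n = len(s)
    if not 0 <= p < n:
        raise ValueError("position p must satisfy 0 <= p < len(s)")
    for m in range(1, n + 1):
        # Start positions i such that s[i:i+m] covers p and fits in s.
        for i in range(max(0, p - m + 1), min(p, n - m) + 1):
            if is_k_unique(s, i, i + m, k):
                return (i, i + m)


# Usage: in "abab", both "a" and "ab" repeat exactly, so the shortest
# unique substring covering position 0 with k = 0 is "aba".
print(k_mismatch_sus("abab", 0, 0))  # (0, 3)
```

This sketch takes time far beyond what is practical for the sequence sizes quoted in the summary; it is meant only to pin down what is being computed, not how the paper computes it.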