Parallel Methods for Finding k-Mismatch Shortest Unique Substrings Using GPU

k-mismatch shortest unique substring (SUS) queries have recently been proposed and studied because of their useful applications in computational biology. The k-mismatch SUS query over a given position of a string asks for a shortest substring that covers the given position and does...

Bibliographic Details
Published in	IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 18, no. 1, pp. 386-395
Main Authors	Schultz, Daniel W.; Xu, Bojian
Format	Journal Article
Language	English
Published	United States 01.01.2021
Subjects	Algorithms; Computational Biology - methods; Computer Graphics; Image Processing, Computer-Assisted; Sequence Analysis, DNA - methods
Online Access	Get full text
ISSN	1545-5963 1557-9964
DOI	10.1109/TCBB.2019.2935061

Abstract	k-mismatch shortest unique substring (SUS) queries have recently been proposed and studied because of their useful applications in computational biology. The k-mismatch SUS query over a given position of a string asks for a shortest substring that covers the given position and does not have a duplicate (within a Hamming distance of k) elsewhere in the string. The challenge in SUS queries is to collectively find the SUS for every position of a massively long string in a manner that is both time- and space-efficient. All known efforts and results have focused on improving the time and space efficiency of SUS computation in the sequential CPU model. In this work, we propose the first parallel approach for k-mismatch SUS queries, leveraging the massively multi-threaded architecture of the graphics processing unit (GPU). An experimental study performed on a mid-range GPU using real-world biological data shows that our proposal is consistently faster than the fastest CPU solution by a factor of at least 6 for exact SUS queries (k = 0) and at least 23 for approximate SUS queries over DNA sequences (k > 0), while maintaining nearly the same peak memory usage as the most memory-efficient sequential CPU proposal. Our work provides practitioners with a faster tool for SUS finding on massively long strings, and indeed provides the first practical tool for approximate SUS computation, because the state-of-the-art sequential CPU method for approximate SUS queries incurs a quadratic time cost in every case and therefore does not scale well even to modestly long strings.
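The query defined in the abstract can be made concrete with a small brute-force sketch. This is only an illustration of the definition, not the authors' GPU method or any of the CPU algorithms the abstract compares against; the function names (`hamming`, `is_k_unique`, `k_mismatch_sus`), the 0-based inclusive index convention, and the leftmost tie-breaking rule are all assumptions made here for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_k_unique(s: str, start: int, length: int, k: int) -> bool:
    """True if s[start:start+length] has no other occurrence in s
    within Hamming distance k (an occurrence at a different start index)."""
    window = s[start:start + length]
    for other in range(len(s) - length + 1):
        if other != start and hamming(window, s[other:other + length]) <= k:
            return False
    return True


def k_mismatch_sus(s: str, pos: int, k: int = 0):
    """Return (start, end), 0-based and inclusive, of a shortest substring
    that covers `pos` and is k-mismatch unique. Ties are broken by taking
    the leftmost candidate (an assumption of this sketch)."""
    n = len(s)
    for length in range(1, n + 1):
        # Only windows of this length that contain `pos` are candidates.
        lo = max(0, pos - length + 1)
        hi = min(pos, n - length)
        for start in range(lo, hi + 1):
            if is_k_unique(s, start, length, k):
                return start, start + length - 1
    return None  # only reached for an empty string or an out-of-range pos


if __name__ == "__main__":
    print(k_mismatch_sus("aab", 0, 0))  # exact query (k = 0)
    print(k_mismatch_sus("aab", 2, 1))  # approximate query (k = 1)
```

The sketch tries every window length, every window covering the position, and every other window of the same length, so it is far too slow for genomic inputs. That cost is the motivation for the efficient sequential and parallel methods the abstract refers to.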
Author	Schultz, Daniel W.; Xu, Bojian
Author_xml	– sequence: 1 givenname: Daniel W. orcidid: 0000-0003-3912-2841 surname: Schultz fullname: Schultz, Daniel W. organization: Department of Electrical Engineering and Computer Science, The University of Tennessee, Knoxville, TN, USA – sequence: 2 givenname: Bojian orcidid: 0000-0001-5642-6826 surname: Xu fullname: Xu, Bojian organization: Department of Computer Science, Eastern Washington University, Cheney, WA, USA
BackLink	https://www.ncbi.nlm.nih.gov/pubmed/31425048$$D View this record in MEDLINE/PubMed
CitedBy_id	crossref_primary_10_3390_a13090224 crossref_primary_10_3390_a13090234 crossref_primary_10_1371_journal_pone_0251047
Cites_doi	10.1007/978-3-319-07566-2_18 10.1016/j.tcs.2014.11.004 10.1007/3-540-48194-X_17 10.1007/978-3-319-94968-0_18 10.1007/978-3-319-04298-5_44 10.1007/978-3-319-11918-2_16 10.1007/978-3-319-18120-2_19 10.1007/978-3-662-48971-0_63 10.1186/1471-2105-6-123 10.1016/j.tcs.2017.05.032
ContentType	Journal Article
DBID	AAYXX CITATION CGR CUY CVF ECM EIF NPM 7X8
DOI	10.1109/TCBB.2019.2935061
DatabaseName	CrossRef Medline MEDLINE MEDLINE (Ovid) MEDLINE MEDLINE PubMed MEDLINE - Academic
DatabaseTitle	CrossRef MEDLINE Medline Complete MEDLINE with Full Text PubMed MEDLINE (Ovid) MEDLINE - Academic
DatabaseTitleList	MEDLINE - Academic MEDLINE
Database_xml	– sequence: 1 dbid: NPM name: PubMed url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 2 dbid: EIF name: MEDLINE url: https://proxy.k.utb.cz/login?url=https://www.webofscience.com/wos/medline/basic-search sourceTypes: Index Database
DeliveryMethod	fulltext_linktorsrc
Discipline	Biology
EISSN	1557-9964
EndPage	395
ExternalDocumentID	31425048 10_1109_TCBB_2019_2935061
Genre	Journal Article
GroupedDBID	0R~ 29I 4.4 53G 5GY 5VS 6IK 8US 97E AAJGR AAKMM AALFJ AASAJ AAWTH AAWTV AAYXX ABAZT ABQJQ ABVLG ACGFO ACGFS ACIWK ACM ACPRK ADBCU ADL AEBYY AEFXT AEJOY AENEX AENSD AFRAH AFWIH AFWXC AGQYO AHBIQ AIKLT AKJIK AKQYR AKRVB ALMA_UNASSIGNED_HOLDINGS ASPBG ATWAV AVWKF BDXCO BEFXN BFFAM BGNUA BKEBE BPEOZ CCLIF CITATION CS3 DU5 EBS EJD FEDTE GUFHI HGAVV HZ~ I07 IEDLZ IFIPE IPLJI JAVBF LAI LHSKQ M43 O9- OCL P1C P2P PQQKQ RIA RIE RNS ROL TN5 AAYOK ADPZR AETIX AGSQL AIBXA CGR CUY CVF ECM EIF NPM RIG RNI RZB W7O XOL 7X8
ID	FETCH-LOGICAL-c297t-853b3fa4f8e42fb5432b448250417668b3c04c58e7dedea7877c2392ee7e1ac3
ISSN	1545-5963 1557-9964
IngestDate	Sun Sep 28 10:00:34 EDT 2025 Thu Apr 03 07:07:55 EDT 2025 Sat Oct 25 04:05:12 EDT 2025 Thu Apr 24 22:57:18 EDT 2025
IsPeerReviewed	true
IsScholarly	true
Issue	1
Language	English
License	https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037
LinkModel	OpenURL
MergedId	FETCHMERGED-LOGICAL-c297t-853b3fa4f8e42fb5432b448250417668b3c04c58e7dedea7877c2392ee7e1ac3
Notes	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23
ORCID	0000-0001-5642-6826 0000-0003-3912-2841
PMID	31425048
PQID	2336982402
PQPubID	23479
PageCount	10
ParticipantIDs	proquest_miscellaneous_2336982402 pubmed_primary_31425048 crossref_primary_10_1109_TCBB_2019_2935061 crossref_citationtrail_10_1109_TCBB_2019_2935061
PublicationCentury	2000
PublicationDate	2021-1-1 2021 Jan-Feb 20210101
PublicationDateYYYYMMDD	2021-01-01
PublicationDate_xml	– month: 01 year: 2021 text: 2021-1-1 day: 01
PublicationDecade	2020
PublicationPlace	United States
PublicationPlace_xml	– name: United States
PublicationTitle	IEEE/ACM transactions on computational biology and bioinformatics
PublicationTitleAlternate	IEEE/ACM Trans Comput Biol Bioinform
PublicationYear	2021
SSID	ssj0024904
Score	2.2611654
Snippet	k-mismatch shortest unique substring (SUS) queries have recently been proposed and studied because of their useful applications in computational...
SourceID	proquest pubmed crossref
SourceType	Aggregation Database Index Database Enrichment Source
StartPage	386
SubjectTerms	Algorithms; Computational Biology - methods; Computer Graphics; Image Processing, Computer-Assisted; Sequence Analysis, DNA - methods
Title	Parallel Methods for Finding k-Mismatch Shortest Unique Substrings Using GPU
URI	https://www.ncbi.nlm.nih.gov/pubmed/31425048 https://www.proquest.com/docview/2336982402
Volume	18
hasFullText	1
inHoldings	1
isFullTextHit
isPrint
journalDatabaseRights	– providerCode: PRVIEE databaseName: IEEE Electronic Library (IEL) customDbUrl: eissn: 1557-9964 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0024904 issn: 1545-5963 databaseCode: RIE dateStart: 20040101 isFulltext: true titleUrlDefault: https://ieeexplore.ieee.org/ providerName: IEEE
linkProvider	IEEE
openUrl	ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Parallel+Methods+for+Finding+k-Mismatch+Shortest+Unique+Substrings+Using+GPU&rft.jtitle=IEEE%2FACM+transactions+on+computational+biology+and+bioinformatics&rft.au=Schultz%2C+Daniel+W.&rft.au=Xu%2C+Bojian&rft.date=2021-01-01&rft.issn=1545-5963&rft.eissn=1557-9964&rft.volume=18&rft.issue=1&rft.spage=386&rft.epage=395&rft_id=info:doi/10.1109%2FTCBB.2019.2935061&rft.externalDBID=n%2Fa&rft.externalDocID=10_1109_TCBB_2019_2935061