Parallel Methods for Finding k-Mismatch Shortest Unique Substrings Using GPU

Bibliographic Details
Published in IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 18, no. 1, pp. 386-395
Main Authors Schultz, Daniel W.; Xu, Bojian
Format Journal Article
Language English
Published United States 01.01.2021
ISSN 1545-5963
EISSN 1557-9964
DOI 10.1109/TCBB.2019.2935061

Abstract
k-mismatch shortest unique substring (SUS) queries have been proposed and studied very recently because of their useful applications in computational biology. The k-mismatch SUS query over a given position of a string asks for a shortest substring that covers the given position and has no duplicate (within a Hamming distance of k) elsewhere in the string. The challenge in SUS queries is to collectively find the SUS for every position of a massively long string in a manner that is both time- and space-efficient. All known efforts and results have focused on improving the time and space efficiency of SUS computation in the sequential CPU model. In this work, we propose the first parallel approach for k-mismatch SUS queries, leveraging the massively multi-threaded architecture of the graphics processing unit (GPU). An experimental study performed on a mid-range GPU using real-world biological data shows that our proposal is consistently faster than the fastest CPU solution by a factor of at least 6 for exact SUS queries (k = 0) and at least 23 for approximate SUS queries over DNA sequences (k > 0), while maintaining nearly the same peak memory usage as the most memory-efficient sequential CPU proposal. Our work provides practitioners with a faster tool for SUS finding on massively long strings, and indeed provides the first practical tool for approximate SUS computation, because the any-case quadratic time cost of the state-of-the-art sequential CPU method for approximate SUS queries does not scale well even to modestly long strings.
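To make the query definition in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the authors' GPU method and makes no attempt at efficiency; it only restates the definition as code. The function names, the 0-based inclusive indexing, and the tie-breaking rule (return the leftmost candidate among equally short ones) are assumptions made for illustration.

```python
def hamming_within(a: str, b: str, k: int) -> bool:
    """Return True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False
    return True


def is_k_unique(s: str, start: int, length: int, k: int) -> bool:
    """True if s[start:start+length] has no other occurrence within Hamming distance k."""
    target = s[start:start + length]
    for other in range(len(s) - length + 1):
        if other != start and hamming_within(target, s[other:other + length], k):
            return False
    return True


def k_mismatch_sus(s: str, pos: int, k: int) -> tuple:
    """Brute-force k-mismatch SUS covering position pos (0-based).

    Returns (start, end), both inclusive. Assumption: ties between
    equally short candidates are broken by taking the leftmost one.
    """
    n = len(s)
    if not 0 <= pos < n:
        raise ValueError("pos must be a valid index into s")
    for length in range(1, n + 1):
        # Only start positions whose window of this length covers pos.
        lo = max(0, pos - length + 1)
        hi = min(pos, n - length)
        for start in range(lo, hi + 1):
            if is_k_unique(s, start, length, k):
                return (start, start + length - 1)
    # Unreachable: the whole string has no other occurrence of its length.
    raise AssertionError("the full string is always k-mismatch unique")


# Exact queries (k = 0) on a toy string.
print(k_mismatch_sus("banana", 0, 0))  # (0, 0): "b" occurs once
print(k_mismatch_sus("banana", 3, 0))  # (2, 4): "nan"
# Approximate query (k = 1): shorter prefixes all have a near-duplicate.
print(k_mismatch_sus("banana", 0, 1))  # (0, 4): "banan"
```

Answering the query for every position this way repeats a large amount of work, which is the scaling problem the abstract describes for long strings; the sketch is useful mainly as a reference for checking faster implementations on small inputs.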
Authors
Schultz, Daniel W. (ORCID 0000-0003-3912-2841), Department of Electrical Engineering and Computer Science, The University of Tennessee, Knoxville, TN, USA
Xu, Bojian (ORCID 0000-0001-5642-6826), Department of Computer Science, Eastern Washington University, Cheney, WA, USA
PubMed record https://www.ncbi.nlm.nih.gov/pubmed/31425048
Subject Terms
Algorithms
Computational Biology - methods
Computer Graphics
Image Processing, Computer-Assisted
Sequence Analysis, DNA - methods