SeAlM: A Query Cache Optimization Technique for Next Generation Sequence Alignment

Bibliographic Details
Published in IEEE ... International Conference on Data Mining workshops, pp. 958-965
Main Authors Stene, Evan, Banaei-Kashani, Farnoush
Format Conference Proceeding
Language English
Published IEEE 01.11.2019
ISSN 2375-9259
DOI 10.1109/ICDMW.2019.00139

Abstract Genetic data from next-generation sequencing (NGS) technology is being produced at an ever-increasing rate, already outpacing the well-known Moore's Law. At this pace of NGS data generation, new methods are necessary to facilitate rapid sequence analysis at the enormous scale required. The need for such methods is compounded by the falling financial cost of sequencing, which is making large-scale genome studies spanning entire populations routine. A key process in the genomic data analysis pipeline, and often the most time-consuming one, is read mapping, also called alignment. This paper introduces Sequence Alignment Memorizer (SeAlM), a technique that reduces the number of redundant alignments to enable population-scale workloads. SeAlM uses a novel method for reordering alignment queries from multiple sources to create batches that are more likely to contain redundant queries, which can be de-duplicated before alignment, while also ordering those batches so that queries can be cached more effectively. We show that our technique can improve the average throughput of alignment for a single human sample by 6.5% and for a population of 10 human subjects by 13.6% to 18.8%, depending on the type of genetic data used.
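The abstract describes de-duplicating alignment queries within a batch and caching results across batches. The following Python sketch illustrates that general idea only; it is not the authors' implementation. The names `LRUCache`, `align_batch`, and the `aligner` callback are hypothetical, the least-recently-used eviction policy is an assumption, and the query reordering step that the paper contributes is not reproduced here because this record does not describe it.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical least-recently-used cache: read sequence -> alignment result."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, key):
        """Return the cached result, or None on a miss."""
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key, value):
        """Insert a result and evict the least recently used entry if over capacity."""
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)


def align_batch(reads, aligner, cache):
    """Align a batch of reads, calling `aligner` once per distinct uncached sequence.

    `aligner` is an assumed callback (sequence -> non-None result) standing in
    for a real read mapper.
    """
    results = {}
    for seq in dict.fromkeys(reads):  # de-duplicate while preserving first-seen order
        hit = cache.get(seq)
        if hit is None:
            hit = aligner(seq)  # only unseen sequences reach the aligner
            cache.put(seq, hit)
        results[seq] = hit
    # Expand back to one result per input read, in the original order.
    return [results[seq] for seq in reads]
```

In this sketch, repeated sequences inside one batch cost a single aligner call, and sequences that recur in later batches are served from the cache as long as they have not been evicted. How much this helps depends on how batches are ordered, which is the part addressed by the paper itself.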
Author Stene, Evan
Banaei-Kashani, Farnoush
Author_xml – sequence: 1
  givenname: Evan
  surname: Stene
  fullname: Stene, Evan
  organization: University of Colorado, Denver
– sequence: 2
  givenname: Farnoush
  surname: Banaei-Kashani
  fullname: Banaei-Kashani, Farnoush
  organization: University of Colorado, Denver
ContentType Conference Proceeding
DOI 10.1109/ICDMW.2019.00139
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Discipline Statistics
Computer Science
EISBN 1728148960
9781728148960
EISSN 2375-9259
EndPage 965
ExternalDocumentID 8955601
Genre orig-research
IsPeerReviewed false
IsScholarly false
Language English
PageCount 8
ParticipantIDs ieee_primary_8955601
PublicationCentury 2000
PublicationDate 2019-Nov.
PublicationDateYYYYMMDD 2019-11-01
PublicationDate_xml – month: 11
  year: 2019
  text: 2019-Nov.
PublicationDecade 2010
PublicationTitle IEEE ... International Conference on Data Mining workshops
PublicationTitleAbbrev ICDMW
PublicationYear 2019
Publisher IEEE
Publisher_xml – name: IEEE
StartPage 958
SubjectTerms alignment
Bioinformatics
Genomics
Indexes
next generation sequencing
query cache optimization
read mapping
Sequential analysis
Sociology
Statistics
Title SeAlM: A Query Cache Optimization Technique for Next Generation Sequence Alignment
URI https://ieeexplore.ieee.org/document/8955601