SeAlM: A Query Cache Optimization Technique for Next Generation Sequence Alignment

Bibliographic Details
Published in	IEEE ... International Conference on Data Mining workshops, pp. 958-965
Main Authors	Stene, Evan; Banaei-Kashani, Farnoush
Format	Conference Proceeding
Language	English
Published	IEEE 01.11.2019
Subjects	alignment; Bioinformatics; Genomics; Indexes; next generation sequencing; query cache optimization; read mapping; Sequential analysis; Sociology; Statistics
ISSN	2375-9259
DOI	10.1109/ICDMW.2019.00139

Abstract	Genetic data from next-generation sequencing (NGS) technology is being produced at an ever-increasing rate, already outpacing the well-known Moore's Law. Due to this pace of NGS data generation, new methods are necessary to facilitate rapid sequence analysis at the enormous scale required. The need for such methods is further compounded by the dropping financial cost of sequencing, leading to the normalization of large-scale genome studies spanning entire populations. A key process in the genomic data analysis pipeline, and one that is often the most time-consuming, is read mapping, or so-called alignment. This paper introduces Sequence Alignment Memorizer (SeAlM), a technique that reduces the number of redundant alignments to enable population-scale workloads. SeAlM uses a novel method for reordering alignment queries from multiple sources to create batches with increased likelihood of containing redundant queries that can be de-duplicated before alignment, while also ordering those batches to improve the ability to cache queries effectively. We show that our technique can improve the average throughput of alignment for a single human sample by 6.5% and for a population of 10 human subjects by 13.6% to 18.8%, depending on the type of genetic data used.
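
The general idea in the abstract (batch queries from several sources, de-duplicate each batch, and consult a cache before aligning) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the reordering method, so sorting the merged reads is used here only as a stand-in that places identical reads next to each other, and the LRU eviction policy is likewise an assumption. The names `LRUCache`, `make_batches`, `align_batch`, and `toy_aligner` are hypothetical, and `toy_aligner` merely stands in for a real read mapper.

```python
from collections import OrderedDict
from typing import Callable, Dict, List, Optional, Tuple


class LRUCache:
    """Small least-recently-used cache mapping a read to its alignment."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._data: "OrderedDict[str, str]" = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used


def make_batches(sources: List[List[str]], batch_size: int) -> List[List[str]]:
    """Merge reads from all sources and cut them into fixed-size batches.

    Assumption: sorting is a stand-in for the paper's reordering step.
    It only guarantees that identical reads end up adjacent.
    """
    merged = sorted(read for source in sources for read in source)
    return [merged[i:i + batch_size] for i in range(0, len(merged), batch_size)]


def align_batch(
    reads: List[str],
    aligner: Callable[[str], str],
    cache: LRUCache,
) -> Tuple[List[str], int]:
    """Return one result per input read and the number of aligner calls."""
    unique = list(dict.fromkeys(reads))  # de-duplicate, keeping first-seen order
    results: Dict[str, str] = {}
    calls = 0
    for read in unique:
        hit = cache.get(read)  # assumes aligner results are never None
        if hit is None:
            hit = aligner(read)
            calls += 1
            cache.put(read, hit)
        results[read] = hit
    return [results[read] for read in reads], calls


def toy_aligner(read: str) -> str:
    """Hypothetical placeholder for an expensive alignment call."""
    return f"pos:{sum(map(ord, read)) % 1000}"


if __name__ == "__main__":
    sources = [["ACGT", "TTGA", "ACGT"], ["TTGA", "GGCC", "ACGT"]]
    cache = LRUCache(capacity=2)
    for batch in make_batches(sources, batch_size=4):
        aligned, calls = align_batch(batch, toy_aligner, cache)
        print(len(batch), "reads,", calls, "aligner calls")
```

In this sketch the saving comes from two places: duplicates inside a batch are aligned once, and reads already seen in an earlier batch are served from the cache while they remain resident. How much either effect helps on real data depends on the reordering and caching choices, which the abstract does not detail.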
Author	Stene, Evan; Banaei-Kashani, Farnoush
Author_xml	– sequence: 1 givenname: Evan surname: Stene fullname: Stene, Evan organization: University of Colorado, Denver – sequence: 2 givenname: Farnoush surname: Banaei-Kashani fullname: Banaei-Kashani, Farnoush organization: University of Colorado, Denver
ContentType	Conference Proceeding
DBID	6IE 6IL CBEJK RIE RIL
DOI	10.1109/ICDMW.2019.00139
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings; IEEE Xplore POP ALL; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP All) 1998-Present
Database_xml	– sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher
DeliveryMethod	fulltext_linktorsrc
Discipline	Statistics Computer Science
EISBN	1728148960 9781728148960
EISSN	2375-9259
EndPage	965
ExternalDocumentID	8955601
Genre	orig-research
GroupedDBID	6IE 6IF 6IH 6IK 6IL 6IN AAJGR AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IPLJI OCL RIE RIL RNS
IEDL.DBID	RIE
IngestDate	Wed Aug 27 02:33:46 EDT 2025
IsPeerReviewed	false
IsScholarly	false
Language	English
LinkModel	DirectLink
PageCount	8
ParticipantIDs	ieee_primary_8955601
PublicationCentury	2000
PublicationDate	2019-Nov.
PublicationDateYYYYMMDD	2019-11-01
PublicationDate_xml	– month: 11 year: 2019 text: 2019-Nov.
PublicationDecade	2010
PublicationTitle	IEEE ... International Conference on Data Mining workshops
PublicationTitleAbbrev	ICDMW
PublicationYear	2019
Publisher	IEEE
Publisher_xml	– name: IEEE
SourceID	ieee
SourceType	Publisher
StartPage	958
URI	https://ieeexplore.ieee.org/document/8955601