Predictive Analytics on Genomic Data with High-Performance Computing

Bibliographic Details
Published in: 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2187-2194
Main Authors: Leung, Carson K.; Sarumi, Oluwafemi A.; Zhang, Christine Y.
Format: Conference Proceeding
Language: English
Published: IEEE, 16.12.2020
DOI: 10.1109/BIBM49941.2020.9312982

More Information
Summary: Recent technological advancements and scientific discoveries have revolutionized the current era of genomics. Next-generation sequencing (NGS) technologies have greatly reduced sequencing time and given rise to the production and collection of high volumes of genomic datasets. Predicting protein-coding genes from these copious genomic datasets is significant for the synthesis of proteins and the understanding of the regulatory function of the non-coding region. Methods have been developed to find protein-coding genes in the genomes of organisms. Nevertheless, the recent data explosion in genomics accentuates the need for more efficient algorithms for gene prediction. In this paper, we explore predictive analytics on genomic data. In particular, we present a scalable naïve Bayes-based algorithm that is deployed on an Apache Spark cluster for efficient prediction of genes in the genomes of eukaryotic organisms. Evaluation results on human genome chromosomes (GRCh37 and GRCh38) show the effectiveness of our algorithm for predictive analytics on genomic data with high-performance computing, with high sensitivity, specificity and accuracy.