Predictive Analytics on Genomic Data with High-Performance Computing

Bibliographic Details
Published in	2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) pp. 2187 - 2194
Main Authors	Leung, Carson K., Sarumi, Oluwafemi A., Zhang, Christine Y.
Format	Conference Proceeding
Language	English
Published	IEEE 16.12.2020
Subjects	Apache Spark; big data; Bioinformatics; data mining; data science; gene prediction; Genomics; high performance computing; machine learning; Organisms; Prediction algorithms; Proteins; Sparks; Training
DOI	10.1109/BIBM49941.2020.9312982

More Information
Summary:	Recent technological advancements and scientific discoveries have revolutionized genomics. Next-generation sequencing (NGS) technologies have greatly reduced sequencing time and led to the production and collection of large volumes of genomic data. Predicting protein-coding genes from these datasets is important for protein synthesis and for understanding the regulatory function of non-coding regions. Methods have been developed to find protein-coding genes in the genomes of organisms. Nevertheless, the recent data explosion in genomics accentuates the need for more efficient gene prediction algorithms. In this paper, we explore predictive analytics on genomic data. In particular, we present a scalable naïve Bayes-based algorithm, deployed on an Apache Spark cluster, for efficient prediction of genes in the genomes of eukaryotic organisms. Evaluation results on the human genome chromosome GRCh37 and GRCh38 show the effectiveness of our algorithm for predictive analytics on genomic data with high-performance computing, with high sensitivity, specificity, and accuracy.
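
The summary names a naïve Bayes classifier for gene prediction but this record gives no algorithmic detail. The block below is a minimal single-machine sketch of one common formulation, a multinomial naïve Bayes model over k-mer counts with add-one smoothing. It is not the authors' algorithm: the choice of k-mer features, the value K = 3, the function names, and the toy sequences are all assumptions made for illustration, and the toy sequences are invented strings, not real genes. The per-class k-mer counts are plain sums, which is the kind of step a framework such as Apache Spark could aggregate across partitions.

```python
import math
from collections import Counter
from itertools import product

K = 3  # k-mer length (assumed for illustration, not taken from the paper)
VOCAB = ["".join(p) for p in product("ACGT", repeat=K)]

# Invented toy sequences, labelled by hand purely to exercise the code.
TOY_TRAINING = [
    ("ATGGCCGCCGCCGCC", "coding"),
    ("ATGGCGGCGGCGGCG", "coding"),
    ("ATATATATATATAT", "noncoding"),
    ("TTTTAAAATTTTAAAA", "noncoding"),
]


def kmers(seq, k=K):
    """Return the overlapping k-mers of a DNA sequence."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(labelled_seqs):
    """Fit a multinomial naive Bayes model with add-one smoothing.

    labelled_seqs: iterable of (sequence, label) pairs.
    Returns (log_priors, log_likelihoods).
    """
    class_counts = Counter()
    kmer_counts = {}
    for seq, label in labelled_seqs:
        class_counts[label] += 1
        kmer_counts.setdefault(label, Counter()).update(kmers(seq))

    total = sum(class_counts.values())
    log_priors = {c: math.log(n / total) for c, n in class_counts.items()}

    log_likelihoods = {}
    for c, counts in kmer_counts.items():
        # Only k-mers over A/C/G/T are modelled; others (e.g. with N) are ignored.
        denom = sum(counts[m] for m in VOCAB) + len(VOCAB)
        log_likelihoods[c] = {
            m: math.log((counts[m] + 1) / denom) for m in VOCAB
        }
    return log_priors, log_likelihoods


def predict(model, seq):
    """Return the class with the highest posterior log-score for seq."""
    log_priors, log_likelihoods = model
    scores = {}
    for c, prior in log_priors.items():
        table = log_likelihoods[c]
        scores[c] = prior + sum(table[m] for m in kmers(seq) if m in table)
    return max(scores, key=scores.get)
```

Calling `train(TOY_TRAINING)` and then `predict(model, seq)` labels a new sequence. Working in log space avoids numeric underflow when many small probabilities are multiplied over a long sequence.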