A graph-based algorithm for RNA-seq data normalization

Bibliographic Details
Published in	PLoS ONE, Vol. 15, no. 1, p. e0227760
Main Authors	Tran, Diem-Trang; Bhaskara, Aditya; Kuberan, Balagurunathan; Might, Matthew
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 24.01.2020
Subjects	Algorithms; Apexes; Biology and Life Sciences; Circularity; Computational Biology - methods; Data Science - methods; Databases, Genetic - statistics & numerical data; Feasibility Studies; Gene sequencing; Genes; Hypotheses; Methods; Physical Sciences; Research and Analysis Methods; Ribonucleic acid; RNA; RNA-Seq - methods; RNA-Seq - statistics & numerical data; United States > US; Utah
ISSN	1932-6203
DOI	10.1371/journal.pone.0227760

More Information
Summary:	RNA sequencing (RNA-seq) has attracted much attention in recent years as a tool for characterizing and understanding biological systems. However, gaining insight from a large number of RNA-seq experiments collectively remains a major challenge because of the normalization problem. Normalization is difficult because of an inherent circularity: RNA-seq data must be normalized before any pattern of differential (or non-differential) expression can be ascertained, yet prior knowledge of the non-differential transcripts is crucial to the normalization process. Some methods have overcome this problem by assuming that most transcripts are not differentially expressed. However, as RNA-seq profiles become more abundant and heterogeneous, this assumption fails to hold, leading to erroneous normalization. We present a normalization procedure that relies neither on this assumption nor on prior knowledge of the reference transcripts. The algorithm is based on a graph constructed from intrinsic correlations among RNA-seq transcripts and seeks to identify a set of densely connected vertices as references. Applied to our synthesized validation data, the algorithm recovered the reference transcripts with high precision, resulting in high-quality normalization. On a realistic data set from the ENCODE project, the algorithm gave good results and finished in a reasonable time. These preliminary results suggest that it may be possible to break the long-persisting circularity problem in RNA-seq normalization.
Bibliography:	Competing Interests: The authors have declared that no competing interests exist.
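
The idea outlined in the Summary (build a graph from correlations among transcripts, then take a densely connected set of vertices as references) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the Pearson correlation on log-transformed values, the `corr_threshold` cutoff, the greedy peeling heuristic for finding a dense subgraph, and the synthetic data are all assumptions chosen for illustration; the record does not say which correlation measure or dense-subgraph method the paper uses.

```python
import numpy as np


def find_reference_set(counts, corr_threshold=0.9):
    """Return indices of a densely connected set of transcripts.

    counts: array of shape (n_transcripts, n_samples) with raw abundances.
    corr_threshold: hypothetical cutoff for drawing an edge (an assumption).
    """
    log_counts = np.log1p(counts)
    # Transcript-by-transcript Pearson correlation across samples.
    corr = np.nan_to_num(np.corrcoef(log_counts))
    adj = corr >= corr_threshold
    np.fill_diagonal(adj, False)

    n = adj.shape[0]
    alive = np.ones(n, dtype=bool)
    degree = adj.sum(axis=1).astype(float)
    n_edges = degree.sum() / 2.0
    best_density, best_set = -1.0, alive.copy()

    # Greedy peeling (a stand-in heuristic, not taken from the paper):
    # repeatedly drop the lowest-degree vertex and remember the intermediate
    # subgraph with the highest edges-per-vertex density.
    for n_alive in range(n, 0, -1):
        density = n_edges / n_alive
        if density > best_density:
            best_density, best_set = density, alive.copy()
        v = int(np.argmin(np.where(alive, degree, np.inf)))
        alive[v] = False
        n_edges -= degree[v]
        degree[adj[v] & alive] -= 1.0
    return np.flatnonzero(best_set)


def normalize_by_references(counts, ref_idx):
    """Scale each sample by the total abundance of its reference transcripts."""
    ref_totals = counts[ref_idx].sum(axis=0)
    # Divide by the geometric mean so the factors are centered around 1.
    factors = ref_totals / np.exp(np.mean(np.log(ref_totals)))
    return counts / factors, factors


# Synthetic data (an assumption): 30 reference transcripts out of 200, so the
# "most transcripts are not differentially expressed" assumption does not hold.
rng = np.random.default_rng(0)
n_ref, n_de, n_samples = 30, 170, 40
true_scale = rng.lognormal(0.0, 0.5, size=n_samples)       # per-sample depth
baseline = rng.uniform(50.0, 500.0, size=(n_ref + n_de, 1))
fold_change = np.ones((n_ref + n_de, n_samples))
fold_change[n_ref:] = rng.lognormal(0.0, 2.0, size=(n_de, n_samples))
noise = rng.lognormal(0.0, 0.05, size=(n_ref + n_de, n_samples))
counts = baseline * fold_change * true_scale * noise

ref_idx = find_reference_set(counts)
normalized, factors = normalize_by_references(counts, ref_idx)
print("recovered reference transcripts:", len(ref_idx))
```

In this toy setup the reference transcripts vary across samples only through the shared per-sample scale, so they correlate strongly with one another and form a dense cluster in the thresholded graph, while the differentially expressed transcripts do not. Real RNA-seq counts are noisier and would likely need a more careful choice of correlation measure and threshold.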