Analysis and correction of compositional bias in sparse sequencing count data

Bibliographic Details
Published in	BMC Genomics Vol. 19; no. 1; pp. 799 - 23
Main Authors	Kumar, M. Senthil, Slud, Eric V., Okrah, Kwame, Hicks, Stephanie C., Hannenhalli, Sridhar, Corrada Bravo, Héctor
Format	Journal Article
Language	English
Published	London; BioMed Central; 06.11.2018; BioMed Central Ltd; Springer Nature B.V; BMC
Subjects	Absolute abundance; Algorithms; Analysis; Animal Genetics and Genomics; Bayes Theorem; Bayesian analysis; Bias; Biomedical and Life Sciences; Compositional bias; Computational Biology - methods; Count data; Data integration; Deoxyribonucleic acid; DNA; DNA sequencing; Empirical analysis; Empirical Bayes; Experiments; Gene expression; Genomics; High-Throughput Nucleotide Sequencing - methods; Hypothesis testing; Life Sciences; Metagenomics; Metagenomics - methods; Methodology; Methodology Article; Microarrays; Microbial Genetics and Genomics; Microbiota; Normalization; Oceans; Plant Genetics and Genomics; Proteomics; Rarefaction; RNA, Ribosomal, 16S - genetics; Scaling; Science; scRNAseq; Selection bias; Spike-in; Transcriptomic methods
ISSN	1471-2164
DOI	10.1186/s12864-018-5160-5

Summary:	Background: Count data derived from high-throughput deoxyribonucleic acid (DNA) sequencing is frequently used in quantitative molecular assays. Due to properties inherent to the sequencing process, unnormalized count data is compositional, measuring relative and not absolute abundances of the assayed features. This compositional bias confounds inference of absolute abundances. Commonly used count data normalization approaches like library size scaling/rarefaction/subsampling cannot correct for compositional or any other relevant technical bias that is uncorrelated with library size. Results: We demonstrate that existing techniques for estimating compositional bias fail with sparse metagenomic 16S count data and propose an empirical Bayes normalization approach to overcome this problem. In addition, we clarify the assumptions underlying frequently used scaling normalization methods in light of compositional bias, including scaling methods that were not designed directly to address it. Conclusions: Compositional bias, induced by the sequencing machine, confounds inferences of absolute abundances. We present a normalization technique for compositional bias correction in sparse sequencing count data, and demonstrate its improved performance in metagenomic 16S survey data. Based on the distribution of technical bias estimates arising from several publicly available large-scale 16S count datasets, we argue that detailed experiments specifically addressing the influence of compositional bias in metagenomics are needed.
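Illustration:	The summary's central claim, that library size scaling cannot remove compositional bias, can be seen in a small numeric sketch. The example below is not the article's empirical Bayes method; the abundance values and the names `proportions`, `absolute_a`, and `absolute_b` are hypothetical, and the correction step assumes the true total abundances are known (for instance from a spike-in), which is rarely the case in practice.

```python
# Toy illustration of compositional bias (hypothetical numbers, not from the article).

def proportions(abundances):
    """Library size scaling: divide each feature by the sample total."""
    total = sum(abundances)
    return [a / total for a in abundances]

# Assumed absolute abundances of four features in two samples.
# Only feature 0 truly changes (10-fold increase); features 1-3 are unchanged.
absolute_a = [100.0, 100.0, 100.0, 100.0]
absolute_b = [1000.0, 100.0, 100.0, 100.0]

# Sequencing observes relative abundances only, so scaled counts look like this.
prop_a = proportions(absolute_a)
prop_b = proportions(absolute_b)

# Fold changes computed from library-size-scaled data.
apparent_fold = [b / a for a, b in zip(prop_a, prop_b)]

# Fold changes in the underlying absolute abundances.
true_fold = [b / a for a, b in zip(absolute_a, absolute_b)]

# Compositional scale factor: ratio of total absolute abundances.
# Assumed known here purely for illustration.
bias = sum(absolute_a) / sum(absolute_b)
corrected_fold = [f / bias for f in apparent_fold]

print("apparent :", [round(f, 3) for f in apparent_fold])
print("true     :", [round(f, 3) for f in true_fold])
print("corrected:", [round(f, 3) for f in corrected_fold])
```

After scaling, the three unchanged features appear to drop to about 0.31 of their original level, although their absolute abundance is identical in both samples. Dividing by the compositional scale factor recovers the true fold changes; estimating that factor from sparse counts alone is the problem the article addresses.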