The Poisson distribution model fits UMI-based single-cell RNA-sequencing data



Bibliographic Details
Published in	BMC bioinformatics Vol. 24; no. 1; pp. 256 - 27
Main Authors	Pan, Yue, Landis, Justin T., Moorad, Razia, Wu, Di, Marron, J. S., Dittmer, Dirk P.
Format	Journal Article
Language	English
Published	London: BioMed Central, 17.06.2023; BioMed Central Ltd; Springer Nature B.V; BMC
Subjects	Accuracy; Algorithms; Analysis; Approximation; Bioinformatics; Biomedical and Life Sciences; Cluster Analysis; Clustering; Clustering (Computers); Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Computer simulation; Data representation; Electronic data processing; Gene Expression Profiling - methods; Gene sequencing; Genes; Heterogeneity; Hypotheses; Hypothesis testing; Information management; Life Sciences; Mathematical models; Methods; Microarrays; Modelling; Optimization; Parameter estimation; Parameters; Poisson Distribution; Probability; Representations; Ribonucleic acid; RNA; RNA - genetics; RNA sequencing; RNA-seq; Sequence Analysis, RNA - methods; Single cell; Single-Cell Analysis - methods; Single-cell technologies; Statistical models; United States; Single cell RNA-seq
ISSN	1471-2105
DOI	10.1186/s12859-023-05349-2

Summary:	Background: Modeling of single-cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy because aggregation at those two levels is too crude. Results: We avoid the crude approximations entailed by such aggregation by proposing an independent Poisson distribution (IPD) at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed, or can only be found by careful parameter tuning, using conventional methods. Conclusions: This new method has multiple advantages, including (1) no need for prior feature selection or manual optimization of hyperparameters; and (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson.
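The two ideas in the summary (zeros as entries with a very small Poisson parameter, and departures from a homogeneous per-entry Poisson model) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the rank-one rate estimate (row total times column total over grand total) is one simple choice of "homogeneous" model, and the Pearson-style residual is only a stand-in for a departure measure. It is not the DIPD definition from the paper and it does not use the scpoisson package.

```python
import math

def poisson_zero_prob(lam):
    """P(X = 0) for X ~ Poisson(lam); close to 1 when lam is very small."""
    return math.exp(-lam)

def homogeneous_rates(counts):
    """Assumed homogeneous model: one Poisson rate per matrix entry,
    lam[i][j] = row_total[i] * col_total[j] / grand_total."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def departures(counts, rates):
    """Illustrative departure of each entry from its homogeneous rate,
    here a Pearson-style residual (x - lam) / sqrt(lam)."""
    return [
        [(x - lam) / math.sqrt(lam) if lam > 0 else 0.0
         for x, lam in zip(count_row, rate_row)]
        for count_row, rate_row in zip(counts, rates)
    ]

# Toy count matrix (rows: genes, columns: cells; the orientation is arbitrary).
# Cells 1-2 and cells 3-4 form two made-up clusters.
counts = [
    [0, 0, 5, 6],
    [0, 1, 4, 7],
    [8, 9, 0, 0],
    [7, 6, 1, 0],
]

rates = homogeneous_rates(counts)
resid = departures(counts, rates)

# A small rate makes a zero the most likely outcome, with no zero-inflation term.
print(round(poisson_zero_prob(0.05), 3))

# Signs of the departures separate the two made-up clusters.
for row in resid:
    print(["+" if value > 0 else "-" for value in row])
```

In this toy matrix the sign pattern of the residuals differs between the first two and the last two columns, which is the kind of per-entry heterogeneity a departure-based representation is meant to expose; a real analysis would use the published method rather than this sketch.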