Inferring cellular and molecular processes in single-cell data with non-negative matrix factorization using Python, R and GenePattern Notebook implementations of CoGAPS


Bibliographic Details
Published in Nature Protocols Vol. 18; no. 12; pp. 3690-3731
Main Authors Johnson, Jeanette A. I., Tsang, Ashley P., Mitchell, Jacob T., Zhou, David L., Bowden, Julia, Davis-Marcisak, Emily, Sherman, Thomas, Liefeld, Ted, Loth, Melanie, Goff, Loyal A., Zimmerman, Jacquelyn W., Kinny-Köster, Ben, Jaffee, Elizabeth M., Tamayo, Pablo, Mesirov, Jill P., Reich, Michael, Fertig, Elana J., Stein-O’Brien, Genevieve L.
Format Journal Article
Language English
Published London Nature Publishing Group UK 01.12.2023
Nature Publishing Group
ISSN 1754-2189
1750-2799
DOI 10.1038/s41596-023-00892-x

Summary: Non-negative matrix factorization (NMF) is an unsupervised learning method well suited to high-throughput biology. However, inferring biological processes from an NMF result still requires additional post hoc statistics and annotation for interpretation of learned features. Here, we introduce a suite of computational tools that implement NMF and provide methods for accurate and clear biological interpretation and analysis. A generalized discussion of NMF covering its benefits, limitations and open questions is followed by four procedures for the Bayesian NMF algorithm Coordinated Gene Activity across Pattern Subsets (CoGAPS). Each procedure demonstrates NMF analysis to quantify cell state transitions in a public domain single-cell RNA-sequencing dataset. The first demonstrates PyCoGAPS, our new Python implementation that enhances runtime for large datasets, and the second allows its deployment in Docker. The third procedure steps through the same single-cell NMF analysis using our R CoGAPS interface. The fourth introduces a beginner-friendly CoGAPS platform using GenePattern Notebook, aimed at users with a working conceptual knowledge of data analysis but without basic proficiency in the R or Python programming languages. We also constructed a user-facing website to serve as a central repository for information and instructional materials about CoGAPS and its application programming interfaces. Setting up the packages and conducting a test run is expected to take around 15 min, with an additional 30 min to conduct analyses on a precomputed result. The expected runtime on the user’s desired dataset can vary from hours to days depending on factors such as dataset size or input parameters.
Key points: This protocol describes procedures for learning cellular and molecular processes from single-cell RNA-sequencing data using the non-negative matrix factorization algorithm Coordinated Gene Activity across Pattern Subsets. This is implemented and demonstrated in Python and R, with additional vignettes covering how to run Coordinated Gene Activity across Pattern Subsets via Docker deployment and GenePattern Notebook. This protocol presents an end-to-end workflow that is usable, flexible, optimized for contemporary single-cell data formats, and accessible and intuitive for computational biologists. Parallel analysis is demonstrated in Python, R and GenePattern Notebook.
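As a rough illustration of the general NMF idea mentioned in the summary, the sketch below factorizes a small synthetic non-negative matrix into two non-negative factors using standard multiplicative updates. This is not CoGAPS, which is a Bayesian algorithm, and it does not use the PyCoGAPS interface; the function name `nmf_multiplicative`, the matrix names `A` and `P`, the genes-by-cells layout and the toy data are all assumptions made for illustration.

```python
import numpy as np


def nmf_multiplicative(X, n_patterns, n_iter=1000, seed=0, eps=1e-9):
    """Approximate a non-negative matrix X (genes x cells) as A @ P.

    Hypothetical helper using Lee-Seung multiplicative updates for the
    Frobenius-norm objective. It is NOT the Bayesian CoGAPS algorithm.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    # Random strictly positive starting points for both factors.
    A = rng.random((n_genes, n_patterns)) + eps
    P = rng.random((n_patterns, n_cells)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so A and P
        # stay non-negative throughout.
        P *= (A.T @ X) / (A.T @ A @ P + eps)
        A *= (X @ P.T) / (A @ P @ P.T + eps)
    return A, P


# Toy data (an assumption): an exactly rank-3 non-negative matrix.
rng = np.random.default_rng(1)
X = rng.random((30, 3)) @ rng.random((3, 20))

A, P = nmf_multiplicative(X, n_patterns=3)
rel_error = np.linalg.norm(X - A @ P) / np.linalg.norm(X)
print("relative reconstruction error:", rel_error)
```

In this sketch the columns of `A` play the role of per-gene weights and the rows of `P` the role of per-cell pattern activity; interpreting such factors biologically still needs the post hoc statistics and annotation that the summary describes.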
These authors contributed equally: Jeanette A.I. Johnson, Ashley P. Tsang.
Author contributions: E.J.F., G.L.S.-O. and T.S. originally conceived of the project. E.D.-M. and M.L. prepared a preliminary draft of the manuscript. A.P.T. and J.A.I.J. wrote PyCoGAPS with guidance from G.L.S.-O. A.P.T. implemented the PyCoGAPS GenePattern Notebook and introduced Docker support. M.R. and J.T.M. provided critical GenePattern Notebook support and collaboration. J.A.I.J. and A.P.T. wrote user guides, and J.T.M. performed the PDAC Atlas single-cell analysis included in them. J.B. created the CoGAPS website. All authors read, edited and approved the final manuscript.