Data-driven human transcriptomic modules determined by independent component analysis

Bibliographic Details
Published in	BMC bioinformatics Vol. 19; no. 1; pp. 327 - 25
Main Authors	Zhou, Weizhuang, Altman, Russ B.
Format	Journal Article
Language	English
Published	London BioMed Central 17.09.2018 BioMed Central Ltd BMC
Subjects	Algorithms; Arthritis, Rheumatoid - genetics; Arthritis, Rheumatoid - pathology; Bioinformatics; Biological activity; Biomedical and Life Sciences; Cancer; Classification; Clustering; Computational biology; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Dimensional analysis; DNA Fingerprinting; DNA microarrays; Functional modules; Gene expression; Genomics; Humans; Independent component analysis; Learning algorithms; Leukemia; Leukemia - genetics; Leukemia - pathology; Life Sciences; Machine learning; Medical research; Microarrays; MicroRNAs; Model accuracy; Modules; Oligonucleotide Array Sequence Analysis; Precision medicine; Principal components analysis; Regularization; Research Article; Rhabdomyosarcoma - genetics; Rhabdomyosarcoma - pathology; Rheumatoid arthritis; Transcriptome; Transcriptome analysis
ISSN	1471-2105
DOI	10.1186/s12859-018-2338-4

More Information
Summary:	Background Analyzing the human transcriptome is crucial in advancing precision medicine, and the more than half a million human microarray samples in the Gene Expression Omnibus (GEO) have enabled us to better characterize biological processes at the molecular level. However, transcriptomic analysis is challenging because the data is inherently noisy and high-dimensional. Gene set analysis is currently widely used to alleviate the issue of high dimensionality, but the user-defined choice of gene sets can introduce bias into the results. In this paper, we advocate the use of a fixed set of transcriptomic modules for such analysis. We apply independent component analysis to the large collection of microarray data in GEO in order to discover reproducible transcriptomic modules that can be used as features for machine learning. We evaluate the usability of these modules across six studies, and demonstrate (1) their usage as features for sample classification and their robustness in dealing with small training sets, (2) their regularization of data when clustering samples, and (3) the biological relevance of differentially expressed features.
Results We identified 139 reproducible transcriptomic modules, which we term fundamental components (FCs). In studies with fewer than 50 samples, FC-space classification models outperformed their gene-space counterparts, with higher sensitivity (p < 0.01). The models also had higher accuracy and negative predictive value (p < 0.01) for small data sets (fewer than 30 samples). Additionally, we observed a reduction in batch effects when data is clustered in the FC-space. Finally, we found that differentially expressed FCs mapped to GO terms that were also identified via traditional gene-based approaches.
Conclusions The 139 FCs provide a biologically relevant summarization of transcriptomic data, and their performance in low-sample settings suggests that they should be employed in such studies in order to harness the data efficiently.
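To make the idea of using a fixed set of modules as features more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical loadings matrix with one row per module (for instance, components estimated by independent component analysis on a reference compendium) and projects gene-space samples into module space by ordinary least squares. The function name `project_to_modules`, the matrix shapes, and the synthetic data are all illustrative assumptions; the record does not state how the paper computes FC-space coordinates.

```python
import numpy as np


def project_to_modules(expression, module_loadings):
    """Project samples from gene space into module space (illustrative only).

    expression:      (n_samples, n_genes) array of expression values.
    module_loadings: (n_modules, n_genes) array, one row per module.
                     Hypothetical stand-in for a fixed set of modules.
    Returns an (n_samples, n_modules) array of least-squares module scores.
    """
    # Solve module_loadings.T @ scores.T ~= expression.T for the scores.
    scores_t, *_ = np.linalg.lstsq(module_loadings.T, expression.T, rcond=None)
    return scores_t.T


# Synthetic data only: 12 samples, 200 genes, 5 modules.
rng = np.random.default_rng(0)
n_genes, n_modules, n_samples = 200, 5, 12
loadings = rng.laplace(size=(n_modules, n_genes))
true_scores = rng.normal(size=(n_samples, n_modules))
noise = 0.01 * rng.normal(size=(n_samples, n_genes))
expression = true_scores @ loadings + noise

# Each sample is now described by 5 module scores instead of 200 genes.
fc_features = project_to_modules(expression, loadings)
print(fc_features.shape)
```

The resulting low-dimensional matrix could then be passed to any classifier or clustering routine in place of the gene-level matrix, which is the kind of use the summary describes for small training sets.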