Imputing missing RNA-sequencing data from DNA methylation by using a transfer learning–based neural network

Bibliographic Details
Published in	Gigascience Vol. 9; no. 7
Main Authors	Zhou, Xiang; Chai, Hua; Zhao, Huiying; Luo, Ching-Hsing; Yang, Yuedong
Format	Journal Article
Language	English
Published	United States Oxford University Press 01.07.2020
Subjects	Algorithms; Cancer; Cluster analysis; Clustering; Computational Biology - methods; CpG Islands; Datasets; Deoxyribonucleic acid; DNA; DNA Methylation; DNA sequencing; Epigenesis, Genetic; Epigenetics; Epigenomics - methods; Gene expression; Gene Expression Regulation; Gene sequencing; High-Throughput Nucleotide Sequencing; Humans; Information processing; Learning; Machine Learning; Medical prognosis; Neoplasms - diagnosis; Neoplasms - genetics; Neoplasms - mortality; Neural networks; Neural Networks, Computer; Phenotypes; Prognosis; Reproducibility of Results; Ribonucleic acid; RNA; RNA modification; Survival analysis; Transfer learning; DNA methylation; transfer learning; RNA-seq imputation; neural network
ISSN	2047-217X
DOI	10.1093/gigascience/giaa076

More Information
Summary:	Background: Gene expression plays a key intermediate role in linking molecular features at the DNA level to phenotype. However, owing to various experimental limitations, RNA-seq data are missing for many samples for which high-quality DNA methylation data exist. Because DNA methylation is an important epigenetic modification that regulates gene expression, it can be used to predict RNA-seq data. Many methods have been developed for this purpose. A common limitation of these methods is that they mainly focus on a single cancer dataset and do not fully utilize information from large pan-cancer datasets. Results: Here, we have developed a novel method, TDimpute, to impute missing gene expression data from DNA methylation data through a transfer learning–based neural network. In this method, the pan-cancer dataset from The Cancer Genome Atlas (TCGA) was used to train a general model, which was then fine-tuned on the specific cancer dataset. By testing on 16 cancer datasets, we found that our method significantly outperforms other state-of-the-art methods in imputation accuracy, with a 7–11% improvement under different missing rates. The imputed gene expression was further shown to be useful for downstream analyses, including the identification of both methylation-driving and prognosis-related genes, clustering analysis, and survival analysis on the TCGA dataset. More importantly, an independent test on the Wilms tumor dataset from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) project indicated that our method is useful for general purposes. Conclusions: TDimpute is an effective method for RNA-seq imputation with limited training samples.
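
The summary describes training a general model on a large pan-cancer dataset and then fine-tuning it on a specific cancer dataset. The sketch below illustrates that pretrain-then-fine-tune pattern only; it is not the TDimpute implementation. Everything in it is an assumption made for illustration: the synthetic data, the cohort-specific expression shift, the one-hidden-layer network, the layer sizes, learning rate and epoch counts, and the function names simulate, init_params, predict, mse, and train.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cpg, n_genes = 20, 10                      # toy sizes (assumed), not real TCGA dimensions
A = rng.normal(0.0, 0.5, (n_cpg, 8))         # hidden "true" methylation-to-expression map
B = rng.normal(0.0, 0.3, (8, n_genes))

def simulate(n_samples, shift):
    """Synthetic methylation (beta values in [0, 1]) and expression with a cohort shift."""
    X = rng.uniform(0.0, 1.0, (n_samples, n_cpg))
    Y = np.tanh(X @ A) @ B + shift + rng.normal(0.0, 0.05, (n_samples, n_genes))
    return X, Y

def init_params(n_in, n_hidden, n_out):
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def predict(params, X):
    H = np.tanh(X @ params["W1"] + params["b1"])
    return H @ params["W2"] + params["b2"]

def mse(params, X, Y):
    return float(np.mean((predict(params, X) - Y) ** 2))

def train(params, X, Y, lr=0.02, epochs=300):
    """Full-batch gradient descent on squared error; returns a new parameter dict."""
    p = {k: v.copy() for k, v in params.items()}
    n = X.shape[0]
    for _ in range(epochs):
        H = np.tanh(X @ p["W1"] + p["b1"])
        err = (H @ p["W2"] + p["b2"] - Y) / n          # gradient w.r.t. the predictions
        dH = (err @ p["W2"].T) * (1.0 - H ** 2)        # backpropagate through tanh
        p["W2"] -= lr * (H.T @ err)
        p["b2"] -= lr * err.sum(axis=0)
        p["W1"] -= lr * (X.T @ dH)
        p["b1"] -= lr * dH.sum(axis=0)
    return p

# Large "pan-cancer" set, plus a small target cohort whose expression is shifted.
X_pan, Y_pan = simulate(1000, shift=0.0)
X_tgt, Y_tgt = simulate(40, shift=1.0)       # few samples with both data types
X_test, Y_test = simulate(100, shift=1.0)    # target samples whose expression is "missing"

# Step 1: train a general model on the pan-cancer data.
pretrained = train(init_params(n_cpg, 16, n_genes), X_pan, Y_pan, epochs=500)
# Step 2: fine-tune the same weights on the small target cohort.
finetuned = train(pretrained, X_tgt, Y_tgt, epochs=300)

mse_pretrained = mse(pretrained, X_test, Y_test)
mse_finetuned = mse(finetuned, X_test, Y_test)
print("held-out target MSE, pretrained only:", mse_pretrained)
print("held-out target MSE, after fine-tuning:", mse_finetuned)
```

In this toy setup, fine-tuning starts from the pan-cancer weights rather than from a random initialization, so the small target cohort only has to correct the cohort-specific difference. Any numbers it prints describe the synthetic data alone and say nothing about the accuracy figures reported in the article.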