A semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations

Bibliographic Details
Published in	BMC Bioinformatics, Vol. 22, no. Suppl 6, pp. 1-12
Main Authors	Jia, Hao; Park, Sung-Joon; Nakai, Kenta
Format	Journal Article
Language	English
Published	London: Springer Science and Business Media LLC, 02.06.2021; BioMed Central; BioMed Central Ltd; Springer Nature B.V; BMC
Subjects	Accessibility; Algorithms; Binding sites; Bioinformatics; Biology (General); Biomedical and Life Sciences; Cell lines; Chromatin; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Computer applications; Computer applications to medicine. Medical informatics; Conformation; Data mining; Datasets; Deep learning; Deoxyribonucleic acid; DNA; Epigenetics; Epigenome; Gene expression; Gene regulation; Genetic research; Genetic variation; Genomes; Genomics; Histone Code; Histones; Humans; Japan; Labels; Life Sciences; Machine learning; Methods; Microarrays; Mutation; Neural networks; Non-coding variants; Nucleotide sequence; Performance prediction; Pseudo label; QH301-705.5; R858-859.7; Semi-supervised learning; Software; Supervised Machine Learning; Support vector machines
ISSN	1471-2105
DOI	10.1186/s12859-021-03999-8

Summary:	Background: Understanding the functional effects of non-coding variants is important because they are often associated with altered gene expression and disease development. Over the past few years, many computational tools have been developed to predict their functional impact. However, the scarcity of data makes the task intrinsically difficult, so the algorithms need further improvement. In this work, we propose a novel method that employs a semi-supervised deep-learning model with pseudo labels and learns from both experimentally annotated and unannotated data. Results: We prepared known functional non-coding variants with histone marks, DNA accessibility, and sequence context in the GM12878, HepG2, and K562 cell lines. Applied to this dataset, our method showed outstanding performance compared with existing tools. Our results also indicated that the semi-supervised model with pseudo labels achieves higher predictive performance than the supervised model without pseudo labels. Interestingly, a model trained on data from one cell line is unlikely to succeed in other cell lines, which implies that non-coding variants are cell-type specific. Remarkably, we found that DNA accessibility contributes significantly to the functional consequence of variants, which suggests that an open chromatin conformation is important before non-coding variants can interact with gene regulation. Conclusions: The semi-supervised deep learning model coupled with pseudo labeling is advantageous when working with limited datasets, which are not unusual in biology. Our study provides an effective approach to finding non-coding mutations potentially associated with various biological phenomena, including human diseases.
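
The pseudo-labeling idea named in the summary can be illustrated with a minimal sketch. This is not the authors' model: a small logistic regression stands in for the deep network, and the confidence threshold, the pseudo-label weight `alpha`, the number of rounds, the function names, and the two-number toy feature vectors (placeholders for features such as histone marks or DNA accessibility) are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # w[0] is the bias; w[1:] are the feature weights.
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def fit(X, y, sample_weight, epochs=200, lr=0.5):
    # Weighted logistic regression trained by batch gradient descent.
    n_features = len(X[0])
    w = [0.0] * (n_features + 1)
    total = sum(sample_weight)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi, si in zip(X, y, sample_weight):
            err = (predict_proba(w, xi) - yi) * si
            grad[0] += err
            for j, xij in enumerate(xi):
                grad[j + 1] += err * xij
        w = [wj - lr * gj / total for wj, gj in zip(w, grad)]
    return w


def fit_with_pseudo_labels(X_lab, y_lab, X_unlab,
                           rounds=3, threshold=0.9, alpha=0.5):
    # Start from a purely supervised model on the annotated examples.
    w = fit(X_lab, y_lab, [1.0] * len(X_lab))
    for _ in range(rounds):
        X_aug, y_aug = list(X_lab), list(y_lab)
        s_aug = [1.0] * len(X_lab)
        for x in X_unlab:
            p = predict_proba(w, x)
            # Keep only confident predictions as pseudo labels (assumed rule).
            if max(p, 1.0 - p) >= threshold:
                X_aug.append(x)
                y_aug.append(1 if p >= 0.5 else 0)
                s_aug.append(alpha)  # pseudo labels count less than real ones
        # Retrain on annotated plus pseudo-labeled examples.
        w = fit(X_aug, y_aug, s_aug)
    return w


# Toy data (hypothetical): four annotated examples, many unannotated ones.
X_lab = [[-2.0, -1.5], [-1.0, -2.5], [2.0, 1.5], [1.0, 2.5]]
y_lab = [0, 0, 1, 1]
rng = random.Random(0)
X_unlab = [[rng.gauss(c, 1.0), rng.gauss(c, 1.0)]
           for c in (-2.0, 2.0) for _ in range(100)]

w = fit_with_pseudo_labels(X_lab, y_lab, X_unlab)
print("P(functional) for [3, 3]:", predict_proba(w, [3.0, 3.0]))
```

Down-weighting the pseudo-labeled examples with `alpha` is one common way to limit the damage from wrong pseudo labels; the weighting actually used in the paper is not stated in this record.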