Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices
| Published in | BMC genomics Vol. 21; no. 1; pp. 497 - 14 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published | London, BioMed Central, 20.07.2020; BioMed Central Ltd, Springer Nature B.V, BMC |
| Subjects | |
| ISSN | 1471-2164 |
| DOI | 10.1186/s12864-020-06892-5 |
| Summary: | Background: With the rapid growth rate of newly sequenced genomes, species tree inference from genes sampled throughout the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large scale molecular data. One of the foremost challenges is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean) require complete distance matrices without any missing data. Results: We introduce two highly accurate machine learning based distance imputation techniques. These methods are based on matrix factorization and autoencoder based deep learning architectures. We evaluated these two methods on a collection of simulated and biological datasets. Experimental results suggest that our proposed methods match or improve upon the best alternate distance imputation techniques. Moreover, these methods are scalable to large datasets with hundreds of taxa, and can handle a substantial amount of missing data. Conclusions: This study shows, for the first time, the power and feasibility of applying deep learning techniques for imputing distance matrices. Thus, this study advances the state-of-the-art in phylogenetic tree construction in the presence of missing data. The proposed methods are available in open source form at https://github.com/Ananya-Bhattacharjee/ImputeDistances. |
|---|---|
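
The summary names matrix factorization as one of the two imputation approaches. As a rough illustration of the general idea only, the sketch below fills the missing entries of a small symmetric distance matrix with a regularized low-rank factorization fitted by alternating least squares. It is not the authors' implementation and does not use the ImputeDistances code; the function name `impute_distances`, the parameters `rank`, `n_iter`, and `reg`, and the five-taxon example matrix are all assumptions made for this example.

```python
import numpy as np

def impute_distances(D, rank=2, n_iter=200, reg=0.1, seed=0):
    """Fill NaN entries of a symmetric distance matrix D.

    Hypothetical helper: fits D ~ U @ V.T on the observed entries only,
    using regularized alternating least squares (ALS).
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    mask = ~np.isnan(D)                      # True where a distance is observed
    rng = np.random.default_rng(seed)
    U = rng.random((n, rank))
    V = rng.random((n, rank))
    ridge = reg * np.eye(rank)               # keeps each small system invertible

    for _ in range(n_iter):
        # Update each row factor using only the observed entries in that row.
        for i in range(n):
            obs = mask[i]
            A = V[obs]
            U[i] = np.linalg.solve(A.T @ A + ridge, A.T @ D[i, obs])
        # Update each column factor using only the observed entries in that column.
        for j in range(n):
            obs = mask[:, j]
            A = U[obs]
            V[j] = np.linalg.solve(A.T @ A + ridge, A.T @ D[obs, j])

    approx = U @ V.T
    approx = (approx + approx.T) / 2         # enforce symmetry of the estimate
    approx = np.maximum(approx, 0.0)         # distances cannot be negative
    filled = np.where(mask, D, approx)       # keep observed values untouched
    np.fill_diagonal(filled, 0.0)
    return filled

# Made-up five-taxon distance matrix with one missing pair (taxa 0 and 3).
D = np.array([
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
], dtype=float)
D[0, 3] = D[3, 0] = np.nan

filled = impute_distances(D)
print(filled[0, 3])                          # imputed estimate for the missing pair
```

The completed matrix could then be passed to a distance-based method such as NJ or UPGMA, which, as the summary notes, require complete distance matrices. How well a low-rank fit approximates real phylogenetic distances depends on the data, so the rank and regularization here should be treated as placeholders.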