Machine learning classification of archaea and bacteria identifies novel predictive genomic features

Bibliographic Details
Published in BMC Genomics Vol. 25; no. 1; pp. 955 - 15
Main Authors Bobbo, Tania; Biscarini, Filippo; Yaddehige, Sachithra K.; Alberghini, Leonardo; Rigoni, Davide; Bianchi, Nicoletta; Taccioli, Cristian
Format Journal Article
Language English
Published London BioMed Central 14.10.2024
BioMed Central Ltd
Springer Nature B.V
BMC
ISSN 1471-2164
DOI 10.1186/s12864-024-10832-y

Summary:
Background: Archaea and Bacteria are distinct domains of life that are adapted to a variety of ecological niches. Several genome-based methods have been developed for their accurate classification, yet many aspects of the specific genomic features that determine these differences are not fully understood. In this study, we used publicly available whole-genome sequences from bacteria (N = 2546) and archaea (N = 109). From these, a set of genomic features (nucleotide frequencies and proportions, coding sequences (CDS), non-coding, ribosomal and transfer RNA genes (ncRNA, rRNA, tRNA), Chargaff’s, topological entropy and Shannon’s entropy scores) was extracted and used as input data to develop machine learning models for the classification of archaea and bacteria.
Results: The classification accuracy ranged from 0.993 (Random Forest) to 0.998 (Neural Networks). Over the four models, only 11 examples were misclassified, especially those belonging to the minority class (Archaea). From variable importance, tRNA topological and Shannon’s entropy, nucleotide frequencies in tRNA, rRNA and ncRNA, CDS, tRNA and rRNA Chargaff’s scores have emerged as the top discriminating factors. In particular, tRNA entropy (both topological and Shannon’s) was the most important genomic feature for classification, pointing at the complex interactions between the genetic code, tRNAs and the translational machinery.
Conclusions: tRNA, rRNA and ncRNA genes emerged as the key genomic elements that underpin the classification of archaea and bacteria. In particular, higher nucleotide diversity was found in tRNA from bacteria compared to archaea. The analysis of the few classification errors reflects the complex phylogenetic relationships between bacteria, archaea and eukaryotes.
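To make the sequence-level features named in the summary more concrete, the following is a minimal Python sketch of how nucleotide frequencies and a Shannon entropy score could be computed for a single sequence. It is an illustration only, not the authors' pipeline: the function names are hypothetical, ambiguous bases are simply ignored, and `chargaff_deviation` is an assumed stand-in for a Chargaff-style score, since the record does not define the score used in the study.

```python
import math
from collections import Counter

BASES = "ACGT"


def nucleotide_frequencies(seq):
    """Return the relative frequency of A, C, G and T in a sequence.

    Assumption: characters other than A/C/G/T (e.g. N) are ignored.
    """
    counts = Counter(base for base in seq.upper() if base in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T characters")
    return {base: counts.get(base, 0) / total for base in BASES}


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the single-nucleotide distribution."""
    freqs = nucleotide_frequencies(seq)
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def chargaff_deviation(seq):
    """Hypothetical Chargaff-style score (an assumption, not the paper's).

    Mean absolute difference between A/T and G/C frequencies on one
    strand; 0 means the strand satisfies A = T and G = C exactly.
    """
    f = nucleotide_frequencies(seq)
    return (abs(f["A"] - f["T"]) + abs(f["G"] - f["C"])) / 2


if __name__ == "__main__":
    # Toy sequence for illustration only; not taken from the study's data.
    toy_trna_like = "GCGGATTTAGCTCAGTTGGGAGAGCGCCAGACTGAAGATCTGGAGGTCC"
    print(nucleotide_frequencies(toy_trna_like))
    print(shannon_entropy(toy_trna_like))
    print(chargaff_deviation(toy_trna_like))
```

Under these assumptions, a sequence that uses the four bases more evenly yields an entropy closer to the 2-bit maximum, which is one simple way to read the "higher nucleotide diversity" reported for bacterial tRNA. Topological entropy, which the summary also lists, is a different measure and is not covered by this sketch.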