Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone

Bibliographic Details
Published in BMC Bioinformatics Vol. 18; no. 1; pp. 349 - 14
Main Authors Ruiz-Blanco, Yasser B., Agüero-Chapin, Guillermin, García-Hernández, Enrique, Álvarez, Orlando, Antunes, Agostinho, Green, James
Format Journal Article
Language English
Published London BioMed Central 21.07.2017
BioMed Central Ltd
Springer Nature B.V
BMC
ISSN 1471-2105
DOI 10.1186/s12859-017-1758-x

Summary: Advances in both next-generation sequencing (NGS) technologies and mass spectrometry-based proteomics have allowed the continuous growth of available proteomes and metaproteomes in biological databases. However, the high structural variety of proteins in known proteomes makes protein functional characterization a challenging task in modern computational biology and bioinformatics [1]. As manually curated annotations are available only for a small portion of investigated systems, the wealth of genomic and transcriptomic information generated by NGS technologies [2] requires the use of accurate computational annotation tools [3]. The same is true for the functional annotation of 3D structures in databases such as the PDB [4], SCOP [5] and CATH [6], as biologically uncharacterized proteins are continuously being incorporated into these databases; currently about 3725 structures in the PDB are classified as ‘unknown function’. The assignment of a functional class to a query protein is a complex problem, not only because of the structural complexity but also because a single protein can have multiple functions, owing either to its multiple domains or to its subcellular locations and substrate concentrations [7]. Nevertheless, protein functional inferences have traditionally relied on structural/sequence similarities provided by alignment-based algorithms. The most common alignment-based (AB) approaches used in genomic and amino acid sequence databases to identify protein functional signals include the Smith-Waterman algorithm [8], the Basic Local Alignment Search Tool (BLAST) suite of programs [9], and profile Hidden Markov Models (HMMs) [10]. Profile HMMs are at the core of the popular Protein family (Pfam) database [11]. In particular, for the effective identification of enzymatic functions within proteomes, BLAST and HMMs have been implemented in the annotation pipeline of EnzymeDetector, along with the integration of the main biological databases [12].
Despite the large success of these methods, sequence-similarity-based approaches often fail when attempting to align proteins that share less than 30-40% identity. Alignments within this so-called twilight zone are often unreliable, resulting in reduced prediction accuracy [13, 14]. This handicap has caused a sustained increase in the number of unannotated proteins during the examination of genomes and proteomes from a variety of organisms and environmental samples. Consequently, alignment-free (AF) approaches are needed to overcome such limitations, to accurately detect gene/protein signatures within the twilight zone, and to provide clues about the functional classes (e.g., enzymes or non-enzymes) of subsets of uncharacterized proteins. Given the predominance of AB approaches for predicting the function of a protein, we considered it interesting and valuable to examine the state of the art of AF methods and make our own contribution to this field. In this sense, we believe that the development of general-purpose AF prediction methods, based on new protein structure descriptors, can help to improve the prediction of protein functional classes such as those at the top of the hierarchy: enzymes and non-enzymes. This discrimination challenges current classification approaches because of the intrinsic structural and functional diversity of these classes. Generally, AF methods have been based on descriptions of amino acid composition, such as the one reported in Ref. [15] to detect remote members of the G-protein-coupled receptor superfamily using support vector machines (SVMs). Also, AF descriptors such as the amino acid content and amino-acid-pair association rules were used along with several classification methods to categorize protein sequences [16]. The web server Composition-based Protein identification (COPid) was developed to annotate the function of a full or partial protein strictly from its composition [17].
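As a rough illustration of the simplest kind of alignment-free descriptor mentioned above, the following Python sketch computes a 20-dimensional amino acid composition vector. It is a minimal example under stated assumptions, not the encoding used by COPid or by the methods in Refs. [15, 16]; the function name `amino_acid_composition` and the handling of non-standard residues (they are ignored) are choices made here for illustration.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (alphabetical order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in `sequence`.

    Hypothetical helper: residues outside the 20 standard letters
    (e.g. X, B, Z) are ignored, which is an assumption of this sketch.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # One fraction per residue type, in the fixed order of AMINO_ACIDS.
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

Because the vector has a fixed length regardless of sequence length, two proteins can be compared or fed to a classifier without aligning them, which is the property that makes such descriptors alignment-free.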
Among the most popular AF protein features are those based on Chou’s concept of pseudo amino acid composition (PseAAC), initially used to leverage the effect of sequence order together with the amino acid composition to improve the prediction quality of protein cellular attributes [18]. This concept has been widely used to predict many protein attributes [19-21], including functional assignments such as whether a protein sequence is an enzyme or not, as well as the enzyme class it belongs to [22, 23]. The experience gained by Chou et al. in detecting and sub-classifying enzyme-like proteins was summarized in the EzyPred web server [24]. In a similar way to Chou’s descriptors, Caballero and Fernandez defined Amino Acid Sequence Autocorrelation (AASA) vectors but, instead of using a distance function (the difference between pairs of property values) as in PseAAC, they used autocorrelation (the product of pairs of property values). This latter approach was applied to predict the conformational stability of human lysozyme mutants [25]. AASA is an extension of the Broto-Moreau autocorrelation topological indices previously used in structure-activity relationship (SAR) studies of protein sequences [26]. Until recently, the most comprehensive computational tool for the generation of AF descriptors of amino acid sequences was the server PROFEAT [27]. This server gathers most of the above-mentioned approaches in a flexible computational tool enabling the generation of thousands of features per query protein. Other efforts for efficient numerical encoding of proteins involve the extension of molecular descriptors, originally defined for small and mid-sized molecules, into protein descriptors. Following this methodology, Gonzalez-Diaz et al. have extended their Markovian stochastic descriptors to characterize protein sequences [28].
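The contrast drawn above between a distance-based term and an autocorrelation term can be sketched in a few lines of Python. This is a simplified, single-property illustration, not the published PseAAC or AASA definitions: the function names are hypothetical, the use of the squared difference and of a plain average over residue pairs are assumptions of this sketch, and `values` stands for any per-residue property (for example a hydrophobicity scale) already mapped onto the sequence.

```python
def _lagged_pairs(values, lag):
    """Pair each property value with the value `lag` positions downstream."""
    if not 1 <= lag < len(values):
        raise ValueError("lag must satisfy 1 <= lag < len(values)")
    return list(zip(values, values[lag:]))


def distance_term(values, lag):
    """Distance-style term: mean squared difference of paired values."""
    pairs = _lagged_pairs(values, lag)
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


def autocorrelation_term(values, lag):
    """Autocorrelation-style term: mean product of paired values."""
    pairs = _lagged_pairs(values, lag)
    return sum(a * b for a, b in pairs) / len(pairs)
```

For the toy property profile `[1, 2, 4]` and a lag of 1, the pairs are (1, 2) and (2, 4), so the distance-style term is (1 + 4) / 2 = 2.5 and the autocorrelation-style term is (2 + 8) / 2 = 5.0. Evaluating either function over several lags yields a short vector that carries sequence-order information on top of the plain composition.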
In addition, graphical approaches have been validated and implemented in our program TI2BioP (Topological Indices to BioPolymers), which allows the calculation of spectral moments as topological indices from different 2D graphical approaches for DNA, RNA, and protein biopolymers [29]. We have recently introduced ProtDCal, a software package for the general-purpose numeric encoding of both protein sequences and structures [30]. This software uses a distinctive divide-and-conquer methodology based on extracting diverse groups of amino acids and aggregating the contributions of the residues in each group into scalar descriptors, giving rise to a vast number of features that balance local and global characteristics of the protein sequence and structure. Principal component analysis has been used to demonstrate the distinct information content of ProtDCal’s descriptors relative to PROFEAT among representatives from the different sequence-based descriptor families encoded by these two programs. The applicability of ProtDCal’s sequence-based descriptors for automatic functional annotation was first illustrated in the classification of the N-glycosylation state of asparagine residues of human and mammalian proteins [30, 31]. Recently, sequence-based features derived from ProtDCal were also used in the development of a multi-target predictor of antibacterial peptides against 50 Gram-positive bacteria [32]. However, the utility of the 3D structure features generated using ProtDCal has not yet been demonstrated. Therefore, this work first aims to validate the applicability of different families of descriptors implemented in TI2BioP and ProtDCal for the discrimination between enzymes and non-enzymes using the structurally non-redundant benchmark dataset designed by Dobson and Doig (D&D) [33]. In a second step, the resulting model is applied to distinguish enzymes and non-enzymes among a subset of uncharacterized proteins.
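The divide-and-conquer idea described above (select groups of residues, then aggregate each group's per-residue contributions into scalars) can be sketched as follows. This is only an illustration of the general pattern and does not reproduce ProtDCal's actual groups, properties, or aggregation operators; the group definitions, the aggregator set, the function name `group_descriptors`, and the choice to report 0.0 for an empty group are all assumptions made here.

```python
from statistics import mean

# Hypothetical residue groups, chosen only for illustration.
GROUPS = {
    "aliphatic": set("AVLI"),
    "charged": set("DEKR"),
    "aromatic": set("FWY"),
}

# Hypothetical aggregation operators applied to each group's values.
AGGREGATORS = {"sum": sum, "mean": mean, "max": max}


def group_descriptors(sequence, scale, groups=GROUPS, aggregators=AGGREGATORS):
    """Return one scalar per (group, aggregator) combination.

    `scale` maps one-letter residue codes to a numeric property value;
    residues missing from `scale` are skipped (an assumption of this sketch).
    """
    descriptors = {}
    for group_name, members in groups.items():
        values = [scale[r] for r in sequence if r in members and r in scale]
        for agg_name, aggregate in aggregators.items():
            # An empty group contributes 0.0 so the vector length stays fixed.
            key = f"{group_name}_{agg_name}"
            descriptors[key] = aggregate(values) if values else 0.0
    return descriptors
```

With three groups and three aggregators this already yields nine features per property; adding more groups, properties, and operators multiplies the count, which is consistent with the "vast number of features" noted above and helps explain why a feature selection step is applied afterwards.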
The descriptors of our programs represent the four largest families of AF descriptors: sequence-composition-based (0D), linear-topology-based (1D), pseudo-fold-topology-based (2D) and 3D-structure features (3D). The 0D, 1D and 3D protein descriptor families are calculated by means of ProtDCal, while the 2D descriptors are generated by TI2BioP. More information about the descriptor classes can be found in Additional file 1. We show the superior performance of a model using 3D information represented by ProtDCal’s features, relative to the previously developed 3D methods. In addition, we introduce a model using sequence-based features that rivals several of the 3D-structure-based methods evaluated on the same data. This model was compared with EzyPred and EnzymeDetector on 30 proteins that were originally uncharacterized during the annotation of the Shewanella oneidensis proteome in 2002 and that currently represent a benchmark annotation dataset [34]. Our model achieves a higher success rate than EzyPred. This result highlights that our general-purpose protein descriptors, followed by supervised feature selection, can efficiently encode subtle structural elements that distinguish enzymes from non-enzyme proteins.