Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes

Bibliographic Details
Published in	PloS one Vol. 9; no. 1; p. e86703
Main Authors	Lou, Wangchao; Wang, Xiaoqing; Chen, Fan; Chen, Yixiao; Jiang, Bo; Zhang, Hua
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 24.01.2014
Subjects	Algorithms; Amino acid sequence; Amino acids; Amino Acids - chemistry; Amino Acids - metabolism; Animals; Artificial intelligence; Basis functions; Bayes Theorem; Bayesian analysis; Bioinformatics; Biology; Classifiers; Computer Science; Datasets; Datasets as Topic; Decision trees; Deoxyribonucleic acid; DNA; DNA - chemistry; DNA - metabolism; DNA-binding protein; DNA-Binding Proteins - chemistry; DNA-Binding Proteins - metabolism; Engineering; Forest management; Functions (mathematics); Gene expression; Gene regulation; Humans; Mathematical analysis; Mathematics; Methods; Neural networks; Normal Distribution; Nucleotide sequence; Position-Specific Scoring Matrices; Predictions; Protein structure; Protein Structure, Secondary; Proteins; Radial basis function; Ribonucleic acid; RNA; RNA-binding protein; RNA-Binding Proteins - chemistry; RNA-Binding Proteins - metabolism; ROC Curve; Secondary structure; Sequence Analysis, Protein - statistics & numerical data; Solvents; China
ISSN	1932-6203
DOI	10.1371/journal.pone.0086703

More Information
Summary:	DNA-binding proteins play vital roles in gene regulation, so an efficient method for identifying them is highly desirable and would advance our understanding of protein function. In this study, we propose a new method for predicting DNA-binding proteins that ranks features with a random forest and then applies wrapper-based feature selection using a forward best-first search strategy. The features draw on the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naïve Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with a polynomial kernel, and support vector machine with a radial basis function kernel. In ten runs of five-fold cross-validation on the training benchmark dataset PDB594, DBPPred yields the highest average accuracy (0.791) and the highest average Matthews correlation coefficient (MCC, 0.583). In blind tests on the independent dataset PDB186, the model trained on the entire PDB594 dataset was compared with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND, and DBD-Threader) and yielded the highest accuracy (0.769), MCC (0.538), and area under the ROC curve (AUC, 0.790). Independent tests of DBPPred on a large dataset consisting entirely of non-DNA-binding proteins and on two RNA-binding protein datasets also showed improved or comparable quality relative to the relevant prediction methods. Moreover, for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between DNA-binding and non-DNA-binding proteins. Taken together, the experimental results indicate that DBPPred can serve as a prospective alternative predictor for large-scale determination of DNA-binding proteins.
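The summary describes wrapper-based forward feature selection around a Gaussian naïve Bayes classifier, scored with cross-validated MCC. The following is a minimal sketch of that idea only, not the authors' DBPPred code. It assumes the feature ranking is already available (the random forest ranking step is not shown), replaces best-first search with a single greedy pass over the ranked list, and uses a fixed index-based five-fold split. All function names and the toy data are hypothetical.

```python
import math

def fit_gnb(X, y):
    """Estimate log-prior, per-feature mean and variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        # Small constant avoids division by zero for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(cols, means)]
        model[label] = (math.log(n / len(y)), means, variances)
    return model

def predict_gnb(model, x):
    """Return the class with the highest Gaussian log-posterior."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    return max(model, key=lambda label: log_posterior(model[label]))

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cv_mcc(X, y, cols, k=5):
    """MCC over pooled k-fold predictions, using only the columns in cols."""
    preds = [None] * len(y)
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        test = [i for i in range(len(y)) if i % k == fold]
        model = fit_gnb([[X[i][c] for c in cols] for i in train],
                        [y[i] for i in train])
        for i in test:
            preds[i] = predict_gnb(model, [X[i][c] for c in cols])
    return mcc(y, preds)

def forward_select(X, y, ranked):
    """Greedy pass: keep a ranked feature only if it improves the CV MCC."""
    selected, best = [], float("-inf")
    for c in ranked:
        score = cv_mcc(X, y, selected + [c])
        if score > best:
            selected, best = selected + [c], score
    return selected, best

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
y = [i % 2 for i in range(20)]
X = [[2.0 * y[i] + 0.1 * (i % 3), float((2 * i) % 5)] for i in range(20)]
selected, best = forward_select(X, y, ranked=[0, 1])
print(selected, best)
```

Because the wrapper scores each candidate subset with the classifier itself, a feature that ranks well but adds nothing to the cross-validated MCC is left out, which is the property the summary relies on.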
Bibliography:	Conceived and designed the experiments: WL XW BJ HZ. Performed the experiments: WL. Analyzed the data: WL XW FC. Contributed reagents/materials/analysis tools: YC HZ. Wrote the paper: XW HZ. Competing Interests: The authors have declared that no competing interests exist.