Optimal Sparse Descriptor Selection for QSAR Using Bayesian Methods

Choosing a set of molecular descriptors (features) that is most relevant to a given biological response variable is a very important problem in QSAR that has not be solved in an optimal robust way. It is an interesting and important class of mathematical problems, where the number of variables great...

Full description

Saved in:
Bibliographic Details
Published inQSAR & combinatorial science Vol. 28; no. 6-7; pp. 645 - 653
Main Authors Burden, F. R., Winkler, D. A.
Format Journal Article
LanguageEnglish
Published Weinheim WILEY-VCH Verlag 01.07.2009
WILEY‐VCH Verlag
Subjects
Online AccessGet full text
ISSN1611-020X
1611-0218
DOI10.1002/qsar.200810173

Cover

More Information
Summary:Choosing a set of molecular descriptors (features) that is most relevant to a given biological response variable is a very important problem in QSAR that has not be solved in an optimal robust way. It is an interesting and important class of mathematical problems, where the number of variables greatly outweighs the number of observations (grossly underdetermined systems). We have used two Bayesian approaches to carry out this task using a suite of QSAR data sets. We employed a specialized sparse Bayesian feature reduction method based on an EM algorithm with a Laplacian prior to select a small set of the most relevant descriptors for modeling the response variables from a much larger pool of possibilities. Having chosen the optimum descriptors in a supervised manner, we used a Bayesian regularized neural network to carry out nonlinear regression and derive robust parsimonious QSAR models for five drug data sets. Models were validated using independent test sets, and results compared with other contemporary descriptor selection methods. Issues around validating small QSAR data sets were also discussed in detail. The sparse feature selection algorithm proved to be an excellent, robust method for selecting descriptors for QSAR models, as it is supervised (descriptors chosen in a context‐dependent manner), parsimonious (models not overly complex), and inherently interpretable. Coupled to a robust parsimonious nonlinear modeling method such as the Bayesian regularized neural net, the combination provides a means of optimally modeling the data, and allowing interpretation of the model in terms of the most relevant descriptors.
Bibliography:ark:/67375/WNG-CCR6BKR4-7
ArticleID:QSAR200810173
istex:2392194E65E67C988BED7B8B809DF5B5BB97C573
ISSN:1611-020X
1611-0218
DOI:10.1002/qsar.200810173