Optimal Sparse Descriptor Selection for QSAR Using Bayesian Methods

Bibliographic Details
Published in	QSAR & combinatorial science Vol. 28; no. 6-7; pp. 645 - 653
Main Authors	Burden, F. R., Winkler, D. A.
Format	Journal Article
Language	English
Published	Weinheim: WILEY-VCH Verlag, 01.07.2009
Subjects	Bayesian methods Descriptors Feature selection Medicinal chemistry Structure-activity relationships
ISSN	1611-020X 1611-0218
DOI	10.1002/qsar.200810173

Summary:	Choosing a set of molecular descriptors (features) that is most relevant to a given biological response variable is a very important problem in QSAR that has not been solved in an optimal, robust way. It is an interesting and important class of mathematical problems, where the number of variables greatly outweighs the number of observations (grossly underdetermined systems). We have used two Bayesian approaches to carry out this task using a suite of QSAR data sets. We employed a specialized sparse Bayesian feature reduction method based on an EM algorithm with a Laplacian prior to select a small set of the most relevant descriptors for modeling the response variables from a much larger pool of possibilities. Having chosen the optimum descriptors in a supervised manner, we used a Bayesian regularized neural network to carry out nonlinear regression and derive robust parsimonious QSAR models for five drug data sets. Models were validated using independent test sets, and results compared with other contemporary descriptor selection methods. Issues around validating small QSAR data sets were also discussed in detail. The sparse feature selection algorithm proved to be an excellent, robust method for selecting descriptors for QSAR models, as it is supervised (descriptors chosen in a context‐dependent manner), parsimonious (models not overly complex), and inherently interpretable. Coupled to a robust parsimonious nonlinear modeling method such as the Bayesian regularized neural net, the combination provides a means of optimally modeling the data, and allowing interpretation of the model in terms of the most relevant descriptors.
Bibliography:	ark:/67375/WNG-CCR6BKR4-7 ArticleID:QSAR200810173 istex:2392194E65E67C988BED7B8B809DF5B5BB97C573
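
The summary describes sparse descriptor selection by an EM algorithm with a Laplacian prior. As a rough illustration of that general idea, and not the authors' implementation, the sketch below fits a linear model in which the Laplacian prior is treated as a scale mixture of Gaussians: the E-step reweights each coefficient by its current magnitude and the M-step solves a reweighted ridge problem. The function name `sparse_em_laplace`, the parameters `lam` and `sigma2`, the selection threshold, and the synthetic data are all assumptions made for this example; the record gives no such details.

```python
import numpy as np


def sparse_em_laplace(X, y, lam=5.0, sigma2=1.0, n_iter=200, tol=1e-8):
    """MAP estimate of linear weights under a Laplacian prior via EM.

    Hypothetical helper for illustration only. lam is the assumed prior
    rate and sigma2 the assumed noise variance.
    """
    n, p = X.shape
    XtX = X.T @ X
    Xty = X.T @ y
    # Start from a lightly regularized least-squares fit so that no
    # coefficient begins at exactly zero.
    w = np.linalg.solve(XtX + 1e-3 * np.eye(p), Xty)
    for _ in range(n_iter):
        # E-step: the expected inverse prior variance of w_i is lam / |w_i|.
        # Working with u_i = sqrt(|w_i|) avoids dividing by zero.
        u = np.sqrt(np.abs(w))
        # M-step: reweighted ridge solve, w = U (sigma2*lam*I + U X'X U)^-1 U X'y
        A = sigma2 * lam * np.eye(p) + (u[:, None] * XtX) * u[None, :]
        w_new = u * np.linalg.solve(A, u * Xty)
        converged = np.max(np.abs(w_new - w)) < tol
        w = w_new
        if converged:
            break
    return w


# Synthetic, underdetermined example: 30 "compounds", 60 "descriptors",
# of which only three (columns 3, 17, 42) actually drive the response.
rng = np.random.default_rng(0)
n_samples, n_descriptors = 30, 60
X = rng.standard_normal((n_samples, n_descriptors))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each descriptor
true_w = np.zeros(n_descriptors)
true_w[[3, 17, 42]] = [4.0, -2.0, 1.5]
y = X @ true_w + 0.1 * rng.standard_normal(n_samples)
y = y - y.mean()

w = sparse_em_laplace(X, y)
# Keep descriptors whose weight magnitude exceeds an arbitrary threshold.
selected = np.flatnonzero(np.abs(w) > 1e-3)
print("selected descriptor indices:", selected)
```

In a workflow like the one the summary outlines, the retained descriptor columns `X[:, selected]` would then be passed to a separate nonlinear regression model; the Bayesian regularized neural network mentioned in the summary is not sketched here.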