A Greedy Two-stage Gibbs Sampling Method for Motif Discovery in Biological Sequences

Bibliographic Details
Published in Journal of Information Science and Engineering Vol. 26; no. 6; pp. 2309-2318
Main Authors 刘立芳(LI-FANG LIU), 焦李成(LI-CHENG JIAO)
Format Journal Article
Language English
Published Taipei 社團法人中華民國計算語言學學會 01.11.2010
Institute of Information Science, Academia Sinica
ISSN 1016-2364
DOI 10.6688/JISE.2010.26.6.23

Summary: A greedy two-stage Gibbs sampling algorithm is presented for motif discovery in DNA or protein sequences; the accompanying software package is called Greedy Motifsam. The method is based on the position weight matrix (PWM) motif model and uses a greedy strategy to choose the initial PWM parameters. Two sampling methods are used: the site sampler finds one occurrence of the motif per sequence in the dataset, and the motif sampler finds zero or more non-overlapping occurrences of the motif in each sequence. The algorithm can discover several different motifs, with differing numbers of occurrences, in a single dataset. We test the method on the binding-site (motif) information for eukaryotic transcription factors stored in the TRANSFAC database, and compare its prediction accuracy, scalability, and reliability with those of several other methods. The method is also applied to helix-turn-helix proteins, lipocalins, and prenyltransferases. The Greedy Motifsam software is available at http://lxy.xidian.edu.cn/math/intro/teachers/qxg/MotifSAM.zip.
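
The site sampler mentioned in the summary (one motif occurrence per sequence, scored with a PWM) can be illustrated with a minimal sketch. This is not the Greedy Motifsam implementation: it assumes DNA input, a uniform background, and random rather than greedy initialisation, and the names build_pwm, site_score, and site_sampler are hypothetical.

```python
import random

ALPHABET = "ACGT"  # assumption: DNA sequences only


def build_pwm(sites, pseudocount=1.0):
    """Column-wise residue frequencies (with pseudocounts) from aligned sites."""
    width = len(sites[0])
    pwm = []
    for col in range(width):
        counts = {a: pseudocount for a in ALPHABET}
        for site in sites:
            counts[site[col]] += 1.0
        total = sum(counts.values())
        pwm.append({a: counts[a] / total for a in ALPHABET})
    return pwm


def site_score(pwm, segment, background=0.25):
    """Likelihood ratio of a segment under the PWM versus a uniform background."""
    score = 1.0
    for col, residue in enumerate(segment):
        score *= pwm[col][residue] / background
    return score


def site_sampler(seqs, width, iterations=200, seed=0):
    """Plain Gibbs site sampler: exactly one motif occurrence per sequence."""
    rng = random.Random(seed)
    # Assumption: random start positions; the paper describes a greedy choice instead.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Rebuild the PWM from the current sites of all other sequences.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, pos)) if j != i]
            pwm = build_pwm(others)
            # Score every window of the held-out sequence, then sample a new start
            # with probability proportional to its score.
            starts = list(range(len(seq) - width + 1))
            weights = [site_score(pwm, seq[k:k + width]) for k in starts]
            pos[i] = rng.choices(starts, weights=weights, k=1)[0]
    return pos
```

Calling site_sampler(seqs, width=8) on a list of DNA strings returns one start position per sequence. A motif sampler in the sense of the summary would additionally allow zero or several non-overlapping sites per sequence, which this sketch does not attempt.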