NullSeq: A Tool for Generating Random Coding Sequences with Desired Amino Acid and GC Contents

Bibliographic Details
Published in	PLoS computational biology Vol. 12; no. 11; p. e1005184
Main Authors	Liu, Sophia S., Hockenberry, Adam J., Lancichinetti, Andrea, Jewett, Michael C., Amaral, Luís A. N.
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 01.11.2016
Subjects	Algorithms; Amino acid sequencing; Amino acids; Base Composition - genetics; Bias; Bioengineering; Biology and Life Sciences; DNA sequencing; Engineering; Enzymes; Funding; Genomes; Hypothesis testing; Interdisciplinary aspects; Maximum entropy; Methods; Physical Sciences; Protein Engineering - methods; Proteins - chemistry; Proteins - genetics; Research and Analysis Methods; RNA polymerase; Scholarships & fellowships; Sequence Analysis, DNA - methods; Sequence Analysis, Protein - methods; Software; United States > US Illinois
ISSN	1553-7358, 1553-734X
DOI	10.1371/journal.pcbi.1005184

More Information
Summary:	The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. To accurately identify motifs and other genome-scale patterns of interest, it is essential to generate null models that are appropriate for the sequences under study. While many tools have been developed to create random nucleotide sequences, protein-coding sequences are subject to a unique set of constraints that complicates the process of generating appropriate null models. There are currently no tools available that allow users to create random coding sequences with specified amino acid composition and GC content for the purpose of hypothesis testing. Using the principle of maximum entropy, we developed a method that generates unbiased random sequences with pre-specified amino acid and GC content, and we have implemented it as a Python package. Our method is the simplest way to obtain maximally unbiased random sequences that are subject to GC usage and primary amino acid sequence constraints. Furthermore, this approach can easily be expanded to create unbiased random sequences that incorporate more complicated constraints such as individual nucleotide usage or even di-nucleotide frequencies. The ability to generate correctly specified null models will allow researchers to accurately identify sequence motifs, which will lead to a better understanding of biological processes as well as more effective engineering of biological systems.
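The idea in the summary can be illustrated with a minimal Python sketch. It is not the NullSeq package or its API, and the published method may differ in its details. The sketch assumes a fixed amino acid sequence, the standard genetic code, and a constraint on the expected (not exact) GC fraction. Under those assumptions, the maximum-entropy codon distribution takes the usual exponential-family form: each synonymous codon is weighted by exp(lam * GC count), and lam is tuned by bisection until the expected GC content matches the target. All names here (`SYNONYMS`, `codon_probs`, `expected_gc`, `solve_lambda`, `random_cds`) are hypothetical and chosen for illustration.

```python
import math
import random
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AA_BY_CODON = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid (and "*" for stop) to its list of synonymous codons.
SYNONYMS = {}
for bases, aa in zip(product(BASES, repeat=3), AA_BY_CODON):
    SYNONYMS.setdefault(aa, []).append("".join(bases))


def gc_count(codon):
    """Number of G or C bases in a codon."""
    return sum(base in "GC" for base in codon)


def codon_probs(aa, lam):
    """Maximum-entropy codon probabilities for one amino acid at a given lam.

    Assumption: the only constraint beyond the amino acid identity is the
    expected GC content, so each codon gets weight exp(lam * GC count).
    """
    weights = [math.exp(lam * gc_count(c)) for c in SYNONYMS[aa]]
    total = sum(weights)
    return [w / total for w in weights]


def expected_gc(protein, lam):
    """Expected GC fraction of a coding sequence sampled at this lam."""
    gc = 0.0
    for aa in protein:
        probs = codon_probs(aa, lam)
        gc += sum(p * gc_count(c) for p, c in zip(probs, SYNONYMS[aa]))
    return gc / (3 * len(protein))


def solve_lambda(protein, target_gc, lo=-20.0, hi=20.0, iterations=100):
    """Find lam by bisection; expected_gc is non-decreasing in lam."""
    if not expected_gc(protein, lo) <= target_gc <= expected_gc(protein, hi):
        raise ValueError("target GC content is not reachable for this protein")
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if expected_gc(protein, mid) < target_gc:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def random_cds(protein, target_gc, rng=random):
    """Sample one random coding sequence that encodes `protein`."""
    lam = solve_lambda(protein, target_gc)
    return "".join(
        rng.choices(SYNONYMS[aa], weights=codon_probs(aa, lam))[0]
        for aa in protein
    )
```

For example, `random_cds("MKTLLVAGSR" * 20, 0.5)` returns a 600-nucleotide sequence that translates back to the input protein. The GC fraction of any single sample fluctuates around the target because only the expectation is constrained. Matching an amino acid composition instead of a fixed sequence, or constraining individual nucleotide or di-nucleotide frequencies, would need additional multipliers that this sketch does not include.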
Bibliography:	Conceptualization: SSL, AJH, AL, MCJ, LANA. Data curation: SSL, AJH. Formal analysis: SSL, AJH, LANA. Funding acquisition: SSL, AJH, MCJ, LANA. Investigation: SSL, AJH, MCJ, LANA. Methodology: SSL, AL, LANA. Project administration: MCJ, LANA. Resources: MCJ, LANA. Software: SSL. Supervision: MCJ, LANA. Validation: SSL, AJH, MCJ, LANA. Visualization: SSL, AJH, MCJ, LANA. Writing – original draft: SSL, AJH. Writing – review & editing: SSL, AJH, AL, MCJ, LANA. The authors have declared that no competing interests exist.