Modeling 0.6 million genes for the rational design of functional cis-regulatory variants and de novo design of cis-regulatory sequences

Bibliographic Details
Published in Proceedings of the National Academy of Sciences - PNAS Vol. 121; no. 26; p. e2319811121
Main Authors Li, Tianyi; Xu, Hui; Teng, Shouzhen; Suo, Mingrui; Bahitwa, Revocatus; Xu, Mingchi; Qian, Yiheng; Ramstein, Guillaume P.; Song, Baoxing; Buckler, Edward S.; Wang, Hai
Format Journal Article
Language English
Published United States: National Academy of Sciences, 25.06.2024
ISSN 0027-8424
1091-6490
DOI 10.1073/pnas.2319811121

Summary: Significance: The enormous variation space and obscure syntax rules of eukaryotic transcriptional regulatory DNA sequences hamper their rational design. Here, we developed PhytoExpr, a deep learning framework that reads regulatory DNA sequences to predict their messenger ribonucleic acid (mRNA) abundance and also the plant species they are from. PhytoExpr was trained over major clades of the plant kingdom to make predictions on unseen gene families from unseen species. The sequence features learned by PhytoExpr were enriched with conserved noncoding sequences, transcription factor binding sites, and eQTLs. We also fit PhytoExpr into two algorithms for the rational design of functional cis-regulatory variants for genome editing, as well as the de novo design of species-specific cis-regulatory DNA sequences for synthetic biology.
Rational design of plant cis-regulatory DNA sequences without expert intervention or prior domain knowledge is still a daunting task. Here, we developed PhytoExpr, a deep learning framework capable of predicting both mRNA abundance and plant species using the proximal regulatory sequence as the sole input. PhytoExpr was trained over 17 species representative of major clades of the plant kingdom to enhance its generalizability. Via input perturbation, quantitative functional annotation of the input sequence was achieved at single-nucleotide resolution, revealing an abundance of predicted high-impact nucleotides in conserved noncoding sequences and transcription factor binding sites. Evaluation of maize HapMap3 single-nucleotide polymorphisms (SNPs) by PhytoExpr demonstrates an enrichment of predicted high-impact SNPs in cis-eQTL. Additionally, we provided two algorithms that harnessed the power of PhytoExpr in designing functional cis-regulatory variants, and de novo creation of species-specific cis-regulatory sequences through in silico evolution of random DNA sequences. Our model represents a general and robust approach for functional variant discovery in population genetics and rational design of regulatory sequences for genome editing and synthetic biology.
Edited by Daniel Voytas, University of Minnesota Twin Cities, Saint Paul, MN; received November 12, 2023; accepted May 14, 2024
T.L., H.X., and S.T. contributed equally to this work.