Highly accurate assembly polishing with DeepPolisher

Bibliographic Details
Published in	Genome Research, Vol. 35, no. 7, pp. 1595-1608
Main Authors	Mastoras, Mira; Asri, Mobin; Brambrink, Lucas; Hebbar, Prajna; Kolesnikov, Alexey; Cook, Daniel E.; Nattestad, Maria; Lucas, Julian; Won, Taylor S.; Chang, Pi-Chuan; Carroll, Andrew; Paten, Benedict; Shafin, Kishwar
Format	Journal Article
Language	English
Published	United States: Cold Spring Harbor Laboratory Press, 01.07.2025
Subjects	Algorithms; Diploids; Genome, Human; Genomes; Genomics - methods; High-Throughput Nucleotide Sequencing - methods; Homozygote; Humans; Sequence Analysis, DNA - methods; Software
ISSN	1088-9051; 1549-5469
DOI	10.1101/gr.280149.124

Summary:	Accurate genome assemblies are essential for biological research, but even the highest-quality assemblies retain errors caused by the technologies used to construct them. Base-level errors are typically fixed with an additional polishing step that uses reads aligned to the draft assembly to identify necessary edits. However, current methods struggle to find a balance between over- and underpolishing. Here, we present an encoder-only transformer model for assembly polishing called DeepPolisher, which predicts corrections to the underlying sequence using Pacific Biosciences (PacBio) HiFi read alignments to a diploid assembly. Our pipeline introduces a method, PHAsing Reads in Areas Of Homozygosity (PHARAOH), which uses ultralong Oxford Nanopore Technologies (ONT) data to ensure alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions. We demonstrate that the DeepPolisher pipeline can reduce assembly errors by approximately half, mostly driven by reductions in indel errors. We have applied our DeepPolisher-based pipeline to 180 assemblies from the next Human Pangenome Reference Consortium (HPRC) data release, producing an average predicted quality value (QV) improvement of 3.4 (54% error reduction) for the majority of the genome.
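
The summary pairs a QV improvement of 3.4 with a 54% error reduction. The link between the two figures can be checked with a short sketch, assuming the conventional Phred-scaled definition QV = -10 * log10(error rate). The summary does not state this definition, and the function names below are illustrative only; they are not part of DeepPolisher.

```python
import math


def qv_from_error_rate(error_rate: float) -> float:
    """Phred-scaled quality value (assumed definition): QV = -10 * log10(error rate)."""
    return -10.0 * math.log10(error_rate)


def error_reduction_from_qv_gain(qv_gain: float) -> float:
    """Fraction of errors removed when QV rises by qv_gain.

    Raising QV by d multiplies the error rate by 10 ** (-d / 10),
    so the fraction removed is 1 - 10 ** (-d / 10).
    """
    return 1.0 - 10.0 ** (-qv_gain / 10.0)


def qv_gain_from_error_reduction(fraction_removed: float) -> float:
    """Inverse of the above: QV gain implied by removing a given fraction of errors."""
    return -10.0 * math.log10(1.0 - fraction_removed)


if __name__ == "__main__":
    # QV gain of 3.4, as reported in the summary for the HPRC assemblies
    print(f"{error_reduction_from_qv_gain(3.4):.1%}")
    # QV gain that corresponds to halving the error count
    print(f"{qv_gain_from_error_reduction(0.5):.2f}")
```

Under this assumption, a gain of 3.4 QV corresponds to removing about 54% of errors, and halving the errors corresponds to a gain of about 3 QV, which is consistent with both figures quoted in the summary.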