Correcting sequencing errors in DNA coding regions using a dynamic programming approach

Bibliographic Details
Published in	Bioinformatics Vol. 11; no. 2; pp. 117 - 124
Main Authors	Xu, Y; Mural, R.J; Uberbacher, E.C
Format	Journal Article
Language	English
Published	Washington, DC Oxford University Press 01.04.1995 Oxford
Subjects	Algorithms; Biological and medical sciences; deletions; DNA; dynamic programming; errors; Exons; Fundamental and applied biological sciences. Psychology; General aspects; Humans; insertions; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); methods; nucleotide sequences; Protein Biosynthesis; Sequence Analysis, DNA; Sequence Analysis, DNA - methods; Sequence Analysis, DNA - standards; Software; Genetic code; DNA; Computerized processing; Computer Software; Detection; Algorithm; Corrections; Implementation
ISSN	0266-7061 1367-4803 1460-2059
DOI	10.1093/bioinformatics/11.2.117

Summary:	This paper presents an algorithm for detecting and 'correcting' sequencing errors that occur in DNA coding regions. The types of sequencing errors addressed are insertions and deletions (indels) of DNA bases. The goal is to make single-pass or low-redundancy sequence data more informative, reducing the need for high-redundancy sequencing for gene identification and characterization purposes. This would permit improved sequencing efficiency and reduce genome sequencing costs. The algorithm detects sequencing errors by discovering changes in the statistically preferred reading frame within a putative coding region, and then inserts a number of 'neutral' bases at a perceived reading frame transition point to make the putative exon candidate frame-consistent. We have implemented the algorithm as a front-end subsystem of the GRAIL DNA sequence analysis system to construct a version which is very error tolerant, and we also intend to use this as a testbed for further development of sequencing error correction technology. Preliminary test results have shown the usefulness of this algorithm and also exposed some of its weaknesses, suggesting possible directions for further improvement. On a test set consisting of 68 human DNA sequences with 1% randomly generated indels in coding regions, the algorithm detected and corrected 76% of the indels. The average distance between the position of an indel and the predicted one was 9.4 bases. With this subsystem in place, GRAIL correctly predicted 89% of the coding messages with 10% false messages on the 'corrected' sequences, compared to 69% correctly predicted coding messages and 11% falsely predicted messages on the 'corrupted' sequences using the standard GRAIL II method (version 1.2). The method uses a dynamic programming algorithm and runs in time and space linear in the size of the input sequence.
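
The frame-transition idea in the summary can be illustrated with a small sketch. It is not the authors' implementation. It assumes that some coding-preference statistic (not specified in this record) has already produced a score for each of the three reading frames at every position, and it uses a hypothetical `switch_penalty` to discourage spurious frame changes. The frame labels (codon starts at positions `p` with `p % 3 == frame`), the function names, and the choice of `N` as the 'neutral' base are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def best_frame_path(scores: Sequence[Sequence[float]],
                    switch_penalty: float) -> List[int]:
    """Viterbi-style DP: pick one frame (0, 1, 2) per position so that the
    summed frame scores minus a penalty per frame change is maximal.
    Time and space are linear in the number of positions."""
    if not scores:
        return []
    best = list(scores[0])          # best total ending in each frame
    back: List[List[int]] = []      # back-pointers, one triple per position
    for i in range(1, len(scores)):
        new_best, pointers = [], []
        for f in range(3):
            # Best predecessor frame g; changing frame costs switch_penalty.
            g = max(range(3),
                    key=lambda h: best[h] - (0.0 if h == f else switch_penalty))
            cost = 0.0 if g == f else switch_penalty
            new_best.append(best[g] - cost + scores[i][f])
            pointers.append(g)
        best = new_best
        back.append(pointers)
    f = max(range(3), key=lambda h: best[h])
    path = [f]
    for pointers in reversed(back):  # trace the optimal path backwards
        f = pointers[f]
        path.append(f)
    path.reverse()
    return path


def transition_points(path: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Return (position, old_frame, new_frame) wherever the frame changes."""
    return [(i, path[i - 1], path[i])
            for i in range(1, len(path)) if path[i] != path[i - 1]]


def insert_neutral_bases(seq: str, path: Sequence[int],
                         neutral: str = "N") -> str:
    """Insert 1 or 2 neutral bases at each transition so that every segment
    ends up in the frame of the first segment (assumed convention)."""
    if not seq:
        return seq
    target, inserted, out = path[0], 0, []
    for i, base in enumerate(seq):
        if i > 0 and path[i] != path[i - 1]:
            k = (target - path[i] - inserted) % 3  # always 1 or 2 here
            out.append(neutral * k)
            inserted += k
        out.append(base)
    return "".join(out)
```

With toy scores that favour frame 0 for the first six positions and frame 2 afterwards, `best_frame_path` reports a single transition at position 6, and `insert_neutral_bases` places one `N` there. How the frame scores are computed, and how the penalty should be set, are left open in this sketch.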
Bibliography:	istex:5CB27C72483A5A18F37253E2C89C6BCA0549E201; ark:/67375/HXZ-X5S4NSNF-8; ArticleID:11.2.117