EC: an efficient error correction algorithm for short reads

Bibliographic Details
Published in	BMC bioinformatics Vol. 16; no. Suppl 17; p. S2
Main Authors	Saha, Subrata; Rajasekaran, Sanguthevar
Format	Journal Article
Language	English
Published	London: BioMed Central Ltd, 07.12.2015
Subjects	Algorithms; Bioinformatics; Biomedical and Life Sciences; Comparative analysis; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Computer Simulation; Databases, Nucleic Acid; Genomics; High-Throughput Nucleotide Sequencing - methods; Life Sciences; Microarrays; Sequence Analysis, DNA - methods; Error Corrector; Hash Table; Reference Genome; Synthetic Dataset; Bloom Filter
ISSN	1471-2105
DOI	10.1186/1471-2105-16-S17-S2

More Information
Summary:	Background: Highly parallel next-generation sequencing (NGS) techniques produce millions to billions of short reads from a genomic sequence in a single run. Because of the limitations of NGS technologies, the reads may contain errors. The error rate can be reduced by trimming the reads and by correcting their erroneous bases. Correcting the reads first yields higher-quality data and greatly reduces the computational complexity of many biological applications. We have developed a novel error correction algorithm called EC and compared it with four other state-of-the-art algorithms using both real and simulated sequencing reads. Results: Extensive and rigorous experiments show that EC is an effective, scalable, and efficient error correction tool. The real reads used in the performance evaluation are Illumina-generated short reads of various lengths; the six experimental datasets are taken from the Sequence Read Archive (SRA) at NCBI. The simulated reads are obtained by picking substrings from random positions of reference genomes. To introduce errors, some bases of the simulated reads are changed to other bases with some probability. Conclusions: Error correction is a vital problem in biology, especially for NGS data. In this paper we present a novel algorithm, called Error Corrector (EC), for correcting substitution errors in biological sequencing reads. We plan to investigate whether the techniques introduced in this paper can also handle insertion and deletion errors. Software availability: The implementation is freely available for non-commercial purposes. It can be downloaded from http://engr.uconn.edu/~rajasek/EC.zip
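
The Summary describes how the simulated reads are built: substrings are picked from random positions of a reference genome, and some bases are then changed to other bases with some probability. The following is a minimal sketch of that procedure, not the authors' code. The function name `simulate_reads`, its parameters, the single per-base `error_rate`, and the uniform choice among the three alternative bases are all assumptions made for illustration; the record does not state the error model actually used.

```python
import random

BASES = "ACGT"


def simulate_reads(reference, num_reads, read_length, error_rate, seed=0):
    """Return (true_read, noisy_read) pairs sampled from `reference`.

    Hypothetical helper: assumes one uniform per-base substitution
    probability (`error_rate`) and no insertions or deletions.
    """
    if read_length > len(reference):
        raise ValueError("read_length exceeds reference length")
    rng = random.Random(seed)  # seeded so runs are reproducible
    pairs = []
    for _ in range(num_reads):
        # Pick a substring starting at a random position of the reference.
        start = rng.randrange(len(reference) - read_length + 1)
        true_read = reference[start:start + read_length]
        noisy = []
        for base in true_read:
            if rng.random() < error_rate:
                # Substitution error: replace with a different base,
                # chosen uniformly (an assumption of this sketch).
                noisy.append(rng.choice([b for b in BASES if b != base]))
            else:
                noisy.append(base)
        pairs.append((true_read, "".join(noisy)))
    return pairs


if __name__ == "__main__":
    toy_reference = "ACGTTGCAAGCT" * 10  # toy stand-in for a reference genome
    for true_read, noisy_read in simulate_reads(toy_reference, 3, 20, 0.05):
        print(true_read, noisy_read)
```

Keeping the error-free read next to the noisy one gives a ground truth against which an error corrector's output could be compared base by base.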