Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data

Bibliographic Details
Published in	BMC bioinformatics Vol. 13; no. 1; p. 100
Main Authors	Qiao, Dandi, Yip, Wai-Ki, Lange, Christoph
Format	Journal Article
Language	English
Published	London BioMed Central 16.05.2012 BioMed Central Ltd Springer Nature B.V BMC
Subjects	Algorithms; Analysis; Bioinformatics; Biomedical and Life Sciences; Computational Biology - methods; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Data Compression - methods; Data management; DNA sequencing; Gene loci; Genetic algorithms; Genetics; High-Throughput Nucleotide Sequencing - methods; Information management; Life Sciences; Methodology; Methodology Article; Methods; Microarrays; Nucleotide sequencing; Sequence analysis (methods); Software; Storage; Studies; United States; Germany; Compression Factor; Minor Allele Frequency; Compression Rate; Disk Space; Compression Algorithm
ISSN	1471-2105
DOI	10.1186/1471-2105-13-100

Summary:	Background: As Next-Generation Sequencing data becomes available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data because of its enormous size. This is, and will remain, a frequent problem encountered every day by researchers working on genetic data. Options for compressing and storing such data include general-purpose compression software, the PBAT/PLINK binary format, etc. However, these methods either do not offer sufficient compression rates or require a great amount of CPU time for decompression and loading every time the data is accessed. Results: Here, we propose a novel and simple algorithm for storing such sequencing data. We show that the compression factor of the algorithm ranges from 16 to several hundred, which potentially allows SNP data of hundreds of gigabytes to be stored in hundreds of megabytes. We provide a C++ implementation of the algorithm, which supports direct loading and parallel loading of the compressed format without requiring extra time for decompression. By applying the algorithm to simulated and real datasets, we show that it gives a greater compression rate than commonly used compression methods and that the data-loading process takes less time. The C++ library also provides direct data-retrieval functions, which allow the compressed information to be easily accessed by other C++ programs. Conclusions: The SpeedGene algorithm enables the storage and analysis of next-generation sequencing data in current hardware environments, making system upgrades unnecessary.
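
The summary does not describe the encoding itself. As a rough illustration of how SNP genotypes can be stored compactly and still be read directly, the C++ sketch below packs each genotype into two bits, four per byte. This is a generic bit-packing scheme shown under stated assumptions, not the SpeedGene algorithm or its library API: the function names `pack_genotypes` and `genotype_at` and the genotype coding (0, 1, 2 copies of the minor allele, 3 for missing) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers for illustration only; NOT the SpeedGene API.
// Assumed genotype codes: 0, 1, 2 = copies of the minor allele, 3 = missing.

// Pack one SNP's genotypes into bytes, four genotypes per byte (2 bits each).
std::vector<std::uint8_t> pack_genotypes(const std::vector<std::uint8_t>& g) {
    std::vector<std::uint8_t> out((g.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < g.size(); ++i) {
        assert(g[i] <= 3);  // each code must fit in two bits
        out[i / 4] |= static_cast<std::uint8_t>(g[i] << (2 * (i % 4)));
    }
    return out;
}

// Read genotype i straight from the packed bytes, without unpacking the rest.
std::uint8_t genotype_at(const std::vector<std::uint8_t>& packed, std::size_t i) {
    assert(i / 4 < packed.size());  // index must lie inside the packed buffer
    return (packed[i / 4] >> (2 * (i % 4))) & 0x3;
}
```

With this layout, five genotypes occupy two bytes, and any single genotype can be fetched with one shift and one mask, which is the kind of direct access to compressed data that the summary refers to.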