Reference-based genome compression using the longest matched substrings with parallelization consideration

Bibliographic Details
Published in	BMC bioinformatics Vol. 24; no. 1; pp. 1 - 16
Main Authors	Lu, Zhiwen; Guo, Lu; Chen, Jianhua; Wang, Rongshu
Format	Journal Article
Language	English
Published	London: BioMed Central, 30.09.2023 (BioMed Central Ltd; Springer Nature B.V; BMC)
Subjects	Algorithms; Arrays; Bioinformatics; Biomedical and Life Sciences; Compression; Compression ratio; Computation; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; CUDA; Data compression; Dictionaries; DNA sequencing; Gene sequencing; Genetic research; Genome compression; Genomes; Genomics; Life Sciences; Methods; Microarrays; Multiprocessing; Next-generation sequencing; Nucleotide sequence; Nucleotide sequencing; Parallel processing; Parallelization; Reference-based; Suffix array; Whole genome sequencing; China
ISSN	1471-2105
DOI	10.1186/s12859-023-05500-z

Summary:	Background: For decades, researchers have worked to accelerate genome sequencing and reduce its cost, and progress on both fronts has made genome data easier to study and analyze. However, efficiently storing and transmitting the vast amount of genome data generated by high-throughput sequencing technologies has become a challenge for data compression researchers, and genome compression algorithms that represent this data efficiently have attracted growing attention. Because current computing devices have multiple cores, making full use of that hardware through efficient parallel processing is also an important direction in the design of genome compression algorithms. Results: We propose LMSRGC, a reference-based algorithm that uses the suffix array (SA) and the longest common prefix (LCP) array to find the longest matched substrings (LMSs) for compressing genome data in FASTA format. The algorithm uses the properties of the SA and the LCP array to select all appropriate LMSs between the target genome sequence and the reference genome sequence, and then uses those LMSs to compress the target sequence. To speed up the algorithm, GPUs parallelize the construction of the SA, while multiple threads parallelize the creation of the LCP array and the filtering of LMSs. Conclusions: Experimental results demonstrate that the algorithm is competitive with current state-of-the-art algorithms in compression ratio and compression time.
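
The idea in the Summary, finding long matches between a target and a reference through a suffix array and an LCP array, can be illustrated with a small single-threaded Python sketch. This is not the authors' LMSRGC implementation. The function names, the `#` separator (assumed absent from both sequences), the `min_len` threshold, and the token format are all hypothetical, and a naive sort stands in for the GPU-based suffix array construction the record describes.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; adequate for a toy example only.
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    # Kasai's algorithm: lcp[k] is the common-prefix length of the
    # suffixes at sa[k-1] and sa[k]; lcp[0] is 0.
    n = len(s)
    rank = [0] * n
    for k, i in enumerate(sa):
        rank[i] = k
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp

def longest_matches(ref, tgt):
    # For every target position, find (length, ref_pos) of the longest
    # reference substring matching the target starting there.
    s = ref + "#" + tgt            # assumption: '#' occurs in neither input
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    n_ref, off = len(ref), len(ref) + 1
    best = [(0, -1)] * len(tgt)
    inf = float("inf")
    # Forward sweep: nearest reference suffix that sorts before each target suffix.
    ref_pos, run = -1, 0
    for k, i in enumerate(sa):
        if k > 0:
            run = min(run, lcp[k])
        if i < n_ref:
            ref_pos, run = i, inf
        elif i >= off and ref_pos >= 0 and run > best[i - off][0]:
            best[i - off] = (run, ref_pos)
    # Backward sweep: nearest reference suffix that sorts after each target suffix.
    ref_pos, run = -1, 0
    for k in range(len(sa) - 1, -1, -1):
        i = sa[k]
        if i < n_ref:
            ref_pos, run = i, inf
        elif i >= off and ref_pos >= 0 and run > best[i - off][0]:
            best[i - off] = (run, ref_pos)
        run = min(run, lcp[k])
    return best

def compress(ref, tgt, min_len=4):
    # Greedy encoding: (ref_pos, length) tuples for long matches,
    # plain strings for the unmatched literals in between.
    best = longest_matches(ref, tgt)
    tokens, literal, j = [], [], 0
    while j < len(tgt):
        length, pos = best[j]
        if length >= min_len:
            if literal:
                tokens.append("".join(literal))
                literal = []
            tokens.append((pos, length))
            j += length
        else:
            literal.append(tgt[j])
            j += 1
    if literal:
        tokens.append("".join(literal))
    return tokens

def decompress(ref, tokens):
    # Rebuild the target by copying matches from the reference.
    return "".join(t if isinstance(t, str) else ref[t[0]:t[0] + t[1]]
                   for t in tokens)

reference = "ACGTACGTTAGC"
target = "ACGTTAGGACGT"
encoded = compress(reference, target)
print(encoded)
print(decompress(reference, encoded) == target)
```

The two sweeps rely on a standard suffix array property: the longest match for a target suffix is always with the nearest reference suffix before or after it in sorted order, and its length is the minimum LCP value between the two ranks. Each sweep is independent per direction, which hints at why this filtering step lends itself to multithreading, although the sketch makes no attempt at parallelism.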