Efficient and low-complexity variable-to-variable length coding for DNA storage
| Published in | BMC bioinformatics Vol. 25; no. 1; pp. 320 - 14 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published | London; BioMed Central; 01.10.2024; BioMed Central Ltd; Springer Nature B.V; BMC |
| Subjects | |
| ISSN | 1471-2105 |
| DOI | 10.1186/s12859-024-05943-y |
| Summary: | Background
Efficient DNA-based storage systems offer substantial capacity and longevity at reduced costs, addressing anticipated data growth. However, encoding data into DNA sequences is limited by two key constraints: 1) a maximum of $h$ consecutive identical bases (homopolymer constraint $h$), and 2) a GC ratio within $[0.5 - c_{GC},\ 0.5 + c_{GC}]$ (GC content constraint $c_{GC}$). Sequencing or synthesis errors tend to increase when these constraints are violated.
Results
In this research, we address a pure source coding problem in the context of DNA storage, considering both homopolymer and GC content constraints. We introduce a novel coding technique that adheres to these constraints while maintaining linear complexity for increased block lengths and achieving near-optimal rates. We demonstrate the effectiveness of the proposed method through experiments on both randomly generated data and existing files. For example, when $h = 4$ and $c_{GC} = 0.05$, the rate reached 1.988, close to the theoretical limit of 1.990. The associated code can be accessed at GitHub.
Conclusion
We propose a variable-to-variable-length encoding method that does not rely on concatenating short predefined sequences, which achieves near-optimal rates. |
|---|---|
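
The two constraints named in the summary can be made concrete with a small checker. The sketch below is illustrative only and is not taken from the article's GitHub code: the function names (`max_homopolymer_run`, `gc_ratio`, `satisfies_constraints`) are hypothetical, and it assumes sequences are plain strings over the alphabet A, C, G, T. It tests whether a given sequence respects a homopolymer limit $h$ and a GC ratio within $[0.5 - c_{GC},\ 0.5 + c_{GC}]$; it does not implement the proposed variable-to-variable-length encoder.

```python
from itertools import groupby


def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases (0 if empty)."""
    return max((sum(1 for _ in run) for _, run in groupby(seq)), default=0)


def gc_ratio(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def satisfies_constraints(seq: str, h: int, c_gc: float) -> bool:
    """Check the homopolymer constraint h and the GC content constraint c_gc.

    Assumption: the GC interval is treated as closed, with a small tolerance
    so that floating-point rounding does not reject boundary values.
    """
    run_ok = max_homopolymer_run(seq) <= h
    gc_ok = abs(gc_ratio(seq) - 0.5) <= c_gc + 1e-12
    return run_ok and gc_ok
```

With the example parameters from the summary ($h = 4$, $c_{GC} = 0.05$), `satisfies_constraints("ACGTACGT", 4, 0.05)` holds, while a sequence such as `"AAAAACGT"` fails the homopolymer check because it contains a run of five identical bases.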