Foldcomp: a library and format for compressing and indexing large protein structure sets
Abstract Summary Highly accurate protein structure predictors have generated hundreds of millions of protein structures; these pose a challenge in terms of storage and processing. Here, we present Foldcomp, a novel lossy structure compression algorithm, and indexing system to address this challenge....
Saved in:
| Published in | Bioinformatics (Oxford, England) Vol. 39; no. 4 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published |
England
Oxford University Press
03.04.2023
|
| Subjects | |
| Online Access | Get full text |
| ISSN | 1367-4811 1367-4803 1367-4811 |
| DOI | 10.1093/bioinformatics/btad153 |
Cover
| Summary: | Abstract
Summary
Highly accurate protein structure predictors have generated hundreds of millions of protein structures; these pose a challenge in terms of storage and processing. Here, we present Foldcomp, a novel lossy structure compression algorithm, and indexing system to address this challenge. By using a combination of internal and Cartesian coordinates and a bi-directional NeRF-based strategy, Foldcomp improves the compression ratio by a factor of three compared to the next best method. Its reconstruction error of 0.08 Å is comparable to the best lossy compressor. It is five times faster than the next fastest compressor and competes with the fastest decompressors. With its multi-threading implementation and a Python interface that allows for easy database downloads and efficient querying of protein structures by accession, Foldcomp is a powerful tool for managing and analysing large collections of protein structures.
Availability and implementation
Foldcomp is a free open-source software (GPLv3) and available for Linux, macOS, and Windows at https://foldcomp.foldseek.com. Foldcomp provides the AlphaFold Swiss-Prot (2.9GB), TrEMBL (1.1TB), and ESMatlas HQ (114GB) database ready-for-download. |
|---|---|
| Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 |
| ISSN: | 1367-4811 1367-4803 1367-4811 |
| DOI: | 10.1093/bioinformatics/btad153 |