The GCTx format and cmap{Py, R, M, J} packages: resources for optimized storage and integrated traversal of annotated dense matrices

Abstract Motivation Facilitated by technological improvements, pharmacologic and genetic perturbational datasets have grown in recent years to include millions of experiments. Sharing and publicly distributing these diverse data creates many opportunities for discovery, but in recent years the unpre...

Full description

Saved in:
Bibliographic Details
Published inBioinformatics Vol. 35; no. 8; pp. 1427 - 1429
Main Authors Enache, Oana M, Lahr, David L, Natoli, Ted E, Litichevskiy, Lev, Wadden, David, Flynn, Corey, Gould, Joshua, Asiedu, Jacob K, Narayan, Rajiv, Subramanian, Aravind
Format Journal Article
LanguageEnglish
Published England Oxford University Press 15.04.2019
Subjects
Online AccessGet full text
ISSN1367-4803
1367-4811
1460-2059
1367-4811
DOI10.1093/bioinformatics/bty784

Cover

More Information
Summary:Abstract Motivation Facilitated by technological improvements, pharmacologic and genetic perturbational datasets have grown in recent years to include millions of experiments. Sharing and publicly distributing these diverse data creates many opportunities for discovery, but in recent years the unprecedented size of data generated and its complex associated metadata have also created data storage and integration challenges. Results We present the GCTx file format and a suite of open-source packages for the efficient storage, serialization and analysis of dense two-dimensional matrices. We have extensively used the format in the Connectivity Map to assemble and share massive datasets currently comprising 1.3 million experiments, and we anticipate that the format’s generalizability, paired with code libraries that we provide, will lower barriers for integrated cross-assay analysis and algorithm development. Availability and implementation Software packages (available in Python, R, Matlab and Java) are freely available at https://github.com/cmap. Additional instructions, tutorials and datasets are available at clue.io/code. Supplementary information Supplementary data are available at Bioinformatics online.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ISSN:1367-4803
1367-4811
1460-2059
1367-4811
DOI:10.1093/bioinformatics/bty784