Scalable transcriptomics analysis with Dask: applications in data science and machine learning
| Published in | BMC Bioinformatics, Vol. 23, no. 1, pp. 514-20 |
|---|---|
| Main Authors | , , | 
| Format | Journal Article | 
| Language | English | 
| Published | London: BioMed Central, 30.11.2022; Springer Nature B.V; BMC |
| ISSN | 1471-2105 |
| DOI | 10.1186/s12859-022-05065-3 | 
| Summary: | Background: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary. Methods: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science including those of concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics. Results: This review illustrates the role of Dask for boosting data science applications in different case studies. Detailed documentation and code on these procedures is made available at https://github.com/martaccmoreno/gexp-ml-dask. Conclusion: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures. |
|---|---|