A Python Clustering Analysis Protocol of Genes Expression Data Sets

Bibliographic Details
Published in	Genes Vol. 13; no. 10; p. 1839
Main Authors	Agapito, Giuseppe; Milano, Marianna; Cannataro, Mario
Format	Journal Article
Language	English
Published	Switzerland: MDPI AG, 12.10.2022
Subjects	Algorithms; Analysis; Animals; Boidae; Cluster Analysis; Clustering; Computer applications; Data Analysis; Datasets; DNA microarrays; Gene expression; Genomes; Hybridization; Information management; Medical research; Methods; Microarray Analysis; Phenotypes; Prediction models; Sensitivity analysis; Single-nucleotide polymorphism; Toxicity; Italy; SNPs; DEGs; clustering; data mining; microarrays; unsupervised learning
ISSN	2073-4425
DOI	10.3390/genes13101839

Summary:	Gene expression and SNP data hold great potential for a new understanding of disease prognosis, drug sensitivity, and toxicity evaluation. Cluster analysis is used to analyze data that contain no predefined subgroups; the goal is to use the data themselves to recognize meaningful and informative subgroups. In addition, cluster analysis supports data reduction, exposes hidden patterns, and generates hypotheses regarding the relationship between genes and phenotypes. It can also be used to identify biomarkers and to build computational predictive models. The methods used to analyze microarray data can profoundly influence the interpretation of the results. Therefore, a basic understanding of these computational tools is necessary for optimal experimental design and meaningful data analysis. This manuscript provides a protocol for analyzing gene expression data sets with the K-means and DBSCAN algorithms. The general protocol enables the analysis of omics data to identify subsets of features with low redundancy and high robustness, speeding up the identification of new biomarkers through pathway enrichment analysis. To demonstrate the effectiveness of the clustering analysis protocol, we analyze a real data set from the GEO database. Finally, the manuscript provides best practices and tips for overcoming some issues in the analysis of omics data sets through unsupervised learning.
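
The summary names K-means and DBSCAN as the two clustering algorithms in the protocol. As a rough illustration of how the two differ, the sketch below implements both in plain Python and applies them to a small made-up expression matrix. It is not the authors' code: the gene names, expression values, and the parameters k, eps, and min_samples are hypothetical, and a real analysis would more likely rely on a library such as scikit-learn.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centroids


def dbscan(points, eps, min_samples):
    """Density-based clustering; returns labels, with -1 marking noise."""
    n = len(points)
    # Neighbourhoods include the point itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:  # core point: keep expanding
                queue.extend(neighbors[j])
    return labels


# Hypothetical toy data: one expression profile (three samples) per gene.
profiles = {
    "gene_a": [1.0, 1.1, 0.9],
    "gene_b": [1.2, 1.0, 1.0],
    "gene_c": [0.9, 1.0, 1.1],
    "gene_d": [5.0, 5.1, 4.9],
    "gene_e": [5.2, 5.0, 5.0],
    "gene_f": [4.9, 5.0, 5.1],
    "gene_g": [9.0, 0.0, 9.0],  # deliberately isolated profile
}
genes = list(profiles)
matrix = [profiles[g] for g in genes]

# K-means needs k up front; here it is run on the six non-isolated genes.
km_labels, km_centroids = kmeans(matrix[:6], k=2)
# DBSCAN infers the number of clusters and can flag isolated genes as noise.
db_labels = dbscan(matrix, eps=0.5, min_samples=3)

print(dict(zip(genes[:6], km_labels)))
print(dict(zip(genes, db_labels)))
```

On this toy matrix DBSCAN returns the labels [0, 0, 0, 1, 1, 1, -1], so the isolated profile is reported as noise rather than forced into a cluster. K-means always assigns every point to one of the k centroids, which is why the choice of k and the handling of outliers usually need separate attention.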