Robust and efficient identification of biomarkers by classifying features on graphs

Bibliographic Details
Published in	Bioinformatics Vol. 24; no. 18; pp. 2023 - 2029
Main Authors	Hwang, TaeHyun; Sicotte, Hugues; Tian, Ze; Wu, Baolin; Kocher, Jean-Pierre; Wigle, Dennis A.; Kumar, Vipin; Kuang, Rui
Format	Journal Article
Language	English
Published	Oxford: Oxford University Press; Oxford Publishing Limited (England), 15.09.2008
Subjects	Algorithms; Bioinformatics; Biological and medical sciences; Biomarkers, Tumor - chemistry; Biomarkers, Tumor - genetics; Breast Neoplasms - genetics; Computational Biology - methods; Databases, Protein; Female; Fundamental and applied biological sciences. Psychology; Gene Expression Profiling - methods; General aspects; Humans; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Neoplasm Proteins - genetics; Biological marker; Identification; Graph
ISSN	1367-4803, 1367-4811, 1460-2059
DOI	10.1093/bioinformatics/btn383

Summary:	Motivation: A central problem in biomarker discovery from large-scale gene expression or single nucleotide polymorphism (SNP) data is the computational challenge of taking into account the dependence among all the features. Methods that ignore this dependence usually identify biomarkers that are not reproducible across independent datasets. We introduce a new graph-based semi-supervised feature classification algorithm that identifies discriminative disease markers by learning on bipartite graphs. Our algorithm uses network propagation to directly classify the feature nodes of a bipartite graph as positive, negative or neutral, capturing the dependence among both samples and features (clinical and genetic variables) by exploring bi-cluster structures in the graph. Two properties distinguish the algorithm: (1) it finds a globally optimal labeling that captures the dependence among all the features and thus generates highly reproducible results across independent microarray or other high-throughput datasets; (2) it can handle hundreds of thousands of features and is thus particularly useful for biomarker identification from high-throughput gene expression and SNP data. In addition, although designed for classifying features, our algorithm can simultaneously classify test samples for disease prognosis/diagnosis.
Results: We applied the network propagation algorithm to three large-scale breast cancer datasets. Our algorithm achieved competitive classification performance compared with SVMs and other baseline methods, and identified several markers with clinical or biological relevance to the disease. More importantly, our algorithm also identified highly reproducible marker genes and enriched functions across the independent datasets.
Availability: Supplementary results and source code are available at http://compbio.cs.umn.edu/Feature_Class.
Contact: kuang@cs.umn.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
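
The summary describes the method only at a high level, so the following is a minimal sketch of generic label propagation on a sample-by-feature bipartite graph, not the authors' implementation (their source code is at the URL given under Availability). The update rule, the symmetric degree normalization, the parameter names `alpha`, `n_iter` and `eps`, and the function names are all assumptions made for illustration. Labeled samples carry +1 or -1, unlabeled (test) samples carry 0, and every feature starts unlabeled; after propagation, a feature is called positive, negative or neutral according to the sign and size of its score.

```python
from math import sqrt


def propagate_bipartite(X, sample_labels, alpha=0.5, n_iter=100):
    """Generic bipartite label propagation (illustrative assumption).

    X             : list of rows, X[i][j] >= 0 is the edge weight between
                    sample i and feature j (e.g. a non-negative expression value).
    sample_labels : +1 / -1 for labeled samples, 0 for unlabeled test samples.
    Returns (sample_scores, feature_scores).
    """
    n, m = len(X), len(X[0])
    # Node degrees; fall back to 1.0 so isolated nodes do not divide by zero.
    row_deg = [sum(row) or 1.0 for row in X]
    col_deg = [sum(X[i][j] for i in range(n)) or 1.0 for j in range(m)]
    # Symmetric normalization S = D_r^{-1/2} X D_c^{-1/2}.
    S = [[X[i][j] / sqrt(row_deg[i] * col_deg[j]) for j in range(m)]
         for i in range(n)]

    f_s = [float(y) for y in sample_labels]
    f_f = [0.0] * m
    for _ in range(n_iter):
        # Features receive scores from the samples they are connected to.
        f_f = [alpha * sum(S[i][j] * f_s[i] for i in range(n))
               for j in range(m)]
        # Samples receive scores from features, pulled back toward their labels.
        f_s = [alpha * sum(S[i][j] * f_f[j] for j in range(m))
               + (1 - alpha) * sample_labels[i]
               for i in range(n)]
    return f_s, f_f


def classify_features(feature_scores, eps=1e-3):
    """Map scores to 'positive', 'negative' or 'neutral' using a threshold eps."""
    labels = []
    for score in feature_scores:
        if score > eps:
            labels.append("positive")
        elif score < -eps:
            labels.append("negative")
        else:
            labels.append("neutral")
    return labels


# Toy data: feature 0 is high in the positive samples, feature 1 in the
# negative samples, and feature 2 is the same everywhere.
X_demo = [[5, 0, 1], [4, 0, 1], [0, 5, 1], [0, 4, 1]]
y_demo = [1, 1, -1, -1]
sample_scores, feature_scores = propagate_bipartite(X_demo, y_demo)
print(classify_features(feature_scores))
```

Because the normalized weight matrix has singular values of at most 1 and `alpha` is below 1, this iteration converges to a fixed point. An unlabeled test sample can be appended as an extra row of `X` with label 0; the sign of its final score then serves as a predicted class, which mirrors the simultaneous sample classification mentioned in the summary.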
Bibliography:	ArticleID: btn383; istex: 5A44DA329A6E867FD06E1E07B8E3198163C61F5A; ark:/67375/HXZ-D89CG000-1; Associate Editor: Joaquin Dopazo