A Highly Discriminative Hybrid Feature Selection Algorithm for Cancer Diagnosis

Bibliographic Details
Published in	TheScientificWorld Vol. 2022; pp. 1-15
Main Authors	Elemam, Tarneem; Elshrkawey, Mohamed
Format	Journal Article
Language	English
Published	Cairo: Hindawi, 09.08.2022; John Wiley & Sons, Inc; Wiley
Subjects	Accuracy; Algorithms; Analysis; Big data; Biomarkers; Cancer; cell growth; Chi-square test; Classification; data collection; Datasets; decision support systems; Decision trees; Diagnosis; Diagnostic systems; Discriminant analysis; Disease; Feature selection; Gene expression; Leukemia; Lung cancer; lung neoplasms; Machine learning; Medical diagnosis; Methods; microarray technology; Neural networks; Optimization algorithms; Ovarian cancer; ovarian neoplasms; Performance evaluation; Support vector machines; Tumors
ISSN	2356-6140, 1537-744X
DOI	10.1155/2022/1056490

More Information
Summary:	Cancer is a deadly disease caused by rapid, uncontrolled cell growth. This article proposes a machine learning (ML) algorithm for diagnosing different cancers from big data. The algorithm is a two-stage hybrid feature selection. In the first stage, an overall ranker combines the results of three filter-based feature evaluation methods: chi-squared, F-statistic, and mutual information (MI). The features are then ordered according to this combined ranking. In the second stage, a modified wrapper-based sequential forward selection searches for the optimal feature subset, using ML models such as support vector machine (SVM), decision tree (DT), random forest (RF), and K-nearest neighbor (KNN) classifiers. The algorithm was tested on four cancer microarray datasets, using 10-fold cross-validation and hyperparameter tuning, and its performance was evaluated by diagnostic accuracy. For the leukemia dataset, both the SVM and KNN models reach the highest accuracy, 100%, using only 5 features. For the ovarian cancer dataset, the SVM model reaches the highest accuracy, 100%, using only 6 features. For the small round blue cell tumor (SRBCT) dataset, the SVM model also reaches the highest accuracy, 100%, using only 8 features. For the lung cancer dataset, the SVM model reaches the highest accuracy, 99.57%, using 19 features. Compared with other algorithms, the proposed algorithm obtains superior results in both the number of selected features and diagnostic accuracy.
Bibliography:	Academic Editor: Juan Mejía-Aranguré
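
Illustrative sketch:	The two-stage procedure described in the Summary can be outlined in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The record does not say how the three filter results are combined or how the sequential forward selection is modified, so the mean-rank rule in `overall_ranking` and the "keep a feature only if the score improves" rule in `forward_select` are both assumptions. The function names, the toy scores, and `toy_accuracy` are hypothetical. In practice the filter scores might come from functions such as scikit-learn's `chi2`, `f_classif`, and `mutual_info_classif`, and `evaluate` would return the 10-fold cross-validated accuracy of a classifier such as an SVM.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def ranks_from_scores(scores: Sequence[float]) -> List[int]:
    """Return each feature's rank (0 = best), assuming higher scores are better."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, feature in enumerate(order):
        ranks[feature] = position
    return ranks


def overall_ranking(score_sets: Dict[str, Sequence[float]]) -> List[int]:
    """Stage 1 (assumed rule): order features by their mean rank across all filters."""
    per_filter = [ranks_from_scores(s) for s in score_sets.values()]
    n_features = len(per_filter[0])
    mean_rank = [mean(r[i] for r in per_filter) for i in range(n_features)]
    return sorted(range(n_features), key=lambda i: mean_rank[i])


def forward_select(
    ranked: Sequence[int],
    evaluate: Callable[[List[int]], float],
    max_features: int,
) -> Tuple[List[int], float]:
    """Stage 2 (assumed rule): walk the ranked list and keep a feature
    only if adding it strictly improves the evaluation score."""
    selected: List[int] = []
    best = float("-inf")
    for feature in ranked:
        if len(selected) >= max_features:
            break
        score = evaluate(selected + [feature])
        if score > best:
            selected.append(feature)
            best = score
    return selected, best


# Hypothetical filter scores for five features (not data from the article).
toy_scores = {
    "chi2": [0.1, 9.0, 3.0, 0.2, 5.0],
    "f_stat": [0.3, 8.0, 6.0, 0.1, 2.0],
    "mi": [0.0, 0.9, 0.5, 0.1, 0.4],
}


def toy_accuracy(subset: List[int]) -> float:
    """Stand-in for cross-validated accuracy: only features 1 and 4 help."""
    gain = {1: 0.25, 4: 0.125}
    return 0.5 + sum(gain.get(f, 0.0) for f in subset)


ranked = overall_ranking(toy_scores)
selected, best = forward_select(ranked, toy_accuracy, max_features=3)
print("ranked features:", ranked)
print("selected subset:", selected, "score:", best)
```

Because the wrapper stage only visits features in the order produced by the filter stage, it needs one model evaluation per candidate feature rather than one per remaining feature at every step, which is one plausible reason to rank before selecting on high-dimensional microarray data.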