Ensemble Gain Ratio Feature Selection (EGFS) Model with Machine Learning and Data Mining Algorithms for Disease Risk Prediction

Bibliographic Details
Published in: 2020 International Conference on Inventive Computation Technologies (ICICT), pp. 590-596
Main Authors: Pasha, Syed Javeed; Mohamed, E. Syed
Format: Conference Proceeding
Language: English
Published: IEEE, 01.02.2020
DOI: 10.1109/ICICT48043.2020.9112406

Summary: Machine Learning (ML) and Data Mining (DM) play a vital role in improving tasks such as disease risk prediction in healthcare, and so help communities to be served better. According to the literature, a 12% chance of error remains in diagnoses made by medical practitioners. To reduce this error rate and further improve performance, a novel Ensemble Gain ratio Feature Selection (EGFS) model is introduced to extract the most important, highly contributing features. Accuracy, Area Under Curve (AUC), and other evaluation metrics are used rather than accuracy alone, because accuracy by itself can be misleading on an imbalanced dataset and may lead to a wrong diagnosis, seriously harming the patient's health or even costing lives. The thyroid disease dataset from the UCI ML repository is used in the experiment. The EGFS model, which combines an ensemble algorithm (the random forest) with the gain ratio algorithm, finds the most relevant and contributing features and is then paired with ML and DM algorithms such as k-nearest neighbor, logistic regression, and naïve Bayes. The highest accuracy achieved by the proposed EGFS model is 96.49% and the highest AUC recorded is 99.10%. These results significantly improve disease risk prediction and are higher than those of many recent research works, while using only the four most relevant of the twenty-eight features in the dataset, a feature reduction of 85.71%.
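
The gain ratio ranking and the feature reduction figure mentioned in the summary can be illustrated with a minimal sketch. It assumes discrete feature values and uses the textbook definition of gain ratio (information gain divided by split information). It does not reproduce the authors' EGFS model: the summary does not say how the random forest and gain ratio scores are combined, so that step is left out. The function names and the toy data are hypothetical and are not taken from the thyroid dataset.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by split information."""
    total = len(labels)
    # Group the class labels by the feature value they co-occur with.
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # A constant feature has zero split information; score it as 0.
    return info_gain / split_info if split_info > 0 else 0.0


def top_k_features(columns, labels, k):
    """Return the names of the k features with the highest gain ratio."""
    ranked = sorted(columns, key=lambda name: gain_ratio(columns[name], labels), reverse=True)
    return ranked[:k]


def reduction_percent(n_total, n_kept):
    """Percentage of features discarded by the selection step."""
    return 100.0 * (n_total - n_kept) / n_total


# Hypothetical toy data for illustration only (not the thyroid dataset).
labels = [1, 1, 0, 0, 1, 0, 1, 0]
columns = {
    "informative": [1, 1, 0, 0, 1, 0, 1, 0],  # identical to the labels
    "noisy": [1, 0, 0, 0, 1, 0, 1, 1],        # agrees with the labels on 6 of 8 rows
    "constant": [0, 0, 0, 0, 0, 0, 0, 0],     # carries no information
}
selected = top_k_features(columns, labels, k=1)
print(selected)
print(round(reduction_percent(28, 4), 2))
```

With 28 features in the dataset and 4 retained, reduction_percent(28, 4) gives the 85.71% reduction stated in the summary.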