Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance

Bibliographic Details
Published in	Technologies (Basel) Vol. 9; no. 3; p. 52
Main Authors	Ahsan, Md; Mahmud, M.; Saha, Pritom; Gupta, Kishor; Siddique, Zahed
Format	Journal Article
Language	English
Published	Basel MDPI AG 01.09.2021
Subjects	Accuracy; Algorithms; Artificial intelligence; automated model; Cardiovascular disease; Classifiers; Coronaviruses; COVID-19; Data analysis; Data conversion; data scaling; Datasets; Decision support systems; Decision trees; Deep learning; Diagnosis; Discriminant analysis; Distance learning; Feature selection; Heart; heart disease; Heart diseases; Machine learning; machine learning algorithm; Medical diagnosis; Medical research; Missing data; Patients; prediction; Regression analysis; Robustness (mathematics); Scaling; Severe acute respiratory syndrome coronavirus 2; Statistical analysis; Support vector machines
ISSN	2227-7080
DOI	10.3390/technologies9030052

Summary:	Heart disease, one of the main contributors to high mortality worldwide, requires a sophisticated and expensive diagnostic process. Recent literature has shown that machine learning approaches can help diagnose heart disease patients efficiently. However, dataset problems such as missing data, inconsistent data, and mixed data (numerical and categorical values with inconsistent or missing entries) are common obstacles in medical diagnosis. These inconsistencies raise the probability of misprediction and misleading results. Data preprocessing steps such as feature reduction, data conversion, and data scaling are used to produce a standard dataset, and they play a crucial role in reducing inaccuracy in the final prediction. This paper evaluates eleven machine learning (ML) algorithms and six data scaling methods on a dataset of patients with heart disease. The algorithms are Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naive Bayes (NB), Support Vector Machine (SVM), XGBoost (XGB), Random Forest Classifier (RF), Gradient Boost (GB), AdaBoost (AB), and Extra Tree Classifier (ET). The scaling methods are Normalization (NR), Standscale (SS), MinMax (MM), MaxAbs (MA), Robust Scaler (RS), and Quantile Transformer (QT). The results show that CART combined with RS or QT outperforms all other ML algorithms, with 100% accuracy, 100% precision, 99% recall, and 100% F1 score. The study outcomes demonstrate that model performance varies with the data scaling method.
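
As a rough illustration of how four of the scaling methods named in the summary differ, the sketch below applies the usual textbook definitions of standard, min-max, max-abs, and robust scaling to a single feature. The function names and the toy values are hypothetical and are not taken from the paper; the formulas are assumed to correspond to the SS, MM, MA, and RS abbreviations, and the paper's actual implementation may differ. Normalization and the quantile transform are omitted for brevity.

```python
import statistics

def standard(xs):
    """Center on the mean and divide by the population standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def min_max(xs):
    """Rescale to the range [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def max_abs(xs):
    """Divide by the largest absolute value, so results lie in [-1, 1]."""
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]

def robust(xs):
    """Center on the median and divide by the interquartile range."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - med) / (q3 - q1) for x in xs]

# Hypothetical feature column with one extreme value (not from the paper's dataset).
values = [1.0, 2.0, 3.0, 4.0, 100.0]

for name, scaler in [("standard", standard), ("min-max", min_max),
                     ("max-abs", max_abs), ("robust", robust)]:
    print(name, [round(v, 3) for v in scaler(values)])
```

With an extreme value in the column, min-max and max-abs scaling squeeze the remaining values into a narrow band near zero, while the median and interquartile range used by robust scaling keep them spread out. Differences of this kind are one plausible reason a model's performance can depend on the scaling method.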