A cluster-based ensemble approach for congenital heart disease prediction
| Published in | Computer methods and programs in biomedicine Vol. 243; p. 107922 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier B.V 01.01.2024 |
| Subjects | |
| ISSN | 0169-2607 1872-7565 |
| DOI | 10.1016/j.cmpb.2023.107922 |
| Summary: | • Developed a prediction model for congenital heart disease. • A cluster-based oversampling approach is proposed. • Captures intricate details from the mothers’ lifestyle dataset. • The proposed clustering-based approach gave the highest accuracy. • The cluster information enhanced the classification prediction.
Congenital heart disease (CHD) is one of the most prevalent birth disorders. Although CHD risk factors have been the subject of numerous studies, their propensity to cause CHD has not been tested. In particular, few studies have attempted to forecast CHD risk using population-based cross-sectional data, which is inherently imbalanced.
The main goals of this study are to create a reliable data analysis model that can help with (i) a better understanding of congenital heart disease prediction in the presence of missing and unbalanced data and (ii) creating cohorts of expectant mothers with similar lifestyle characteristics.
Patient cohorts were identified with density-based spatial clustering of applications with noise (DBSCAN), an unsupervised data mining technique. A random forest model was then trained on these clusters and their corresponding patterns for more accurate CHD prediction. The study used a dataset of 33,831 expectant mothers. Missing data were handled with k-NN imputation, and the extreme class imbalance was corrected with the synthetic minority oversampling technique (SMOTE). All of these techniques are data-driven and need little to no user or expert involvement.
DBSCAN identified three cohorts. The cluster information improved the random forest-based CHD prediction and revealed intricate factors that influence prediction accuracy. The proposed approach achieved the best results, with 99% accuracy and 0.91 AUC, and outperformed the state-of-the-art methods. The suggested method therefore shows that unsupervised learning can supply intricate information to the classifier and further improve classification performance. |
|---|---|
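
The summary names four building blocks: k-NN imputation, SMOTE balancing, DBSCAN cohorts, and a random forest that uses the cluster information. The record does not give the authors' code, parameters, or exact ordering of steps, so the following is only a minimal sketch of how such a pipeline could be wired together with NumPy and scikit-learn. Everything specific here is an assumption made for illustration: the function names `smote_like_oversample` and `fit_cluster_augmented_forest`, the `eps` and `min_samples` values, the step order, the choice to pass the cluster label to the forest as an extra feature, and the synthetic demo data. The oversampler is a simplified, hand-written SMOTE-style interpolation, not a library implementation of SMOTE.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def smote_like_oversample(X, y, k=5, seed=0):
    """Balance a binary dataset by interpolating between minority neighbours.

    Simplified SMOTE-style stand-in (assumption), not a library implementation.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    X_min = X[y == minority]
    n_new = counts.max() - counts.min()
    if n_new == 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours: a point is normally its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    # Each synthetic row lies on the segment between a sample and a neighbour.
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.full(n_new, minority)])
    return X_out, y_out


def fit_cluster_augmented_forest(X_raw, y, eps=1.5, min_samples=10, seed=0):
    """Impute, scale, balance, cluster, then train a forest on features + cluster id."""
    X = KNNImputer(n_neighbors=5).fit_transform(X_raw)  # fill missing values
    X = StandardScaler().fit_transform(X)               # DBSCAN is distance-based
    X_bal, y_bal = smote_like_oversample(X, y, seed=seed)
    # DBSCAN marks noise points with the label -1.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_bal)
    X_aug = np.column_stack([X_bal, labels])            # cluster id as a feature
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_aug, y_bal)
    return model, X_aug, y_bal, labels


# Synthetic stand-in data (assumption): 300 rows, 10% positives, ~5% missing cells.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = np.zeros(300, dtype=int)
y_demo[:30] = 1
X_demo[:30] += 2.0
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

model, X_aug, y_bal, labels = fit_cluster_augmented_forest(X_demo, y_demo)
```

This sketch covers training only. scikit-learn's `DBSCAN` has no `predict` method, so scoring unseen records would need an additional rule for assigning them to a cohort, for example the cluster of the nearest training sample. It also balances the whole dataset before any evaluation split; in a real evaluation, oversampling would normally be applied to the training portion only.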