Enhancing diabetes risk prediction through focal active learning and machine learning models


Bibliographic Details
Published in	PloS one Vol. 20; no. 7; p. e0327120
Main Authors	Zhang, Wangyouchen; Xia, Zhenhua; Cai, Guoqing; Wang, Junhao; Dong, Xutao
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 08.07.2025
Subjects	Accuracy; Algorithms; Biology and Life Sciences; Body mass index; Cardiovascular disease; Chronic illnesses; Classification; Clustering; Computer and Information Sciences; Correlation analysis; Data points; Datasets; Deep learning; Diabetes; Diabetes mellitus; Diabetes Mellitus - diagnosis; Diabetes Mellitus - epidemiology; Diagnosis; Engineering and Technology; Error reduction; Female; Foci; Glucose; Health aspects; Health risks; Humans; Hypertension; Identification methods; Labeling; Learning algorithms; Lifestyles; Machine Learning; Male; Medical diagnosis; Medicine and Health Sciences; Neural networks; Patients; Physical Sciences; Predictions; Recall; Regression analysis; Research and Analysis Methods; Resource allocation; Risk; Risk Assessment - methods; Risk Factors; Samples; Support vector machines; China
ISSN	1932-6203
DOI	10.1371/journal.pone.0327120

Summary:	To improve the effectiveness of diabetes risk prediction, this study proposes a novel method based on focal active learning strategies combined with machine learning models. Existing machine learning models often perform poorly on imbalanced medical datasets, where minority class instances such as diabetic cases are underrepresented. Our proposed Focal Active Learning method selectively samples informative instances to mitigate this imbalance, leading to better prediction outcomes with fewer labeled samples. The method integrates SHAP (SHapley Additive exPlanations) to quantify feature importance and applies attention mechanisms to dynamically adjust feature weights, enhancing model interpretability and performance in predicting diabetes risk. To address imbalanced classification in diabetes datasets, we employed a clustering-based method to identify representative data points (called foci) and iteratively constructed a smaller labeled dataset (sub-pool) around them using similarity-based sampling. This method aims to overcome common challenges, such as poor performance on minority classes and limited generalization, by enabling more efficient data utilization and reducing labeling costs. The experimental results demonstrated that our approach significantly improved the evaluation metrics for diabetes risk prediction, achieving an accuracy of 97.41% and a recall of 94.70%, outperforming traditional models that typically achieve 95% accuracy and 92% recall. Additionally, the model's generalization ability was further validated on the public Pima Indians Diabetes Database, where it outperformed traditional models in both accuracy and recall. This approach can enhance early diabetes screening in clinical settings, helping healthcare professionals reduce diagnostic errors and optimize resource allocation.
Bibliography:	Competing Interests: The authors have declared that no competing interests exist.
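
The Summary describes finding representative points (foci) by clustering and then growing a small sub-pool around them with similarity-based sampling. The record gives no implementation details, so the following is only a minimal sketch under stated assumptions: plain k-means as the clustering step, Euclidean distance as the similarity measure, and a fixed number of neighbours per focus. The function names (`find_foci`, `build_sub_pool`), the parameters, and the toy data are all hypothetical and are not taken from the article.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_foci(points, k, iters=20, seed=0):
    """Run plain k-means (an assumed stand-in for the paper's clustering step)
    and return the indices of the real samples closest to each centroid."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Map each centroid to an actual sample so it can be sent for labeling.
    foci = {
        min(range(len(points)), key=lambda i: euclidean(points[i], c))
        for c in centroids
    }
    return sorted(foci)


def build_sub_pool(points, foci, per_focus):
    """Return the foci plus the `per_focus` closest unused samples to each."""
    chosen = set(foci)
    for f in foci:
        candidates = [i for i in range(len(points)) if i not in chosen]
        candidates.sort(key=lambda i: euclidean(points[i], points[f]))
        chosen.update(candidates[:per_focus])
    return sorted(chosen)


# Toy, synthetic two-feature data: 90 majority-like and 10 minority-like rows.
rng = random.Random(42)
majority = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(90)]
minority = [[rng.gauss(6.0, 1.0), rng.gauss(6.0, 1.0)] for _ in range(10)]
data = majority + minority

foci = find_foci(data, k=2)
sub_pool = build_sub_pool(data, foci, per_focus=5)
print("foci:", foci)
print("sub-pool size:", len(sub_pool))
```

Because each focus contributes the same number of neighbours, a small cluster can be represented in the sub-pool as heavily as a large one, which is one plausible way such sampling could soften class imbalance. An iterative variant would retrain a classifier on the labeled sub-pool and repeat the selection; that loop is omitted here.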