Automated labeling in medical data: A semi-supervised density-based approach for efficient diagnosis model development

Bibliographic Details
Published in	Computers in biology and medicine Vol. 197; no. Pt A; p. 110963
Main Authors	Mathews, Lincy Meera, Haneesh, Inaguri Muni Sai, Mani Sekhar, S.R., Anvitha, A.L., Shubhan, Amith, Akash, S.
Format	Journal Article
Language	English
Published	United States Elsevier Ltd 01.10.2025
Subjects	Algorithms; Confidence regions; Databases, Factual; Density based clustering; Diagnosis, Computer-Assisted - methods; Humans; Internal Medicine; Machine Learning; Other; Peak samples; Semi-supervised learning; Supervised Machine Learning
ISSN	0010-4825, 1879-0534
DOI	10.1016/j.compbiomed.2025.110963

Summary:	The rapid growth of medical data acquisition has made automated diagnosis and analysis models essential for supporting healthcare practitioners. Formulating a diagnosis model conventionally requires labeling the entire dataset manually, and both the machine learning and the human intervention tasks are demanding, expensive, and error-prone. To reduce this effort, the presented work aims to improve the performance of semi-supervised learning by automating the labeling process, thereby decreasing development cost. The approach is demonstrated on benchmark medical datasets in which only a small subset of the samples is labeled. Labeling is carried out by identifying peak density samples and constructing density clusters from the unlabeled data. The distribution of samples within each cluster is then analyzed to identify high and low confidence regions. Samples within these regions are appended to the labeled dataset and mapped to the class of the peak sample. This smaller subset of the data is selected for manual labeling, and its labels can then be propagated to the rest of the data, minimizing the project budget. The results suggest that the proposed SSDCCR (Semi-Supervised Density-Based Clustering with a Confidence Region) outperforms existing algorithms across multiple health datasets, with a significant increase of at least 2 percent in accuracy. The approach is scalable to larger datasets and memory efficient, with reduced complexity.
•This work improves semi-supervised learning by automating data labeling, reducing costs and minimizing manual annotation.
•It uses peak density samples and density clustering to label data effectively by identifying high and low confidence regions.
•The proposed SSDCCR algorithm shows significant performance improvements over previous machine learning methods.
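The labeling pipeline outlined in the summary (find peak density samples, grow density clusters around them, mark a confidence region, then map samples to the class of their peak) might be sketched roughly as follows. This is an illustrative sketch only, not the published SSDCCR algorithm: the record does not give the method's details, so the peak criterion (largest product of local density and distance to the nearest denser sample), the Gaussian density kernel, the quantile rule for the high confidence region, and every name and parameter (`density_peak_pseudo_labels`, `cutoff`, `high_conf_quantile`) are assumptions made for illustration.

```python
import numpy as np


def density_peak_pseudo_labels(X, n_peaks, cutoff, high_conf_quantile=0.5):
    """Illustrative sketch only; NOT the published SSDCCR algorithm.

    Returns (peaks, cluster, high_conf):
      peaks     - indices of the assumed peak density samples
      cluster   - cluster id (0..n_peaks-1) for every sample
      high_conf - boolean mask of an assumed "high confidence" region
    """
    n = len(X)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Local density: Gaussian-kernel neighbour count, self-contribution removed.
    rho = np.exp(-(dist / cutoff) ** 2).sum(axis=1) - 1.0

    # Visit samples from densest to sparsest.
    order = np.argsort(-rho)
    delta = np.zeros(n)            # distance to the nearest denser sample
    parent = np.full(n, -1)        # index of that denser sample
    delta[order[0]] = dist[order[0]].max()  # densest sample has no parent
    for rank in range(1, n):
        i = order[rank]
        denser = order[:rank]
        j = denser[np.argmin(dist[i, denser])]
        delta[i], parent[i] = dist[i, j], j

    # Assumed peak criterion: largest rho * delta.
    # The globally densest sample is always kept as a peak.
    gamma = rho * delta
    gamma[order[0]] = np.inf
    peaks = np.argsort(-gamma)[:n_peaks]

    # Every non-peak sample joins the cluster of its nearest denser sample.
    cluster = np.full(n, -1)
    cluster[peaks] = np.arange(n_peaks)
    for i in order:
        if cluster[i] == -1:
            cluster[i] = cluster[parent[i]]

    # Assumed confidence rule: members no farther from their peak than the
    # given quantile of within-cluster distances count as high confidence.
    high_conf = np.zeros(n, dtype=bool)
    for c, p in enumerate(peaks):
        members = np.flatnonzero(cluster == c)
        d_to_peak = dist[members, p]
        threshold = np.quantile(d_to_peak, high_conf_quantile)
        high_conf[members[d_to_peak <= threshold]] = True
    return peaks, cluster, high_conf
```

Under these assumptions, only the samples in `peaks` would need manual labels. Given a hypothetical array `peak_classes` holding those labels in the same order as `peaks`, `peak_classes[cluster][high_conf]` yields pseudo-labels for the high confidence samples that could be appended to the labeled set. How the low confidence region is treated is not specified in the summary, so it is left out here.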