Automated labeling in medical data: A semi-supervised density-based approach for efficient diagnosis model development
| Published in | Computers in biology and medicine Vol. 197; no. Pt A; p. 110963 |
|---|---|
| Main Authors | , , , , , |
| Format | Journal Article |
| Language | English |
| Published | United States, Elsevier Ltd, 01.10.2025 |
| ISSN | 0010-4825, 1879-0534 |
| DOI | 10.1016/j.compbiomed.2025.110963 |
| Summary: | In the rapidly expanding landscape of medical data acquisition, automated diagnosis and analysis models are needed to support healthcare practitioners. Formulating such a diagnosis model requires labeling the entire dataset manually. Both the machine learning and the human intervention tasks involved are demanding, expensive, and error-prone.
To reduce this effort, the presented work aims to improve the performance of semi-supervised learning by automating the labeling process, thereby decreasing the development cost. The approach is demonstrated on benchmark medical datasets in which only a small subset of the samples is labeled. Labels are assigned by identifying peak-density samples and constructing density clusters from the unlabeled data. The distribution of samples within the clusters is then analyzed to identify the high- and low-confidence regions. The samples within the regions are appended to the labeled dataset and mapped to the class of the peak sample. This smaller subset of the data is selected for manual labeling, and its labels can then be propagated to the rest of the data, minimizing the project budget.
The results suggest that the proposed SSDCCR (Semi-Supervised Density-Based Clustering with a Confidence Region) outperforms existing algorithms across multiple health datasets, with a significant increase in accuracy of at least 2 percent. The approach is scalable to larger datasets and memory-efficient, with lower complexity.
• This work improves semi-supervised learning by automating data labeling, reducing costs and minimizing manual annotation.
• It uses peak-density samples and density clustering to label data effectively by identifying high- and low-confidence regions.
• The proposed SSDCCR algorithm shows significant performance improvements over previous machine learning methods. |
|---|---|
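
The labeling idea described in the summary (find peak-density samples, build density clusters around them, label only the peaks by hand, then map cluster members to the peak's class and separate high- from low-confidence regions) can be illustrated with a minimal sketch. This is not the authors' SSDCCR implementation: the summary does not specify the density estimator, the peak-selection rule, or how the confidence regions are defined. The Gaussian-kernel density, the density-times-distance peak score, the quantile-based confidence radius, and every name below (`density_peak_pseudo_labels`, `cutoff`, `label_peak`, `high_conf_quantile`) are assumptions made for illustration only.

```python
import numpy as np


def density_peak_pseudo_labels(X, n_peaks, cutoff, label_peak, high_conf_quantile=0.5):
    """Illustrative density-peak pseudo-labeling (hypothetical, not SSDCCR itself).

    label_peak is a callable standing in for manual annotation: it receives the
    index of a peak sample and returns its class.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)

    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Assumed local density: Gaussian kernel, excluding the sample itself.
    rho = np.exp(-((dist / cutoff) ** 2)).sum(axis=1) - 1.0

    # For each sample, find the nearest sample of higher density ("parent").
    order = np.argsort(-rho, kind="stable")
    delta = np.full(n, dist.max())
    parent = np.full(n, -1)
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]
        parent[i] = j

    # Assumed peak rule: largest density * distance-to-denser-sample.
    score = rho * delta
    score[order[0]] = np.inf  # the globally densest sample is always a peak
    peaks = np.argsort(-score, kind="stable")[:n_peaks]

    # Build clusters: each non-peak sample joins the cluster of its parent.
    cluster = np.full(n, -1)
    cluster[peaks] = np.arange(len(peaks))
    for i in order:
        if cluster[i] == -1:
            cluster[i] = cluster[parent[i]]

    # Only the peaks are labeled "manually"; members inherit the peak's class.
    peak_labels = np.array([label_peak(int(p)) for p in peaks])
    y = peak_labels[cluster]

    # Assumed confidence rule: members closer to the peak than the chosen
    # quantile of member-to-peak distances count as high confidence.
    high_conf = np.zeros(n, dtype=bool)
    for c, p in enumerate(peaks):
        members = np.flatnonzero(cluster == c)
        radius = np.quantile(dist[members, p], high_conf_quantile)
        high_conf[members] = dist[members, p] <= radius
    return y, high_conf, peaks


# Toy usage on two well-separated synthetic blobs (not medical data).
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 30)
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)), rng.normal(10.0, 0.5, (30, 2))])
y, high_conf, peaks = density_peak_pseudo_labels(
    X, n_peaks=2, cutoff=1.0, label_peak=lambda i: truth[i]
)
```

In this sketch only two samples (the peaks) are passed to `label_peak`, which mirrors the summary's point that a small subset is labeled manually and the rest is propagated. The dense pairwise distance matrix keeps the example short but uses quadratic memory, so it says nothing about the scalability claimed for the published method.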