Machine learning strategies for multi-label pre-diagnosis of diseases with superficial data
| Published in | Computer methods and programs in biomedicine Vol. 269; p. 108911 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | Ireland, Elsevier B.V, 01.09.2025 |
| ISSN | 0169-2607, 1872-7565 |
| DOI | 10.1016/j.cmpb.2025.108911 |
| Summary: | • General practice (GP) pre-diagnosis is a seemingly simple yet complex issue.
• Deeply analyzed inherent problems in GP pre-diagnosis with limited data.
• Provided two targeted machine learning strategies with unique capabilities.
• Solved the high-dimensional and sparse issue in GP pre-diagnosis.
• Achieved disease co-occurrence identification goals in GP pre-diagnosis.
General practice (GP) pre-diagnosis is a key task in disease triage: it directs patients to suitable departments, but it must work with limited data and poses a multi-label classification problem. To address these challenges, a framework built on dimensionality-reduction machine learning strategies is proposed.
Disease information was organized into hierarchical tiers, focusing primarily on overarching disease classifications (I-level) and their subcategories (II-level). Two machine learning strategies were introduced and embedded into the framework: a classifier chain strategy and an ensemble learning-DNN (deep neural network) strategy. In the classifier chains, the candidate base algorithms were XGBoost, RF (Random Forest), LR (Logistic Regression), and SVM (Support Vector Machine). In GP pre-diagnosis, the I-level and II-level disease information was inferred progressively. The efficacy of the methods was demonstrated on 3125 retrospective electronic medical records of patients complaining of abdominal pain. The performance metrics were AUPRC, AUROC, F1, accuracy, sensitivity, specificity, and Hamming loss. The different machine learning approaches were compared using the Friedman test, followed by the Nemenyi post-hoc test.
The statistical results indicated that the Classifier chain-RF approach was optimal. For overarching disease categorizations, performance was excellent with nearly all metrics exceeding 0.90. For disease subcategories, performance slightly declined but remained highly effective, with most metrics surpassing 0.80.
The proposed framework exhibited its efficacy by performing well across various metrics and successfully accomplishing the established objectives, contributing insights to computer-aided diagnosis in the specific area of GP pre-diagnosis. Classifier chain-RF is recommended as an embedding approach. |
|---|---|
| ISSN: | 0169-2607, 1872-7565 |
| DOI: | 10.1016/j.cmpb.2025.108911 |
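
The classifier chain strategy with a Random Forest base learner, as named in the summary, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn's `ClassifierChain`, uses synthetic data from `make_multilabel_classification` in place of the medical records, and the sample size, feature count, label count, and hyperparameters are arbitrary choices for illustration. It does not model the I-level/II-level hierarchy and reports only two of the listed metrics (Hamming loss and micro-averaged F1).

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

# Synthetic multi-label data: a stand-in for the study's records (assumption).
X, Y = make_multilabel_classification(
    n_samples=600, n_features=30, n_classes=5, n_labels=2, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# Each link in the chain is a Random Forest that sees the original features
# plus the labels handled by the earlier links, so label co-occurrence can
# inform later predictions.
chain = ClassifierChain(
    RandomForestClassifier(n_estimators=100, random_state=0),
    order="random",
    random_state=0,
)
chain.fit(X_train, Y_train)

# predict() returns a 0/1 indicator matrix, one column per label.
Y_pred = chain.predict(X_test).astype(int)

# Hamming loss: fraction of individual label assignments that are wrong.
loss = hamming_loss(Y_test, Y_pred)
# Micro-averaged F1 pools true/false positives across all labels.
micro_f1 = f1_score(Y_test, Y_pred, average="micro", zero_division=0)

print(f"Hamming loss: {loss:.3f}")
print(f"Micro F1:     {micro_f1:.3f}")
```

Swapping the base estimator for `LogisticRegression` or `SVC` would give the other scikit-learn chain variants listed in the summary; the scores on synthetic data say nothing about the results reported for the study.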