Machine learning strategies for multi-label pre-diagnosis of diseases with superficial data


Bibliographic Details
Published in	Computer methods and programs in biomedicine Vol. 269; p. 108911
Main Authors	Gou, Dengqun, Luo, Xu, Liu, Zhichen
Format	Journal Article
Language	English
Published	Ireland: Elsevier B.V., 01.09.2025
Subjects	Algorithms; Diagnosis, Computer-Assisted - methods; Electronic Health Records; General Practice; General practice pre-diagnosis; Hierarchical disease information; Humans; Machine Learning; Neural Networks, Computer; Retrospective Studies; Superficial data; Support Vector Machine
ISSN	0169-2607, 1872-7565
DOI	10.1016/j.cmpb.2025.108911

Summary:	• General practice (GP) pre-diagnosis is a seemingly simple yet complex issue.
• Deeply analyzed inherent problems in GP pre-diagnosis with limited data.
• Provided two targeted machine learning strategies with unique capabilities.
• Solved the high-dimensional and sparse issue in GP pre-diagnosis.
• Achieved disease co-occurrence identification goals in GP pre-diagnosis.

General practice (GP) pre-diagnosis, a key task in disease triage, directs patients to suitable departments despite limited data and multi-label classification challenges. To address this issue, a framework incorporating dimensionality-reduction machine learning strategies was proposed. Disease information was organized into hierarchical tiers, focusing primarily on overarching disease classifications (I-level) and their subcategories (II-level). Two machine learning strategies were introduced and embedded into the framework: the classifier chain strategy and the ensemble learning-DNN (deep neural network) strategy. In the classifier chains, the candidate base algorithms were XGBoost, RF (Random Forest), LR (Logistic Regression), and SVM (Support Vector Machine). In GP pre-diagnosis, the I-level and II-level disease information was inferred progressively. The efficacy of the methods was demonstrated on 3125 retrospective electronic medical records of patients complaining of abdominal pain. The performance metrics were AUPRC, AUROC, F1, accuracy, sensitivity, specificity, and Hamming loss. The performance of the different machine learning approaches was compared using the Friedman test, followed by the Nemenyi post-hoc test. The statistical results indicated that the Classifier chain-RF approach was optimal. For the overarching disease categories, performance was excellent, with nearly all metrics exceeding 0.90. For the disease subcategories, performance declined slightly but remained highly effective, with most metrics surpassing 0.80.
The proposed framework performed well across the reported metrics and accomplished the established objectives, contributing insights to computer-aided diagnosis in the specific area of GP pre-diagnosis. Classifier chain-RF is recommended as the approach to embed in the framework.
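
Of the metrics listed in the summary, Hamming loss, sensitivity, and specificity are the ones most specific to the multi-label setting, so a small worked example may help. The sketch below uses only the Python standard library and computes them from their standard definitions; it is not the authors' code. The function names and the toy label matrices (4 records, 3 disease categories) are hypothetical and are not taken from the study data.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of (record, label) cells where the prediction differs from the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of records")
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("each record must have the same number of labels")
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    if total == 0:
        raise ValueError("no labels to score")
    return wrong / total


def sensitivity_specificity(y_true, y_pred, label):
    """Per-label sensitivity (TP rate) and specificity (TN rate).

    Returns None for a rate whose denominator is zero.
    """
    tp = fn = tn = fp = 0
    for true_row, pred_row in zip(y_true, y_pred):
        t, p = true_row[label], pred_row[label]
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Hypothetical toy data: rows are records, columns are disease categories.
Y_TRUE = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
Y_PRED = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

if __name__ == "__main__":
    print(hamming_loss(Y_TRUE, Y_PRED))                # 3 wrong cells out of 12 -> 0.25
    print(sensitivity_specificity(Y_TRUE, Y_PRED, 1))  # (0.5, 0.5)
```

Lower Hamming loss is better, whereas the other listed metrics improve as they approach 1.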