High-dimensional imbalanced biomedical data classification based on P-AdaBoost-PAUC algorithm

Bibliographic Details
Published in	The Journal of supercomputing Vol. 78; no. 14; pp. 16581-16604
Main Authors	Li, Xiao; Li, Kewen
Format	Journal Article
Language	English
Published	New York: Springer US, 01.09.2022; Springer Nature B.V.
Subjects	Algorithms; Biomedical data; Classification; Compilers; Computer Science; Decision trees; Interpreters; Optimization; Processor Architectures; Programming Languages; Recall; Redundancy; Statistical analysis; Imbalanced data; Feature selection; Adaptive Boosting; Pearson AUC
ISSN	0920-8542 1573-0484
DOI	10.1007/s11227-022-04509-0

Summary:	High-dimensional imbalanced biomedical data is characterized by both high dimensionality and an imbalanced class distribution. Classification accuracy can be improved by selecting low-dimensional feature subsets that are highly correlated with the classification target and have minimal mutual redundancy. However, traditional feature selection algorithms tend to select feature subsets that favor the class with the larger sample size, resulting in poor classification performance on minority samples. To address these problems, the P-AdaBoost-PAUC algorithm is proposed for high-dimensional imbalanced biomedical data classification. The algorithm makes two major contributions. First, an improved decision tree attribute optimization algorithm (DT-P) is proposed, which pays more attention to the correlation among attributes. Second, an improved AdaBoost algorithm based on probabilistic AUC (AdaBoost-PAUC) is proposed, which jointly considers misclassification probability and AUC in order to pay more attention to minority samples. Together these form an ensemble algorithm for high-dimensional imbalanced biomedical data classification, which is conducive to improving classification performance. Experimental results show that the P-AdaBoost-PAUC ensemble algorithm achieves the highest Recall, Specificity, F1, and AUC values on datasets with different imbalance rates. In particular, when the proportion of minority samples is only 12.6%, the Recall, Specificity, F1, and AUC values all exceed 0.95. Algorithm stability experiments further show that P-AdaBoost-PAUC is more stable than the other algorithms. The proposed P-AdaBoost-PAUC ensemble algorithm therefore improves classification performance on minority samples in high-dimensional imbalanced biomedical data to a certain extent.
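
The summary describes feature selection as keeping features that are highly correlated with the target while being minimally redundant with each other. The sketch below illustrates that general idea only; it is not the paper's DT-P algorithm, whose details this record does not give. It assumes a greedy score of relevance minus mean redundancy, both measured by absolute Pearson correlation. The function names `pearson` and `select_features` and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(vx * vy)


def select_features(columns, target, k):
    """Greedily pick up to k column indices.

    Score (an assumption of this sketch, not the paper's criterion):
    |r(feature, target)| minus the mean |r(feature, already chosen)|.
    """
    relevance = [abs(pearson(col, target)) for col in columns]
    chosen = []
    while len(chosen) < min(k, len(columns)):
        best, best_score = None, float("-inf")
        for j, col in enumerate(columns):
            if j in chosen:
                continue
            if chosen:
                redundancy = sum(
                    abs(pearson(col, columns[s])) for s in chosen
                ) / len(chosen)
            else:
                redundancy = 0.0
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        chosen.append(best)
    return chosen


# Hypothetical toy data: 8 samples, minority class (label 1) is 25%.
target = [0, 0, 0, 0, 0, 0, 1, 1]
columns = [
    [1, 2, 1, 2, 1, 2, 8, 9],     # f0: informative
    [2, 4, 2, 4, 2, 4, 16, 18],   # f1: exact rescaling of f0 (redundant)
    [5, 1, 4, 2, 5, 1, 3, 6],     # f2: weakly related to the target
]
selected = select_features(columns, target, 2)
print(selected)
```

On this toy input the rescaled copy of an already selected feature is penalized by its redundancy, so the weaker but non-redundant feature is chosen as the second pick.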