CSS: Handling imbalanced data by improved clustering with stratified sampling
| Published in | Concurrency and computation Vol. 34; no. 2 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published | Hoboken, USA; John Wiley & Sons, Inc; 25.01.2022; Wiley Subscription Services, Inc |
| Subjects | |
| ISSN | 1532-0626 1532-0634 |
| DOI | 10.1002/cpe.6071 |
| Summary: | The traditional support vector machine (SVM) technique has drawbacks in dealing with imbalanced data. To address this issue, we propose an improved clustering with stratified sampling (CSS) algorithm to improve the classification performance of SVMs on imbalanced datasets. Instead of applying a single sampling method, as is common in the literature, our algorithm treats different types of classes with different sampling methods. For minority classes, the algorithm oversamples by adding normally distributed noise around every support vector to generate new samples. For majority classes, the improved clustering by fast search and find of density peaks (CFSFDP) algorithm is first applied to divide the samples into clusters and capture the latent structure of each majority class; stratified sampling is then applied to extract samples from each subcluster. Moreover, we extend this method to an ensemble classifier that uses multiple base SVM classifiers for prediction. Experimental results on several imbalanced classification datasets show that CSS is more effective than state‐of‐the‐art sampling methods. |
| Bibliography: | Funding information: Key‐Area Research and Development of Guangdong Province, 2020B01064003; Guangdong Basic and Applied Basic Research Foundation, 2019A1515010716; National Key Research and Development Plan's Key Special Program on High Performance Computing of China, 2017YFB0203201; National Natural Science Foundation of China, 61771347; Open Fund of Guangdong Key Laboratory of Digital Signal and Image Processing Technology, 2019GDDSIPL‐03; Science and Technology Project of Jiangmen City, 2019[184]; Special Project in Key Areas of Artificial Intelligence in Guangdong Universities, 2019KZDZX1017 |
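
The two sampling steps described in the summary can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the `sigma` noise scale, and the proportional per-cluster quota are hypothetical choices. The support vectors and cluster assignments are taken as given inputs here; in the paper they would come from a trained SVM and from the improved CFSFDP clustering, neither of which is reproduced below.

```python
import random
from collections import defaultdict


def oversample_around(support_vectors, n_new, sigma=0.1, seed=0):
    """Create n_new synthetic minority points by adding zero-mean Gaussian
    noise to the given support vectors, cycling through them in order.
    (sigma is an assumed noise scale, not a value from the paper.)"""
    rng = random.Random(seed)
    new_points = []
    for i in range(n_new):
        sv = support_vectors[i % len(support_vectors)]
        new_points.append([x + rng.gauss(0.0, sigma) for x in sv])
    return new_points


def stratified_undersample(samples, cluster_ids, n_keep, seed=0):
    """Keep roughly n_keep majority samples, drawing from each cluster in
    proportion to its size (at least one sample per cluster).
    (Proportional allocation is an assumption made for this sketch.)"""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for sample, cid in zip(samples, cluster_ids):
        clusters[cid].append(sample)
    total = len(samples)
    kept = []
    for cid in sorted(clusters):
        members = clusters[cid]
        quota = max(1, round(n_keep * len(members) / total))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


# Toy data: 90 majority points in two pre-assigned clusters (60 and 30),
# and two hypothetical minority support vectors.
majority = [[float(i), 0.0] for i in range(90)]
cluster_ids = [0] * 60 + [1] * 30
minority_svs = [[0.0, 5.0], [1.0, 5.0]]

reduced = stratified_undersample(majority, cluster_ids, n_keep=30)
synthetic = oversample_around(minority_svs, n_new=20)
```

With the toy data, the larger cluster contributes twice as many retained samples as the smaller one, so the reduced majority set keeps the 2:1 cluster proportions while the minority class grows by 20 noisy copies of its support vectors.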