A new approach for instance selection: Algorithms, evaluation, and comparisons
| Published in | Expert Systems with Applications Vol. 149; p. 113297 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | New York: Elsevier Ltd, 01.07.2020; Elsevier BV |
| Subjects | |
| Online Access | Get full text |
| ISSN | 0957-4174 1873-6793 |
| DOI | 10.1016/j.eswa.2020.113297 |
| Summary: | •We design two new algorithms using global density, relevance, and irrelevance functions. •We develop a toolkit with a GUI and management and validation capabilities. •We evaluate and test the performance of our algorithms in terms of four metrics. •The experimental results show our algorithms outperform density-based approaches. •We test the scalability and experimentally establish the polynomial-time complexity of the algorithms.
Several approaches to instance selection have been put forward as a preliminary step to increase the efficiency and accuracy of algorithms applied to mine big data. Instance selection scales big data down by removing irrelevant, redundant, and unreliable instances, which in turn reduces the computational resources needed to complete the mining task. Local density-based approaches have recently been acknowledged as feasible in terms of reduction rate, effectiveness, and computation time. However, these approaches suffer from low classification accuracy compared with other approaches.
In this manuscript, we propose a new layered and operational approach that addresses these limitations and advances the state of the art by balancing classification accuracy, reduction rate, and time complexity. We begin by designing a new algorithm (called GDIS) that selects the most relevant instances using global density and relevance functions. Taking a global view over the whole data set enables better classification accuracy than current density-based approaches. We then design a second novel algorithm (called EGDIS), which maintains the effectiveness of GDIS while improving the reduction rate. Moreover, we compare our algorithms against three state-of-the-art algorithms to validate their performance. We develop a Java toolkit, called ISTK, on top of the GDIS and EGDIS algorithms, the density-based approaches, and the state-of-the-art algorithms, together with a user interface offering management and validation capabilities for ease of use and for visualizing results and data sets. We evaluate and test the performance of our algorithms in terms of four metrics (reduction rate, classification accuracy, effectiveness, and computation time) on twenty-four standard data sets through an intensive set of experiments. The experimental results show that GDIS outperforms the density-based approaches in classification accuracy and effectiveness, EGDIS outperforms the density-based approaches in reduction rate and effectiveness, and both GDIS and EGDIS outperform the state-of-the-art algorithms by achieving good results in both effectiveness and computation time. Finally, we test the scalability of our algorithms and experimentally establish their polynomial-time complexity. |
|---|---|
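To make the idea concrete, the following is a minimal sketch of instance selection driven by a global density and a relevance function, in the spirit the abstract describes. It is an illustration only, not the paper's actual GDIS algorithm: the density measure (inverse mean distance to all other instances, i.e. computed over the whole data set rather than a local neighbourhood), the relevance test (majority label agreement among the k nearest neighbours), and all function names are assumptions made for this sketch.

```python
# Illustrative sketch only -- NOT the paper's GDIS algorithm.
# Global density: inverse of the mean distance from an instance to every
# other instance (a "global view" over the whole data set).
# Relevance: an instance is kept if most of its k nearest neighbours
# share its class label; mislabelled/unreliable instances are dropped.
import math

def global_density(X, i):
    # inverse of the mean distance from instance i to all other instances
    dists = [math.dist(X[i], X[j]) for j in range(len(X)) if j != i]
    return 1.0 / (sum(dists) / len(dists))

def is_relevant(X, y, i, k=3):
    # relevant if a majority of the k nearest neighbours share i's label
    neighbours = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(X[i], X[j]))[:k]
    same_label = sum(1 for j in neighbours if y[j] == y[i])
    return same_label > k // 2

def select_instances(X, y, k=3):
    # keep the relevant instances, ranked by global density (densest first)
    kept = [i for i in range(len(X)) if is_relevant(X, y, i, k)]
    return sorted(kept, key=lambda i: -global_density(X, i))
```

For example, on two well-separated 2-D clusters with one mislabelled point planted inside the first cluster, the mislabelled point fails the relevance test (all of its nearest neighbours carry the other label) and is removed, while the remaining instances are retained.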