A new approach for instance selection: Algorithms, evaluation, and comparisons
| Published in | Expert Systems with Applications Vol. 149; p. 113297 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | New York: Elsevier Ltd, 01.07.2020; Elsevier BV |
| Subjects | |
| Online Access | Get full text |
| ISSN | 0957-4174 1873-6793 |
| DOI | 10.1016/j.eswa.2020.113297 |
| Summary: | •We design two new algorithms using global density, relevance, and irrelevance functions. •We develop a toolkit with a GUI and management and validation capabilities. •We evaluate and test the performance of our algorithms in terms of four metrics. •The experimental results show our algorithms outperform density-based approaches. •We test the scalability and experimentally establish the polynomial-time complexity of the algorithms.
Several approaches to instance selection have been put forward as a preliminary step to increase the efficiency and accuracy of algorithms applied to mine big data. Instance selection scales big data down by removing irrelevant, redundant, and unreliable instances, which in turn reduces the computational resources needed to complete the mining task. Local density-based approaches have recently been acknowledged as feasible in terms of reduction rate, effectiveness, and computation time. However, these approaches suffer from low classification accuracy compared with other approaches.
In this manuscript, we propose a new layered and operational approach that addresses these limitations and advances the state of the art by balancing classification accuracy, reduction rate, and time complexity. We begin by designing a new algorithm (called GDIS) that selects the most relevant instances using global density and relevance functions. Taking a global view over the whole data set enables better classification accuracy than current density-based approaches. We then design a second novel algorithm (called EGDIS), which maintains the effectiveness of GDIS while improving the reduction rate. Moreover, we compare our algorithms against three state-of-the-art algorithms to validate their performance. We develop a Java toolkit, called ISTK, on top of the GDIS and EGDIS algorithms, the density-based approaches, and the state-of-the-art algorithms, together with a user interface offering management and validation capabilities for ease of use and for visualizing results and data sets. We evaluate and test the performance of our algorithms in terms of four metrics (reduction rate, classification accuracy, effectiveness, and computation time) on twenty-four standard data sets through an intensive set of experiments. The experimental results show that GDIS outperforms the density-based approaches in classification accuracy and effectiveness, EGDIS outperforms the density-based approaches in reduction rate and effectiveness, and both GDIS and EGDIS outperform the state-of-the-art algorithms by achieving good results in both effectiveness and computation time. Finally, we test the scalability of our algorithms and experimentally establish their polynomial-time complexity. |
|---|---|
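To make the idea concrete, the following is a minimal sketch of instance selection driven by a global density and a relevance function, in the spirit the abstract describes. It is an illustration only, not the paper's actual GDIS algorithm: the density measure (inverse mean distance to all other instances, i.e. computed over the whole data set rather than a local neighbourhood), the relevance test (majority label agreement among the k nearest neighbours), and all function names are assumptions made for this sketch.

```python
# Illustrative sketch only -- NOT the paper's GDIS algorithm.
# Global density: inverse of the mean distance from an instance to every
# other instance (a "global view" over the whole data set).
# Relevance: an instance is kept if most of its k nearest neighbours
# share its class label; mislabelled/unreliable instances are dropped.
import math

def global_density(X, i):
    # inverse of the mean distance from instance i to all other instances
    dists = [math.dist(X[i], X[j]) for j in range(len(X)) if j != i]
    return 1.0 / (sum(dists) / len(dists))

def is_relevant(X, y, i, k=3):
    # relevant if a majority of the k nearest neighbours share i's label
    neighbours = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: math.dist(X[i], X[j]))[:k]
    same_label = sum(1 for j in neighbours if y[j] == y[i])
    return same_label > k // 2

def select_instances(X, y, k=3):
    # keep the relevant instances, ranked by global density (densest first)
    kept = [i for i in range(len(X)) if is_relevant(X, y, i, k)]
    return sorted(kept, key=lambda i: -global_density(X, i))
```

For example, on two well-separated 2-D clusters with one mislabelled point planted inside the first cluster, the mislabelled point fails the relevance test (all of its nearest neighbours carry the other label) and is removed, while the remaining instances are retained.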