Locality sensitive hashing for sampling-based algorithms in association rule mining

Bibliographic Details
Published in	Expert systems with applications Vol. 38; no. 10; pp. 12388 - 12397
Main Authors	Chen, Chyouhwa, Horng, Shi-Jinn, Huang, Chin-Pin
Format	Journal Article
Language	English
Published	Elsevier Ltd 15.09.2011
Subjects	Algorithms; Association rule mining; Clustering; Clusters; Locality-sensitive hashing; Mining; Outlier removal; Samples; Sampling; Statistical analysis; Statistical methods
ISSN	0957-4174 1873-6793
DOI	10.1016/j.eswa.2011.04.018

Summary:	► A novel sampling approach with association rule mining can process very large databases in a reasonable time. ► Outliers are removed based on local properties of individual clusters. ► The proposed algorithms are shown to exhibit better accuracy or execution time than previously proposed algorithms. Association rule mining is one of the most important techniques for intelligent system design and has been widely applied in a large number of real applications. However, classical mining algorithms cannot process very large databases in a reasonable amount of time. The sampling approach, which processes a subset of the whole database, is a viable alternative. Obviously, such an approach cannot extract perfectly accurate rules. Previous work has tried to improve the accuracy by removing “outliers” from the initial sample based on global statistical properties of the sample. In this paper, we take the view that the initial sample may actually consist of multiple, possibly overlapping, subsets or clusters. It is more reasonable to apply data clustering techniques to the initial sample before outlier removal is performed on the resulting clusters, so that outliers are removed based on local properties of individual clusters. However, clustering transactional data with very high dimensions is a difficult problem in itself. We solve this problem by interpreting locality-sensitive hashing as a means of data clustering. Previously proposed algorithms may then optionally be used to remove the outliers in the individual clusters. We propose several concrete algorithms based on this general strategy. Using an extensive set of synthetic and real datasets, we evaluate our proposed algorithms and find that they exhibit better accuracy or execution time, or both, than previously proposed algorithms.
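
The general strategy in the summary (draw a sample, group its transactions with locality-sensitive hashing, then remove outliers cluster by cluster) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: this record does not specify the hash family or the outlier criterion, so the sketch assumes MinHash signatures as the LSH scheme (two sets agree on one min-hash value with probability approximately equal to their Jaccard similarity) and a hypothetical per-cluster rule that drops transactions sharing too little with the cluster's frequent items. All function names, parameters, and thresholds are illustrative assumptions.

```python
import random
import zlib
from collections import Counter, defaultdict

PRIME = 2_147_483_647  # modulus for the (a*x + b) mod p hash family (assumed choice)


def make_hashes(k, seed=0):
    """Draw k (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]


def minhash_signature(transaction, hashes):
    """MinHash signature of a non-empty set of item strings."""
    # crc32 gives a stable integer code per item, unlike the salted built-in hash().
    codes = [zlib.crc32(item.encode("utf-8")) for item in transaction]
    return tuple(min((a * c + b) % PRIME for c in codes) for a, b in hashes)


def lsh_clusters(transactions, k=2, seed=0):
    """Treat each LSH bucket (identical signature) as one cluster."""
    hashes = make_hashes(k, seed)
    buckets = defaultdict(list)
    for t in transactions:
        if t:  # empty transactions carry no items and are skipped
            buckets[minhash_signature(t, hashes)].append(t)
    return list(buckets.values())


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def remove_local_outliers(cluster, min_similarity=0.2):
    """Hypothetical local criterion: keep transactions similar to the cluster core.

    The core is the set of items present in at least half of the cluster's
    transactions; this stands in for whichever outlier-removal method is used.
    """
    counts = Counter(item for t in cluster for item in t)
    core = {item for item, c in counts.items() if 2 * c >= len(cluster)}
    if not core:
        return list(cluster)
    return [t for t in cluster if jaccard(set(t), core) >= min_similarity]


def refine_sample(transactions, k=2, seed=0, min_similarity=0.2):
    """Cluster the initial sample with LSH, then remove outliers per cluster."""
    refined = []
    for cluster in lsh_clusters(transactions, k=k, seed=seed):
        refined.extend(remove_local_outliers(cluster, min_similarity))
    return refined
```

Raising `k` makes buckets stricter (fewer, purer clusters), while lowering it merges more transactions into each cluster; the refined sample returned by `refine_sample` would then be passed to an ordinary association rule miner.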