A cost model and index architecture for the similarity join

Bibliographic Details
Published in: Proceedings 17th International Conference on Data Engineering, pp. 411-420
Main Authors: Bohm, C.; Kriegel, H.-P.
Format: Conference Proceeding
Language: English
Published: IEEE, 2001
ISBN: 0769510019, 9780769510019
ISSN: 1063-6382
DOI: 10.1109/ICDE.2001.914854

Summary: The similarity join is an important database primitive which has been successfully applied to speed up data mining algorithms. In the similarity join, two point sets of a multidimensional vector space are combined such that the result contains all point pairs where the distance does not exceed a parameter ε. Due to its high practical relevance, many similarity join algorithms have been devised. We propose an analytical cost model for the similarity join operation based on indexes. Our problem analysis reveals a serious optimization conflict between CPU time and I/O time: fine-grained index structures are beneficial for CPU efficiency, but deteriorate the I/O performance. As a consequence of this observation, we propose a new index architecture and join algorithm which allow a separate optimization of CPU time and I/O time. Our solution utilizes large pages which are optimized for I/O processing. The pages accommodate a search structure which minimizes the computational effort. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
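
The join definition in the summary can be illustrated with a minimal sketch. It is a naive nested-loop version that assumes Euclidean distance, which the summary does not specify, and it is not the index-based algorithm the paper proposes. The names `similarity_join`, `r`, `s`, and `eps` are hypothetical and chosen only for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+
from typing import List, Sequence, Tuple

Point = Sequence[float]


def similarity_join(
    r: Sequence[Point], s: Sequence[Point], eps: float
) -> List[Tuple[Point, Point]]:
    """Return every pair (p, q), p from r and q from s, whose distance is <= eps.

    Naive nested loop: it compares all |r| * |s| pairs, which is the cost
    that index-based join algorithms aim to reduce.
    Assumption: Euclidean distance is used as the metric.
    """
    return [(p, q) for p in r for q in s if dist(p, q) <= eps]


# Two small 2-dimensional point sets (illustrative data only).
r = [(0.0, 0.0), (1.0, 1.0)]
s = [(0.0, 0.5), (3.0, 3.0)]

pairs = similarity_join(r, s, eps=1.0)
print(pairs)
```

Every point of one set is compared with every point of the other, so the running time grows with the product of the set sizes. This quadratic comparison cost is the reason index structures are used for the join in practice.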