Dual incremental fuzzy schemes for frequent itemsets discovery in streaming numeric data

Bibliographic Details
Published in: Information Sciences, Vol. 514, pp. 15-43
Main Authors: Zheng, Hui; Li, Peng; Liu, Qing; Chen, Jinjun; Huang, Guangli; Wu, Junfeng; Xue, Yun; He, Jing
Format: Journal Article
Language: English
Published: Elsevier Inc, 01.04.2020
ISSN: 0020-0255, 1872-6291
DOI: 10.1016/j.ins.2019.11.023

Summary:
• There is no need to re-visit previous batches of numeric data.
• The time consumed and the estimated error of the two proposed schemes remain stable as the amount of data increases, and both are much lower than those of the traditional method.
• Approximate support values of itemsets are proved in this paper to converge as the amount of streaming data increases, and this convergence is also verified on synthetic and real datasets.
• The errors of the approximate support values are also verified to converge to zero as the amount of streaming data increases, which means that the approximate support values converge to their corresponding real support values.

Discovering frequent itemsets is essential for finding association rules, yet it is too computationally expensive with existing algorithms. Finding frequent itemsets in streaming numeric data is even more challenging. The streaming characteristic means that the data cannot be scanned repeatedly. The numeric characteristic means that the data must be pre-processed into itemsets; for example, fuzzy-set methods can transform numeric data into itemsets with non-integer membership values. As a result, the frequencies of itemsets are usually not integers. To overcome these challenges, fast methods and stream-processing methods have been applied. However, the existing algorithms usually either still need to re-visit some previous data multiple times or cannot count non-integer frequencies. Algorithms that re-visit previous data must devote large amounts of memory to caching that data in order to avoid repeated scanning. With today's big streaming data, this memory requirement often exceeds the capacity of many computers. Algorithms that cannot count non-integer frequencies become very inaccurate when they estimate the non-integer frequencies of frequent itemsets through integer approximations of frequency counting.
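To illustrate why fuzzy pre-processing produces non-integer frequencies, the following is a minimal sketch and not the paper's discretization method. It assumes triangular membership functions, and the attribute name `temp`, the labels and the breakpoints in `FUZZY_SETS` are hypothetical values chosen for illustration.

```python
def triangular(x, left, peak, right):
    """Membership degree of x in a triangular fuzzy set (0 outside left..right)."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets for one numeric attribute: label -> (left, peak, right).
FUZZY_SETS = {
    "low": (-10.0, 0.0, 20.0),
    "mid": (0.0, 20.0, 40.0),
    "high": (20.0, 40.0, 50.0),
}


def fuzzify(attribute, value):
    """Turn one numeric reading into fuzzy items with non-integer memberships."""
    items = {}
    for label, (left, peak, right) in FUZZY_SETS.items():
        degree = triangular(value, left, peak, right)
        if degree > 0.0:
            items[f"{attribute}.{label}"] = degree
    return items


# One reading contributes fractional amounts to two items at once.
print(fuzzify("temp", 25.0))

# Summing memberships over a stream gives a non-integer "count" for an item.
readings = [25.0, 30.0, 10.0]
mid_count = sum(fuzzify("temp", r).get("temp.mid", 0.0) for r in readings)
print(mid_count)
```

Under these assumed breakpoints, a single reading belongs partly to two neighbouring fuzzy sets, so an integer counter would have to round each contribution up or down.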
To solve these issues, in this paper we propose two incremental schemes for frequent itemset discovery that work efficiently with streaming numeric data. In particular, they can count non-integer frequencies without re-visiting any previous data. The key to their efficiency is extracting, from the ongoing stream, essential statistics that occupy much less memory than the raw data. This gives our schemes three advantages: 1) they allow non-integer counting and thus integrate naturally with a fuzzy-set discretization method, which improves robustness and noise tolerance for numeric data; 2) they enable the design of a decay ratio for different data distributions, which can be adapted to three general stream models: landmark, damped and sliding windows; and 3) they achieve highly accurate fuzzy itemset discovery with efficient stream processing. Experimental studies on both synthetic and real-world datasets demonstrate the efficiency and effectiveness of our dual schemes.
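The general idea of keeping compact statistics instead of raw batches can be sketched as follows. This is an illustrative toy under stated assumptions, not either of the paper's two schemes: the class name `IncrementalFuzzySupport`, the use of the minimum membership as an itemset's degree in a transaction, the per-batch decay and the `max_size` limit are all assumptions made for the example. It covers only landmark (`decay=1.0`) and damped (`0 < decay < 1`) behaviour; a sliding window would need extra bookkeeping that is not shown.

```python
from itertools import combinations


class IncrementalFuzzySupport:
    """Running fuzzy support counts that never re-read earlier batches."""

    def __init__(self, decay=1.0, max_size=2):
        self.decay = decay        # assumed per-batch decay ratio
        self.max_size = max_size  # largest itemset size tracked
        self.weight = 0.0         # decayed number of transactions seen
        self.counts = {}          # sorted item tuple -> decayed fuzzy count

    def update(self, batch):
        """Fold one batch of fuzzy transactions into the statistics."""
        # Age the existing statistics once per batch.
        self.weight *= self.decay
        for key in self.counts:
            self.counts[key] *= self.decay
        # Each transaction maps item -> membership degree in (0, 1].
        for transaction in batch:
            self.weight += 1.0
            items = sorted(transaction)
            for size in range(1, self.max_size + 1):
                for itemset in combinations(items, size):
                    # Assumption: itemset degree is the minimum membership.
                    degree = min(transaction[item] for item in itemset)
                    self.counts[itemset] = self.counts.get(itemset, 0.0) + degree

    def support(self, itemset):
        """Decayed fuzzy count divided by the decayed transaction weight."""
        if self.weight == 0.0:
            return 0.0
        return self.counts.get(tuple(sorted(itemset)), 0.0) / self.weight

    def frequent(self, min_support):
        """Itemsets whose current support reaches min_support."""
        return {k: self.support(k) for k in self.counts
                if self.support(k) >= min_support}


batch1 = [{"a": 0.5, "b": 1.0}, {"a": 1.0}]
batch2 = [{"a": 0.5, "b": 0.5}]

landmark = IncrementalFuzzySupport(decay=1.0)
damped = IncrementalFuzzySupport(decay=0.5)
for batch in (batch1, batch2):
    landmark.update(batch)  # each batch can be discarded after this call
    damped.update(batch)

print(landmark.support(["a", "b"]), damped.support(["a", "b"]))
```

Only the dictionary of counts and one weight are retained between batches, so memory grows with the number of tracked itemsets rather than with the length of the stream.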