摘要: 构造了多层Count-Min概要数据结构来概括流数据中的层次结构。通过定义多层数据域U*上两两相互独立的异或哈希函数族,将数据流元组映射到L×D×W的三维计数数组,L是层次个数,D是从哈希函数族中均匀随机选取的哈希函数个数,W是哈希函数的值域。基于该结构,利用广度优先查询策略,查找多层频繁项集和估计多层频繁项值。实验表明,该结构在更新时间、存储空间和估计精度方面比直接堆叠多个Count-Min结构有较大的提高。
关键词:
数据流,
概要数据结构,
频繁项集,
随机算法,
多层结构
Abstract: A hierarchical Count-Min (HCM) sketch data structure is implemented to summarize the hierarchical structure in stream data. By designing and defining a XOR-based pair-wise independent family of Hash functions on the hierarchical domain U*, the sketch maps stream data items to a three dimensional array of counters of size L×D×W. Of the counter array, L is the number of layers in hierarchy, D is the number of uniformly and randomly chosen Hash functions, and W is the range of the Hash functions. The HCM sketch uses breadth-first search strategy to identify and evaluate the hierarchical frequent itemsets. The experiments demonstrate the improvement in update time, space usage and evaluating accuracy over the straightforward CM-based approach.
Key words:
data stream,
sketch data structure,
frequent itemset,
random algorithm,
hierarchy
中图分类号:
冯文峰;郭 巧;关志涛;张治斌. 面向数据流的多层Count-Min概要数据结构[J]. 计算机工程, 2007, 33(14): 20-23.
FENG Wenfeng; GUO Qiao; GUAN Zhitao; ZHANG Zhibin. Hierarchical Count-Min Sketch Data Structure for Data Streams[J]. Computer Engineering, 2007, 33(14): 20-23.