作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程 ›› 2007, Vol. 33 ›› Issue (14): 20-23. doi: 10.3969/j.issn.1000-3428.2007.14.007

• 博士论文 • 上一篇    下一篇

面向数据流的多层Count-Min概要数据结构

冯文峰1,2,郭 巧1,关志涛1,张治斌2   

  1. (1. 北京理工大学网络信息中心,北京 100081;2. 河南理工大学计算机科学与技术学院,焦作 454000)
  • 收稿日期:1900-01-01 修回日期:1900-01-01 出版日期:2007-07-20 发布日期:2007-07-20

Hierarchical Count-Min Sketch Data Structure for Data Streams

FENG Wenfeng1,2, GUO Qiao1, GUAN Zhitao1, ZHANG Zhibin2   

  1. (1. Network Information Center, Beijing Institute of Technology, Beijing 100081; 2. Computer Science and Technology School, Henan Polytechnic University, Jiaozuo 454000 )
  • Received:1900-01-01 Revised:1900-01-01 Online:2007-07-20 Published:2007-07-20

摘要: 构造了多层Count-Min概要数据结构来概括流数据中的层次结构。通过定义多层数据域U*上两两相互独立的异或哈希函数族,将数据流元组映射到L×D×W的三维计数数组,L是层次个数,D是从哈希函数族中均匀随机选取的哈希函数个数,W是哈希函数的值域。基于该结构,利用广度优先查询策略,查找多层频繁项集和估计多层频繁项值。实验表明,该结构在更新时间、存储空间和估计精度方面比直接堆叠多个Count-Min结构有较大的提高。

关键词: 数据流, 概要数据结构, 频繁项集, 随机算法, 多层结构

Abstract: A hierarchical Count-Min (HCM) sketch data structure is implemented to summarize the hierarchical structure in stream data. By designing and defining a XOR-based pair-wise independent family of Hash functions on the hierarchical domain U*, the sketch maps stream data items to a three dimensional array of counters of size L×D×W. Of the counter array, L is the number of layers in hierarchy, D is the number of uniformly and randomly chosen Hash functions, and W is the range of the Hash functions. The HCM sketch uses breadth-first search strategy to identify and evaluate the hierarchical frequent itemsets. The experiments demonstrate the improvement in update time, space usage and evaluating accuracy over the straightforward CM-based approach.

Key words: data stream, sketch data structure, frequent itemset, random algorithm, hierarchy

中图分类号: