Text Retrieval Model Based on Counting Bloom Filter

doi:10.3969/j.issn.1000-3428.2014.02.013

Computer Engineering

FENG Jia-jun, WANG Xiao-lin, TIAN Qing

(College of Computer Science and Technology, Shandong University, Jinan 250101, China)

Received: 2012-12-28 Online: 2014-02-15 Published: 2014-02-13
About the authors: FENG Jia-jun (b. 1981), male, engineer, master's student, CCF member, main research interest: information processing; WANG Xiao-lin, associate professor; TIAN Qing, senior engineer
Funding:
Natural Science Foundation of Shandong Province (ZR2009GM021)

Abstract: Distributed text retrieval systems struggle to combine efficient data retrieval with low-cost index maintenance. To address this, the paper proposes CBFTRM, a text retrieval model based on the Counting Bloom Filter (CBF). The model divides physical nodes into data nodes and index nodes, and organizes each group as a structured P2P overlay network. Each data node stores documents and maintains the corresponding inverted index. It also computes a CBF value from the keyword set of that inverted index and sends the value to the corresponding index node. Each index node builds a search tree whose leaf nodes hold the feature information (including the CBF value) of a subset of the data nodes and whose internal nodes hold the results computed from those CBF values, and it maintains the tree whenever a leaf node changes. Simulation results show that the model locates documents quickly, incurs little communication traffic for index maintenance, and achieves high precision.

Key words: Counting Bloom Filter (CBF), search tree, structured P2P, text retrieval, inverted index
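
The abstract does not give implementation details, so the following is only a minimal Python sketch of the two ideas it names: a counting Bloom filter built from a data node's keyword set, and a search tree whose internal nodes summarize the filters below them. Everything concrete here is an assumption made for illustration and is not taken from the paper: the class and function names, the salted SHA-256 hashing, the filter size, the binary tree shape, and the choice of element-wise counter addition as the operation that produces internal-node values.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter: counters instead of bits, so keywords can be removed."""

    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, keyword):
        # Assumption: derive the hash positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{keyword}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, keyword):
        for pos in self._positions(keyword):
            self.counters[pos] += 1

    def remove(self, keyword):
        # Refuse to decrement for a keyword the filter does not appear to contain.
        if keyword not in self:
            return False
        for pos in self._positions(keyword):
            self.counters[pos] -= 1
        return True

    def __contains__(self, keyword):
        # May report false positives, never false negatives.
        return all(self.counters[pos] > 0 for pos in self._positions(keyword))

    def merged_with(self, other):
        # Assumption: internal nodes use the element-wise sum of child counters.
        merged = CountingBloomFilter(self.size, self.num_hashes)
        merged.counters = [a + b for a, b in zip(self.counters, other.counters)]
        return merged


class TreeNode:
    def __init__(self, cbf, children=(), data_node_id=None):
        self.cbf = cbf
        self.children = list(children)
        self.data_node_id = data_node_id  # set only on leaf nodes


def build_tree(leaves):
    """Pair nodes level by level; each parent holds its children's merged filter."""
    level = list(leaves)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            group = level[i:i + 2]
            if len(group) == 1:
                next_level.append(group[0])  # odd node moves up unchanged
            else:
                merged = group[0].cbf.merged_with(group[1].cbf)
                next_level.append(TreeNode(merged, group))
        level = next_level
    return level[0]


def candidate_data_nodes(node, keyword):
    """Descend only into subtrees whose filter may contain the keyword."""
    if keyword not in node.cbf:
        return []
    if not node.children:
        return [node.data_node_id]
    found = []
    for child in node.children:
        found.extend(candidate_data_nodes(child, keyword))
    return found


# Hypothetical data nodes and the keyword sets of their inverted indexes.
inverted_indexes = {
    "data-node-A": ["bloom", "filter", "retrieval"],
    "data-node-B": ["chord", "overlay"],
    "data-node-C": ["inverted", "index", "retrieval"],
}
leaves = []
for node_id, keywords in inverted_indexes.items():
    cbf = CountingBloomFilter()
    for kw in keywords:
        cbf.add(kw)
    leaves.append(TreeNode(cbf, data_node_id=node_id))

root = build_tree(leaves)
hits = candidate_data_nodes(root, "retrieval")
print(sorted(hits))
```

Because a Bloom filter never yields false negatives, the result always includes `data-node-A` and `data-node-C`; a false positive for another node is possible but unlikely at this filter size. In this sketch, a change to a leaf's filter would require recomputing the merged filters on the path up to the root. The abstract does not describe the paper's actual maintenance procedure.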

CLC number: TP311.13

FENG Jia-jun, WANG Xiao-lin, TIAN Qing. Text Retrieval Model Based on Counting Bloom Filter[J]. Computer Engineering.

https://www.ecice06.com/CN/Y2014/V40/I2/58


