作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程 ›› 2008, Vol. 34 ›› Issue (10): 83-85. doi: 10.3969/j.issn.1000-3428.2008.10.030

• 软件技术与数据库 • 上一篇    下一篇

基于压缩后缀数组技术的搜索引擎

姚全珠,张 楠,杨增辉,田 元   

  1. (西安理工大学计算机学院,西安 710048)
  • 收稿日期:1900-01-01 修回日期:1900-01-01 出版日期:2008-05-20 发布日期:2008-05-20

Search Engine Technology Based on Compressed Suffix Array

YAO Quan-zhu, ZHANG Nan, YANG Zeng-hui, TIAN Yuan   

  1. (School of Computer Science, Xi’an University of Technology, Xi’an 710048)
  • Received:1900-01-01 Revised:1900-01-01 Online:2008-05-20 Published:2008-05-20

摘要: 目前,搜索引擎的核心模块(索引器)均采用倒排文件结构,对短语查询的准确率较低。该文引入后缀数组技术进行全文索引,为克服全文索引时占用空间大的缺点,研究了压缩后缀数组技术,把后缀数组索引的大小压缩到了O(n)位,并给出应用压缩后缀数组索引的步骤和核心操作伪代码。对比实验表明,基于压缩后缀数组的索引比传统倒排文件索引的短语查准率提高了近20%。

关键词: 压缩后缀数组, 倒排文件, 后缀数组, 搜索引擎

Abstract: The core module of search engines, namely indexer, is usually based on inverted file. But this solution to solve phrase-search is in difficulty(the lower hitting rate). In this paper the Suffix Array(SA) are employed for full-text indexing. In order to overcome the disadvantage of large memory cost as with full-text indexing, research is done for Compressed Suffix Array(CSA). The paper presents the step of using CSA index and the false code of core operate. The experiments show that this technique, compared with inverted file, improves the hitting rate for phrase by 20%.

Key words: Compressed Suffix Array(CSA), inverted file, Suffix Array(SA), search engine

中图分类号: