摘要: 目前,搜索引擎的核心模块(索引器)均采用倒排文件结构,对短语查询的准确率较低。该文引入后缀数组技术进行全文索引,为克服全文索引时占用空间大的缺点,研究了压缩后缀数组技术,把后缀数组索引的大小压缩到了O(n)位,并给出应用压缩后缀数组索引的步骤和核心操作伪代码。对比实验表明,基于压缩后缀数组的索引比传统倒排文件索引的短语查准率提高了近20%。
关键词:
压缩后缀数组,
倒排文件,
后缀数组,
搜索引擎
Abstract: The core module of search engines, namely indexer, is usually based on inverted file. But this solution to solve phrase-search is in difficulty(the lower hitting rate). In this paper the Suffix Array(SA) are employed for full-text indexing. In order to overcome the disadvantage of large memory cost as with full-text indexing, research is done for Compressed Suffix Array(CSA). The paper presents the step of using CSA index and the false code of core operate. The experiments show that this technique, compared with inverted file, improves the hitting rate for phrase by 20%.
Key words:
Compressed Suffix Array(CSA),
inverted file,
Suffix Array(SA),
search engine
中图分类号:
姚全珠;张 楠;杨增辉;田 元. 基于压缩后缀数组技术的搜索引擎[J]. 计算机工程, 2008, 34(10): 83-85.
YAO Quan-zhu; ZHANG Nan; YANG Zeng-hui; TIAN Yuan. Search Engine Technology Based on Compressed Suffix Array[J]. Computer Engineering, 2008, 34(10): 83-85.