
Computer Engineering ›› 2008, Vol. 34 ›› Issue (9): 64-66. doi: 10.3969/j.issn.1000-3428.2008.09.023

• Software Technology and Database •

Chinese Full-text Retrieval Simulation System Based on Compressed Inverted File

SONG Yi, GUO De-feng   

  1. (Department of Computer Science & Engineering, Shanghai Jiao Tong University, Shanghai 200240)
  • Online: 2008-05-05 Published: 2008-05-05

Abstract: This paper examines Chinese full-text retrieval based on compressed inverted files, covering data compression methods, file storage, search, and ranking mechanisms. A simulated Chinese full-text retrieval system is implemented in C++ with the STL, using ICTCLAS, a high-precision Chinese word segmentation system from the Chinese Academy of Sciences (CAS). The paper lists some of the key code and reports an experiment on data provided by the Sogou Lab. With an improved compression algorithm, the system's disk utilization improves by nearly 80%.

Key words: Chinese full-text retrieval, compressed inverted file, ranking
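
The abstract does not say which compression method the paper uses, so the following is only an illustrative sketch of one common way to compress an inverted file: each term's sorted document IDs are stored as gaps between consecutive IDs, and each gap is written with variable-byte coding. The function names `encodePostings` and `decodePostings` are hypothetical and do not come from the paper.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative only: variable-byte coding of document-ID gaps.
// This is an assumed technique, not the paper's actual algorithm.
// Each gap is split into 7-bit groups, low-order group first;
// the high bit (0x80) marks the last byte of a gap.
std::vector<std::uint8_t> encodePostings(const std::vector<std::uint32_t>& docIds) {
    std::vector<std::uint8_t> out;
    std::uint32_t prev = 0;
    for (std::uint32_t id : docIds) {
        assert(id >= prev);              // IDs must be sorted in ascending order
        std::uint32_t gap = id - prev;   // the first gap is the first ID itself
        prev = id;
        while (gap >= 128) {
            out.push_back(static_cast<std::uint8_t>(gap & 0x7F));
            gap >>= 7;
        }
        out.push_back(static_cast<std::uint8_t>(gap | 0x80));
    }
    return out;
}

// Reverse of encodePostings: rebuild each gap, then sum the gaps
// to recover the original document IDs.
std::vector<std::uint32_t> decodePostings(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint32_t> docIds;
    std::uint32_t gap = 0;
    std::uint32_t prev = 0;
    int shift = 0;
    for (std::uint8_t b : bytes) {
        gap |= static_cast<std::uint32_t>(b & 0x7F) << shift;
        if (b & 0x80) {                  // last byte of this gap
            prev += gap;
            docIds.push_back(prev);
            gap = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    return docIds;
}
```

Under this scheme, small gaps (frequent terms whose documents lie close together) take one byte each instead of a fixed four, which is where the disk savings of a compressed inverted file typically come from.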
