
Computer Engineering ›› 2011, Vol. 37 ›› Issue (16): 39-41. doi: 10.3969/j.issn.1000-3428.2011.16.013

• Software Technology and Database •

Design and Implementation of Search Engine Based on Lucene

ZHAO Ke, LU Peng, LI Yong-qiang

  1. School of Electrical Engineering, Zhengzhou University, Zhengzhou 450001, China
  • Received: 2011-02-18 Online: 2011-08-20 Published: 2011-08-20
  • About the authors: ZHAO Ke (born 1988), male, undergraduate student; main research interests: software engineering, Web information processing, data mining. LU Peng (corresponding author), associate professor, Ph.D. LI Yong-qiang, master's student.
  • Supported by:
    National Natural Science Foundation of China (60841004, 60971110); Zhengzhou University Innovative Experiment Fund (2009cxsy100)

Abstract: The China Education and Research Network (CERNET) hosts a very large number of File Transfer Protocol (FTP) resources, which are difficult to search. To address this problem, an FTP search engine based on EdtFTPJ and Lucene is designed and implemented. The overall design follows the Model-View-Controller (MVC) pattern built on the Struts 1.2 framework. The data acquisition module uses a finite state automaton based on regular expressions to crawl data, the index module uses the inverted index method, and word segmentation uses dictionary-based forward maximum matching for Chinese. Experimental results show that the scheme achieves a high resource retrieval rate while preserving the accuracy of the retrieval results.

Key words: File Transfer Protocol (FTP) search engine, Lucene framework, Model-View-Controller (MVC), finite state automaton, inverted index
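
The abstract names two techniques that are easy to illustrate in a few lines: dictionary-based forward maximum matching for Chinese segmentation, and an inverted index over the resulting tokens. The following is a minimal Python sketch of both, not the authors' code (their system is built on Lucene in Java). The function names, the `max_len` limit, the toy dictionary, and the file names are all assumptions made for illustration.

```python
from collections import defaultdict


def forward_max_match(text, dictionary, max_len=4):
    """Segment text by taking the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as the fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def build_inverted_index(documents, dictionary):
    """Map each token to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, name in documents.items():
        for token in forward_max_match(name, dictionary):
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


def search(index, query, dictionary):
    """Return ids of documents that contain every token of the query."""
    tokens = forward_max_match(query, dictionary)
    if not tokens:
        return []
    result = set(index.get(tokens[0], []))
    for token in tokens[1:]:
        result &= set(index.get(token, []))
    return sorted(result)


# Hypothetical toy dictionary and FTP file names, for illustration only.
DICTIONARY = {"搜索", "引擎", "搜索引擎", "设计", "教程"}
FILES = {1: "搜索引擎设计", 2: "引擎教程", 3: "搜索教程"}
INDEX = build_inverted_index(FILES, DICTIONARY)
```

In this toy setup, file 1 is indexed under the longer word 搜索引擎, so a query for 搜索 alone matches only file 3. This is a known trade-off of maximum matching that a production system would need to handle, for example by also indexing sub-words.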

CLC Number: