Computer Engineering, 2012, Vol. 38, Issue (22): 279-282. doi: 10.3969/j.issn.1000-3428.2012.22.070

• Development Research and Design Technology •

Comparison Research of Segmentation Performance for Chinese Analyzers Based on Lucene

YI Tian-peng, CHEN Qi-an

  1. (Department of Computer Science, Xiamen University, Xiamen 361005, China)
  • Received: 2011-12-18 Revised: 2012-03-19 Online: 2012-11-20 Published: 2012-11-17
  • About the authors: YI Tian-peng (born 1987), male, master's student; main research interests: Web information processing, search engines, software engineering. CHEN Qi-an, professor.
  • Funding:
    Supported by the Aeronautical Science Foundation of China (20085568013)

Abstract: The Chinese analyzers bundled with Lucene segment text poorly, and it is hard to choose among the third-party alternatives. To address this, the paper studies several Chinese analyzers that work with Lucene and compares them experimentally on sentence segmentation, segmentation speed, index-building space and time, retrieval results, and retrieval speed. The results show that, within the Lucene framework, the dictionary-based Paoding analyzer has the best overall performance, Lucene's built-in unigram (single-character) analyzer has the fastest segmentation speed, and the imdict and ICTCLAS4J analyzers leave some room for improvement in algorithmic efficiency.

Key words: Lucene framework, search engine, Chinese segmentation, analyzer, segmentation speed, index, retrieval
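
To make the contrast between single-character and dictionary-based segmentation concrete, here is a minimal sketch in plain Python. It does not use Lucene or any of the analyzers named in the abstract: `unigram_segment`, `forward_max_match`, and `TOY_DICTIONARY` are hypothetical names invented for illustration, and forward maximum matching is only one simple dictionary strategy, not necessarily the algorithm Paoding implements.

```python
def unigram_segment(text):
    """Split text into single-character tokens, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy longest-match segmentation against a set of known words.

    Illustrative only: real dictionary-based analyzers use larger
    dictionaries and more elaborate matching rules.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real analyzer ships a far larger one.
TOY_DICTIONARY = {"中文", "分词", "搜索", "引擎", "搜索引擎"}
sentence = "中文分词搜索引擎"

print(unigram_segment(sentence))
print(forward_max_match(sentence, TOY_DICTIONARY))
```

On this toy input the unigram splitter emits one token per character, while the dictionary matcher emits fewer, longer tokens at the cost of dictionary lookups per position. That is the kind of trade-off between segmentation speed and token quality the comparison examines.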

CLC Number: