Abstract: The Chinese analyzers bundled with Lucene segment text poorly, and choosing among third-party analyzers is difficult. To address this, the paper studies several Lucene-based Chinese analyzers and compares them experimentally on sentence segmentation, segmentation speed, the space and time needed to build an index, retrieval results, and retrieval speed. The results show that, within the Lucene framework, the dictionary-based Paoding analyzer has the best overall performance, Lucene's built-in unigram (one-character) analyzer has the fastest segmentation speed, and the imdict and ICTCLAS4J analyzers leave some room for improvement in algorithmic efficiency.
Key words: Lucene framework, search engine, Chinese segmentation, analyzer, segmentation speed, index, retrieval
Chinese Library Classification (CLC) number:
YI Tian-Peng (义天鹏), CHEN Qi-An (陈启安). Comparison Research of Segmentation Performance for Chinese Analyzers Based on Lucene[J]. Computer Engineering (计算机工程), 2012, 38(22): 279-282.
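The abstract contrasts one-character (unigram) segmentation with dictionary-based segmentation and compares analyzers by segmentation speed. The following is a minimal Python sketch of that contrast, not code from the paper: the dictionary side uses plain forward maximum matching as a stand-in and does not reproduce Paoding's actual algorithm, and the names `unigram_tokens`, `forward_max_match`, `chars_per_second`, `TOY_DICTIONARY`, and `SAMPLE` are hypothetical. Timings from such a toy say nothing about the real analyzers.

```python
import time


def unigram_tokens(text):
    """Emit every non-space character as its own token (one-character segmentation)."""
    return [ch for ch in text if not ch.isspace()]


def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching against a set of known words.

    At each position, take the longest dictionary word starting there;
    fall back to a single character when nothing longer matches.
    This is an illustrative stand-in, not Paoding's algorithm.
    """
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def chars_per_second(tokenize, text, repeats=200):
    """Rough throughput: characters processed per second of wall-clock time."""
    start = time.perf_counter()
    for _ in range(repeats):
        tokenize(text)
    elapsed = time.perf_counter() - start
    return len(text) * repeats / elapsed if elapsed > 0 else float("inf")


# Hypothetical toy dictionary and sentence, chosen only for illustration.
TOY_DICTIONARY = {"中文", "分词", "搜索", "引擎", "搜索引擎"}
SAMPLE = "中文分词搜索引擎"

unigram_result = unigram_tokens(SAMPLE)
dictionary_result = forward_max_match(SAMPLE, TOY_DICTIONARY)
```

Calling `chars_per_second(unigram_tokens, SAMPLE)` and `chars_per_second(lambda t: forward_max_match(t, TOY_DICTIONARY), SAMPLE)` gives a rough throughput figure for each style. The unigram version does no lookups, which is consistent with the abstract's observation that the one-character analyzer segments fastest, while the dictionary version yields fewer, longer tokens.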