
Computer Engineering ›› 2008, Vol. 34 ›› Issue (2): 199-201. doi: 10.3969/j.issn.1000-3428.2008.02.066

• Artificial Intelligence and Recognition Technology •

Chinese Information Retrieval Based on Probabilistic Latent Semantic Analysis

LUO Jing (罗景), TU Xin-hui (涂新辉)

  1. (School of Computer Science, Wuhan University of Science and Technology, Wuhan 430065)
  • Received:1900-01-01 Revised:1900-01-01 Online:2008-01-20 Published:2008-01-20

Abstract: Traditional information retrieval models treat each index word as an isolated unit and ignore the many synonyms and polysemous words in natural language, which degrade recall and precision respectively. Probabilistic latent semantic analysis is a novel approach to automated document indexing, based on a statistical latent class model for factor analysis of count data. It estimates the probability distributions that link documents, latent semantics, and words, and uses these distributions for retrieval. This paper applies the approach to the Chinese information retrieval task. Experimental results indicate that the model based on probabilistic latent semantic analysis achieves significantly higher average precision than the traditional vector space model.

Key words: probabilistic latent semantic analysis, Chinese information retrieval, index strategies, key phrase extraction
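
The "document, latent semantic, word" relation described in the abstract can be illustrated with a small sketch. It assumes the standard asymmetric PLSA formulation, P(w|d) = Σ_z P(w|z) P(z|d), fitted by expectation-maximization on a document-term count matrix. The function names (`plsa`, `word_given_doc`, `query_log_score`), the toy counts, and the query-likelihood scoring are illustrative assumptions, not the authors' implementation or experimental setup, and Chinese word segmentation is assumed to have been done beforehand.

```python
import math
import random


def _normalize(row):
    """Scale a non-negative row to sum to 1 (uniform if the row is all zero)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [v / total for v in row]


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit P(z|d) and P(w|z) by EM on counts[d][w] (hypothetical helper)."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random positive initialisation; every row is a probability distribution.
    p_z_d = [_normalize([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [_normalize([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                norm = sum(post)
                if norm == 0:
                    continue
                # Accumulate expected counts n(d,w) * P(z|d,w) for the M-step.
                for z in range(n_topics):
                    r = n_dw * post[z] / norm
                    new_z_d[d][z] += r
                    new_w_z[z][w] += r
        # M-step: renormalise the expected counts into distributions.
        p_z_d = [_normalize(row) for row in new_z_d]
        p_w_z = [_normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


def word_given_doc(p_z_d, p_w_z, d, w):
    """P(w|d) = sum over z of P(w|z) * P(z|d)."""
    return sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))


def query_log_score(p_z_d, p_w_z, d, query_word_ids):
    """Assumed scoring rule: log-likelihood of the query words under P(w|d)."""
    return sum(math.log(max(word_given_doc(p_z_d, p_w_z, d, w), 1e-12))
               for w in query_word_ids)


# Toy data (made up): 4 documents over a 4-word vocabulary.
counts = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
p_z_d, p_w_z = plsa(counts, n_topics=2, n_iter=30)
ranking = sorted(range(len(counts)),
                 key=lambda d: query_log_score(p_z_d, p_w_z, d, [0]),
                 reverse=True)
```

Because documents are scored through the latent variables rather than by exact term overlap, a document can receive a non-negligible score for a query word it never contains, which is the mechanism by which such a model may help with synonymy. A vector space baseline would instead compare weighted term vectors directly.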
