
计算机工程 ›› 2010, Vol. 36 ›› Issue (19): 81-83. doi: 10.3969/j.issn.1000-3428.2010.19.028

• Software Technology and Database •

Topic Words Extraction Method Based on LDA Model

SHI Jing1, LI Wan-long1,2

  1. College of Computer Science and Engineering, Changchun University of Technology, Changchun 130012, China; 2. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Online: 2010-10-05  Published: 2010-09-27
  • About the authors: SHI Jing (b. 1970), female, lecturer, Ph.D.; her research interest is Chinese information processing. LI Wan-long, professor.
  • Funding:
    Supported by the Doctoral Foundation of Changchun University of Technology (No. 2008A02)


Abstract: The LDA model is used to represent the probability distribution of words in a text, and keywords expressing its topics are extracted via Shannon information. Background-vocabulary clustering and topic-word association extend the topic words beyond the text under analysis, in an attempt to mine the text's topical implications. Model fitting is performed with a fast Gibbs sampling algorithm. Experimental results show that the fast Gibbs algorithm runs about 5 times faster than the conventional Gibbs algorithm, with high accuracy and extraction efficiency.

Key words: LDA model, Gibbs sampling, topic word extraction

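The pipeline the abstract describes (fit an LDA model, then rank each topic's words by a Shannon-information score) can be sketched roughly as below. This is only an illustration under stated assumptions: it uses a plain collapsed Gibbs sampler on a toy corpus, not the paper's fast Gibbs variant, and since the paper's exact scoring formula is not reproduced here, the score p(w|k)·log(p(w|k)/p(w)) is an assumed KL-style stand-in; the function names and corpus are hypothetical.

```python
import math
import random

def gibbs_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Plain collapsed Gibbs sampling for LDA (NOT the paper's fast variant)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    w2i = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]       # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                        # tokens per topic
    z = []                                     # topic assignment per token
    for d, doc in enumerate(docs):             # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w2i[w]] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]; wi = w2i[w]
                # remove token, then resample its topic from the conditional
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                r = rng.random() * sum(weights)
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    # smoothed topic-word distributions p(w|k)
    phi = [[(nkw[k][v] + beta) / (nk[k] + V * beta) for v in range(V)]
           for k in range(n_topics)]
    return vocab, phi

def topic_keywords(vocab, phi, top_n=3):
    """Rank words per topic by p(w|k) * log(p(w|k)/p(w)) -- an assumed
    Shannon-information style score, not the paper's exact formula."""
    V, K = len(vocab), len(phi)
    pw = [sum(phi[k][v] for k in range(K)) / K for v in range(V)]  # background p(w)
    keywords = []
    for k in range(K):
        order = sorted(range(V),
                       key=lambda v: -phi[k][v] * math.log(phi[k][v] / pw[v]))
        keywords.append([vocab[v] for v in order[:top_n]])
    return keywords
```

For example, `topic_keywords(*gibbs_lda(docs, 2))` on a small two-theme corpus returns the highest-scoring words per topic; the paper's speedup comes from replacing the per-token multinomial sampling step above with a faster sampling scheme.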

CLC number: