
Computer Engineering

• Artificial Intelligence and Recognition Technology •

A Continuous-time Topic Model for Word Burstiness

LIU Liangxuan, HUANG Mengxing

  1. (College of Information Science and Technology, Hainan University, Haikou 570228, China)
  • Received: 2015-09-29 Online: 2016-11-15 Published: 2016-11-15
  • About the authors: LIU Liangxuan (born 1987), male, master's student; main research interests are text mining and machine learning. HUANG Mengxing, professor and doctoral supervisor.
  • Funding:
    National Natural Science Foundation of China (61462022).

Abstract: Traditional topic models based on the multinomial distribution cannot adequately capture word burstiness in documents. To address this, and to exploit the temporal information inherent in a corpus, a continuous-time topic model for word burstiness based on the Dirichlet Compound Multinomial (DCM) distribution is proposed. The model uses the DCM distribution to model word burstiness and the Beta distribution to characterize the temporal features of the corpus; its parameters are estimated by Gibbs sampling and fixed-point iteration. Experimental results show that, when the preset number of topics is small, the model has a clear advantage in generalization performance over the ToT and DCMLDA models, and that it can effectively reveal the latent topic evolution trends in the corpus.

Key words: topic model, Latent Dirichlet Allocation (LDA), word burstiness, Dirichlet Compound Multinomial (DCM), Gibbs sampling, fixed-point iteration method
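
To illustrate why a DCM distribution is better suited to word burstiness than a multinomial, the following minimal sketch compares the two on a toy four-word vocabulary. It is not the model proposed in the paper: the function names, the concentration parameters `alpha` and the uniform word probabilities `probs` are all assumptions chosen for illustration, and only the standard DCM (Polya) probability mass function is used.

```python
import math


def dcm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a DCM (Polya) distribution."""
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: log n! - sum_w log x_w!
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Normalising ratio: log Gamma(A) - log Gamma(A + n), with A = sum of alpha
    log_norm = math.lgamma(a) - math.lgamma(a + n)
    # Per-word terms: log Gamma(x_w + alpha_w) - log Gamma(alpha_w)
    log_words = sum(
        math.lgamma(x + al) - math.lgamma(al) for x, al in zip(counts, alpha)
    )
    return log_coef + log_norm + log_words


def multinomial_log_pmf(counts, probs):
    """Log-probability of a word-count vector under a multinomial distribution."""
    n = sum(counts)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    return log_coef + sum(x * math.log(p) for x, p in zip(counts, probs) if x > 0)


# Toy documents of four tokens each (illustrative values, not from the paper).
bursty = [4, 0, 0, 0]   # one word repeated four times
spread = [1, 1, 1, 1]   # four different words, once each

alpha = [0.1] * 4       # assumed small DCM parameters, which favour repetition
probs = [0.25] * 4      # assumed uniform multinomial word probabilities

print("DCM         bursty:", dcm_log_pmf(bursty, alpha))
print("DCM         spread:", dcm_log_pmf(spread, alpha))
print("Multinomial bursty:", multinomial_log_pmf(bursty, probs))
print("Multinomial spread:", multinomial_log_pmf(spread, probs))
```

With these assumed parameters, the DCM assigns a higher probability to the document that repeats a single word, whereas the uniform multinomial prefers the document whose words are all different. This is the burstiness effect that motivates replacing the multinomial with a DCM.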

CLC Number: