Computer Engineering

• Advanced Computing and Data Processing

Microblog Hot Topics Detection Based on Heat Matrix

NIE Wenhui 1, ZENG Cheng 1, JIA Dawen 2

  1. State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430079, China
  2. The 28th Research Institute of China Electronics Technology Group Corporation, Beijing 100010, China
  • Received: 2016-01-18 Online: 2017-02-15 Published: 2017-02-15
  • About the authors: NIE Wenhui (born 1991), male, M.S., main research interests: data mining and services computing; ZENG Cheng, associate professor, postdoctoral researcher; JIA Dawen, Ph.D.
  • Funding:
    Key Program of the National Natural Science Foundation of China (U1135005).

Abstract: Existing models for detecting hot topics in microblogs are sensitive to the volume of microblog data and are slow at detection. To address this, the paper proposes a topic model based on a heat matrix. The heat matrix is used to obtain the heat of each latent topic and its topic-word probability distribution, and the heat shared between words is used to mine their semantic relationships, so that hot topics and hot words in the data can be identified accurately. Experimental results on real microblog data show that, compared with the Latent Dirichlet Allocation (LDA) model, the proposed model achieves higher efficiency and accuracy, and the hot topics it finds are consistent with real-time events, indicating good hot-spot identification.

Key words: heat matrix, topic model, microblog, topic detection, text mining
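
The abstract does not give the definition of the heat matrix, so the following is only an illustrative sketch and not the authors' model. It assumes that every post carries a numeric heat score (for example, reposts plus comments; this scoring is hypothetical), that a word's heat is the summed heat of the posts containing it, and that the heat shared by two words is the summed heat of the posts containing both. The function names `build_heat_matrix` and `hottest_words` are invented for this example.

```python
from collections import defaultdict
from itertools import combinations


def build_heat_matrix(posts):
    """Build a sparse, symmetric word-by-word heat matrix.

    posts: iterable of (tokens, post_heat) pairs. The per-post heat score
    is an assumption of this sketch, not something the abstract defines.
    Diagonal entries hold a word's total heat; off-diagonal entries hold
    the heat shared by two words that appear in the same post.
    """
    heat = defaultdict(float)
    for tokens, post_heat in posts:
        words = sorted(set(tokens))  # count each word once per post
        for w in words:
            heat[(w, w)] += post_heat
        for a, b in combinations(words, 2):
            heat[(a, b)] += post_heat
            heat[(b, a)] += post_heat
    return dict(heat)


def hottest_words(heat, k):
    """Return the k words with the largest diagonal heat, ties broken by word."""
    diagonal = [(a, h) for (a, b), h in heat.items() if a == b]
    return sorted(diagonal, key=lambda item: (-item[1], item[0]))[:k]


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    posts = [
        (["earthquake", "rescue", "sichuan"], 120.0),
        (["earthquake", "sichuan"], 80.0),
        (["concert", "tickets"], 30.0),
        (["rescue", "earthquake"], 50.0),
    ]
    H = build_heat_matrix(posts)
    print(hottest_words(H, 2))
    print(H[("earthquake", "sichuan")])
```

Under these assumptions, words that repeatedly appear together in high-heat posts accumulate large shared heat, which is one simple way to read the abstract's idea of using shared heat to relate words; the paper's actual topic model and its probability distributions are not reproduced here.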
