
Computer Engineering (计算机工程)

• Advanced Computing and Data Processing

Microblog Topic Mining Based on Hashtag

LI Jing, YIN Jian, LIU Shaopeng, PAN Yali

  1. (Department of Computer Science, School of Information Science and Technology, Sun Yat-sen University, Guangzhou 510006, China)
  • Received: 2014-04-29 Online: 2015-04-15 Published: 2015-04-15
  • About the authors: LI Jing (born 1988), male, master's student, main research interests: text mining and machine learning; YIN Jian (corresponding author), professor, Ph.D.; LIU Shaopeng, Ph.D.; PAN Yali, master's student.
  • Funding:
    National Natural Science Foundation of China (61033010, 61272065); Natural Science Foundation of Guangdong Province (S2011020001182, S2012010009311); Science and Technology Planning Project of Guangdong Province (2011B040200007, 2012A010701013).

Abstract: With the development of the Internet, microblogs have become a major platform for obtaining information. To mine valuable topic information from massive volumes of microblogs, this paper uses three features of microblogs (conversation tags, retweet tags, and hashtags) to divide microblogs into three types: user-interest, user-interaction, and hashtag-related microblogs. It proposes a hashtag topic model, the Hashtag Conversation Author Topic Model (HC-ATM), based on the Author Topic Model (ATM), and uses Gibbs sampling for inference to obtain the topic structure of microblogs. Experiments on a Twitter dataset show that, compared with ATM and MicroBlog Latent Dirichlet Allocation (MB-LDA), HC-ATM achieves lower topic perplexity and larger divergence between topics (measured by KL-divergence), and that it can effectively mine the topic distributions of the different microblog types.

Key words: topic mining, microblog, social network, hashtag topic model, Author Topic Model (ATM)
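
The abstract evaluates the models by perplexity and KL-divergence. As a rough illustration only, the sketch below computes both quantities using their standard textbook definitions, which may differ in detail from the definitions used in the paper. The function names and the `theta`/`phi` layout (per-document topic proportions and per-topic word distributions) are assumptions made for this example, not the authors' code.

```python
import math

def perplexity(docs, theta, phi):
    """Held-out perplexity under a fitted topic model (standard definition).

    docs  : list of documents, each a list of integer word ids
    theta : theta[d][k] = P(topic k | document d)   (assumed layout)
    phi   : phi[k][w]   = P(word w | topic k)       (assumed layout)
    """
    log_lik = 0.0
    n_tokens = 0
    for d, words in enumerate(docs):
        for w in words:
            # Marginal word probability: mix the topic-word distributions
            # by the document's topic proportions.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    Assumes q[i] > 0 wherever p[i] > 0, which holds for smoothed
    (Dirichlet-prior) estimates such as those from Gibbs sampling.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_pairwise_kl(phi):
    """Average KL-divergence over all ordered pairs of distinct topics."""
    k = len(phi)
    total = sum(
        kl_divergence(phi[i], phi[j])
        for i in range(k) for j in range(k) if i != j
    )
    return total / (k * (k - 1))
```

Under these definitions, lower perplexity means the model predicts held-out words better, and a larger average pairwise KL-divergence means the learned topics are more distinct from one another.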

CLC number: