
Computer Engineering, 2017, Vol. 43, Issue (12): 184-191. doi: 10.3969/j.issn.1000-3428.2017.12.034

• Artificial Intelligence and Recognition Technology •

Social Media Topic Recognition Based on Word Embedding and Probabilistic Topic Model

YU Chong, LI Jing, SUN Xudong, FU Xianghua

  1. (College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518000, China)
  • Received: 2016-11-04 Online: 2017-12-15 Published: 2017-12-15
  • About the authors: YU Chong (born 1991), male, master's degree, main research interests: data mining and topic recognition; LI Jing and SUN Xudong, master's degrees; FU Xianghua, professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61472258); Shenzhen Basic Research Program (JCYJ20140509172609162).

Abstract: Word embedding can capture the semantic information of words from a large corpus, and combining it with a probabilistic topic model addresses the lack of semantic information in standard topic models. This paper therefore proposes a Word-Topic Mixture (WTM) model that improves word representation and the topic model simultaneously. First, an external corpus is introduced into the Topic Word Embedding (TWE) model to obtain the initial topic and word representations. Then, by defining the conditional probability distributions of topic vectors and word embeddings, the word embedding feature representation and the topic vectors are integrated into the topic model, while the KL divergence between the new word-topic distribution function and the original word-topic distribution function is minimized. Experimental results show that the WTM model performs better on word representation and topic detection than the Word2vec, TWE, Latent Dirichlet Allocation (LDA), and LFLDA models.

Key words: social media, topic detection, feature representation, word embedding, topic model, Word-Topic Mixture (WTM) model
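
The abstract mentions minimizing the KL divergence between a new word-topic distribution and the original one. As a rough illustration of that quantity only (not the authors' training procedure), the following sketch computes the KL divergence between two discrete word-topic distributions. The function name `kl_divergence`, the direction KL(new || original), the four-word vocabulary, and the probability values are all assumptions made for this example; the abstract does not specify them.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) for two discrete distributions over the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            # eps guards against log(0) when q assigns zero mass to a word
            total += pi * math.log(pi / max(qi, eps))
    return total


# Hypothetical word-topic distributions for one topic over a 4-word vocabulary.
original_phi = [0.50, 0.30, 0.15, 0.05]  # assumed: from a standard topic model
new_phi = [0.45, 0.35, 0.15, 0.05]       # assumed: from an embedding-based model

# Assumed direction: divergence of the new distribution from the original one.
divergence = kl_divergence(new_phi, original_phi)
```

A smaller value means the new distribution stays closer to the original one; the value is zero only when the two distributions are identical.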

CLC Number: