
Computer Engineering

• Artificial Intelligence and Recognition Technology •

Topical Text Network Construction Method Based on Gibbs Sampling Results

ZHANG Zhiyuan, YANG Hongjing, ZHAO Yue

  1. (School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China)
  • Received: 2016-08-22 Online: 2017-06-15 Published: 2017-06-15
  • About the authors: ZHANG Zhiyuan (born 1978), male, associate professor; research interests: text mining, data warehousing, and complex networks. YANG Hongjing and ZHAO Yue, master's students.
  • Funding:
    National Natural Science Foundation of China (61201414); Fundamental Research Funds for the Central Universities (3122016D021).

Abstract: Mining the probability distribution of topic words in a document collection gives a summary-level understanding of the documents' content. Exploring how words are connected within a given topic goes further: it enriches the meaning of the topic words and shows the hierarchy and aggregation of the topic in finer detail. For labeled document collections, this paper proposes a method that computes the probability of two words co-occurring under a given topic from the Gibbs sampling results of a Labeled Latent Dirichlet Allocation (LDA) analysis, and constructs a topical text network on that basis. Compared with the Pointwise Labeled-LDA (PL-LDA) model, the method does not expand the original documents, requires less computation, and takes less time. Experiments on an aviation safety report data set show that, for topics with many labeled words, the method displays the distribution of topic words and the complex relationships among them well.

Key words: topic model, text network, Gibbs sampling, Latent Dirichlet Allocation (LDA), aviation safety report
