
Computer Engineering

• Advanced Computing and Data Processing •

Topic Change Detection Method Based on Joint Nonnegative Matrix Factorization

CHEN Mengwei 1, LU Zhao 1, CUI Xiutao 2

  (1. Department of Computer Science and Technology, East China Normal University, Shanghai 200241, China; 2. Shanghai Changjiang Computer Co., Ltd., Shanghai 200241, China)
  • Received: 2016-12-21 Online: 2018-01-15 Published: 2017-01-15
  • About the authors: CHEN Mengwei (born 1987), male, master's student, whose main research interests are big data analysis and knowledge processing; LU Zhao, associate professor; CUI Xiutao, senior engineer, Ph.D.
  • Funding:
    Scientific Research Program of the Science and Technology Commission of Shanghai Municipality (16511102702); Project of the Shanghai Municipal Commission of Economy and Informatization (150643).

Abstract: In large-scale temporal document collections, methods for discovering similar and distinct topics lack the ability to identify, track, and analyze how topics change over time. To this end, a topic change detection method for temporal document corpora is proposed. The method discovers similar topics and distinct topics in a temporal document corpus. An improved joint Nonnegative Matrix Factorization (NMF) algorithm is used to extract topic sets from multiple data sets. To avoid introducing noise topics, the topic entropy of every topic is computed in order to obtain high-quality topics, and word clouds and trend graphs are used to analyze the trend of topic change. Experimental results on two real data sets, 20Newsgroups and LTN2011, show that the method can effectively discover similar and distinct topics in temporal document collections, and that the extracted topics are of good quality and high accuracy.

Key words: joint Nonnegative Matrix Factorization (NMF), topic model, temporal similar and distinct topics, high-quality topic, topic change detection
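
The abstract mentions computing a topic entropy for every topic to filter out noise topics, but it does not give the formula. The following is only a minimal sketch under stated assumptions: it takes topic entropy to be the Shannon entropy of a topic's normalized word weights, and it assumes that lower-entropy (more concentrated) topics are the ones kept. The names `topic_entropy`, `select_topics`, and `max_entropy`, as well as the example weights, are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, List


def topic_entropy(weights: List[float]) -> float:
    """Shannon entropy (in nats) of one topic's nonnegative word weights.

    Assumption: the weights are normalized to a probability distribution
    over the vocabulary before the entropy is computed.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("topic has no positive word weight")
    entropy = 0.0
    for w in weights:
        if w > 0:  # zero-weight words contribute nothing to the entropy
            p = w / total
            entropy -= p * math.log(p)
    return entropy


def select_topics(topics: Dict[str, List[float]], max_entropy: float) -> List[str]:
    """Keep topics whose entropy is at most max_entropy (an assumed rule)."""
    return [name for name, weights in topics.items()
            if topic_entropy(weights) <= max_entropy]


# Hypothetical word weights over a four-word vocabulary.
topics = {
    "focused": [0.9, 0.05, 0.03, 0.02],   # mass concentrated on one word
    "diffuse": [0.25, 0.25, 0.25, 0.25],  # uniform, entropy equals log(4)
}
kept = select_topics(topics, max_entropy=1.0)
```

In this sketch the uniform topic has the maximum possible entropy for four words, so a threshold below `math.log(4)` discards it while the concentrated topic is retained. In practice the word weights would come from the topic-word factor produced by the factorization, and the threshold would need tuning per corpus.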
