Computer Engineering ›› 2019, Vol. 45 ›› Issue (12): 171-175. doi: 10.19678/j.issn.1000-3428.0053019

• Artificial Intelligence and Recognition Technology •

Mining Method of Academic Research Hotspot Based on Spark

ZHANG Cong, YI Xiushuang, ZHU Minghao, WANG Xingwei

  1. School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
  • Received: 2018-10-29 Revised: 2019-01-08 Published: 2019-01-29
  • About the authors: ZHANG Cong (born 1991), male, master's student, whose main research interests are data processing and network security; YI Xiushuang (corresponding author), professor, Ph.D.; ZHU Minghao, master's student; WANG Xingwei, professor and doctoral supervisor.
  • Funding:
    National Natural Science Foundation of China (61572123); Liaoning Province University Innovation Team Support Program (LT2016007); CERNET Next Generation Internet Technology Innovation Project (NGII20160616).

Abstract: This paper proposes an improved method for mining academic research hotspots by optimizing the Latent Dirichlet Allocation (LDA) topic model in the Spark Machine Learning Library (MLlib). LDA is used to model the keywords of academic papers, and perplexity is used to determine the optimal number of topics. The document-topic and topic-term probability distribution matrices are then transformed into document-topic and topic-term rating matrices. On this basis, the topics are ranked by computing the similarity between background topics and each topic in the rating matrices, so as to mine the research hotspots in academic papers. Experimental results show that the proposed method improves the mining performance of the LDA topic model and helps discover valuable academic research hotspot topics.

Key words: academic paper, Latent Dirichlet Allocation (LDA), background topic, topic sorting, research hotspot
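
The two quantities the abstract relies on, perplexity and similarity to a background topic, can be illustrated with a small sketch in plain Python. It is not the authors' Spark MLlib implementation. The abstract does not say how the rating matrices are built, which similarity measure is used, or how the background topic is defined, so the sketch makes its own assumptions: it works directly on topic-term probability rows, uses cosine similarity, takes a uniform distribution as the background topic, and treats topics least similar to the background as the most interesting. The function names `perplexity`, `cosine`, and `rank_topics` are hypothetical.

```python
import math


def perplexity(token_probs_per_doc):
    """Corpus perplexity: exp(-(sum of log p(w)) / (total number of tokens)).

    token_probs_per_doc holds, for each document, the model probability of
    every token in that document (a hypothetical input format).
    """
    log_sum = sum(math.log(p) for doc in token_probs_per_doc for p in doc)
    n_tokens = sum(len(doc) for doc in token_probs_per_doc)
    return math.exp(-log_sum / n_tokens)


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_topics(topic_term, background):
    """Return topic indices ordered from least to most similar to the background.

    Assumption: a topic that resembles the background distribution carries
    little specific information, so it is ranked last.
    """
    sims = [cosine(row, background) for row in topic_term]
    return sorted(range(len(topic_term)), key=lambda k: sims[k])


# Toy topic-term matrix (3 topics over 4 terms); the values are illustrative only.
topic_term = [
    [0.70, 0.10, 0.10, 0.10],  # concentrated on one term
    [0.25, 0.25, 0.25, 0.25],  # identical to the uniform background
    [0.40, 0.40, 0.10, 0.10],  # concentrated on two terms
]
background = [0.25, 0.25, 0.25, 0.25]  # assumed uniform background topic

ranking = rank_topics(topic_term, background)
print(ranking)  # [0, 2, 1]
```

To choose the number of topics, one would typically train a model for each candidate value, compute the perplexity on held-out documents, and keep the value with the lowest perplexity; whether the paper follows exactly this procedure is not stated in the abstract.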

CLC number: