
Computer Engineering (计算机工程)

• Artificial Intelligence and Recognition Technology •

基于回归模型与谱聚类的微博突发话题检测方法

彭敏 1, 张泰玮 2, 黄佳佳 1, 朱佳晖 1, 黄济民 1

  1. 武汉大学计算机学院, 武汉 430072; 2. 武汉大学深圳研究院, 广东 深圳 518000
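  • Received: 2014-10-10 Publication date: 2015-12-15 Release date: 2015-12-15
  • About the authors: PENG Min (born 1973), female, professor, Ph.D.; main research interests: information retrieval, network services, and natural language processing. ZHANG Taiwei, master's student; HUANG Jiajia, doctoral student; ZHU Jiahui, master's student; HUANG Jimin, bachelor's degree.
  • Funding:
    National Natural Science Foundation of China project "Topic Evolution Analysis and Propagation Trend Prediction in Social Networks" (61472291); Shenzhen Knowledge Innovation Program Basic Research Fund project "Compressed-Sensing-Based Topic Extraction and Evolution Analysis in Social Networks".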

Microblog Sudden Topic Detection Method Based on Regression Models and Spectral Clustering

PENG Min 1, ZHANG Taiwei 2, HUANG Jiajia 1, ZHU Jiahui 1, HUANG Jimin 1

  1. School of Computer Science, Wuhan University, Wuhan 430072, China; 2. Shenzhen Institute, Wuhan University, Shenzhen 518000, China
  • Received: 2014-10-10 Online: 2015-12-15 Published: 2015-12-15

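Abstract (translated from the Chinese original): Short texts on microblog social networks are massive in scale, spread quickly, come in diverse modalities, and are of low quality. As a result, traditional topic detection and tracking techniques face high complexity, sparse features, and noise interference when processing such data. To address this, a sudden topic detection method based on regression prediction and spectral clustering is proposed. Targeting the trend of keyword frequency changes, the method uses a regression model to quantify how suddenly each microblog keyword bursts, and accurately extracts the set of sudden words from the perspective of word frequency trend analysis. A sudden-word clustering method based on spectral clustering is then designed to improve the accuracy of the clustering results. Experimental results on a large-scale microblog data set show that, compared with the baseline method, the proposed method achieves considerably higher accuracy, recall, and F value, and has good application prospects in microblog information analysis.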

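Keywords (translated from the Chinese original): microblog, sudden topic detection, word frequency analysis, regression model, spectral clustering, big data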

Abstract: Microblog short texts are characterized by massive data volume, fast propagation, diverse modalities, and low quality. As a result, existing topic detection and tracking methods face high complexity, sparse features, and noise interference when processing such data. This paper presents a microblog sudden topic detection method based on regression models and spectral clustering. The method uses regression models to quantify how suddenly each microblog keyword's frequency changes, and extracts the set of sudden words by analyzing word frequency trends. A sudden-word clustering method based on spectral clustering is designed to improve clustering accuracy. Experimental results on a microblog data set show that, compared with the baseline method, the proposed method achieves higher accuracy, recall, and F value.
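The two steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the regression model, the burst score, or the affinity between words, so the straight-line fit, the residual-based score, the `+ 1.0` smoothing term, and the co-occurrence weight matrix below are all assumptions, as are the function names `burst_score` and `spectral_bipartition`. A two-way split stands in for general k-way spectral clustering.

```python
import math


def burst_score(history, current):
    """Score how far `current` exceeds a linear-trend forecast.

    Assumption: a least-squares line y = a + b*t is fitted to the past
    counts in `history`, extrapolated one step ahead, and the excess is
    scaled by the residual standard deviation of the fit.
    """
    n = len(history)
    if n < 3:
        raise ValueError("need at least 3 past observations")
    t_mean = (n - 1) / 2
    y_mean = sum(history) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    residuals = [y - (intercept + slope * t) for t, y in enumerate(history)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    predicted = intercept + slope * n
    # The + 1.0 is an assumed smoothing term so flat histories do not
    # divide by zero; it is not taken from the paper.
    return (current - predicted) / (sigma + 1.0)


def spectral_bipartition(weights, iterations=500):
    """Split words into two groups from a symmetric affinity matrix.

    Uses the sign of the second eigenvector of D^-1/2 W D^-1/2, found by
    power iteration with the known leading eigenvector removed.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    if min(degree) <= 0:
        raise ValueError("every word needs at least one positive link")
    root = [math.sqrt(d) for d in degree]
    total = math.sqrt(sum(degree))
    top = [r / total for r in root]  # leading eigenvector (eigenvalue 1)
    vec = [float(i + 1) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        # Deflate: remove the component along the leading eigenvector.
        dot = sum(v * t for v, t in zip(vec, top))
        vec = [v - dot * t for v, t in zip(vec, top)]
        # Multiply by (M + I); the shift keeps all eigenvalues non-negative.
        vec = [
            vec[i] + sum(weights[i][j] * vec[j] / (root[i] * root[j])
                         for j in range(n))
            for i in range(n)
        ]
        length = math.sqrt(sum(v * v for v in vec))
        vec = [v / length for v in vec]
    return [1 if v > 0 else 0 for v in vec]


if __name__ == "__main__":
    # Hypothetical hourly counts for one keyword, then a spike.
    print(burst_score([10, 11, 10, 12, 11], 60))
    # Hypothetical co-occurrence counts among six sudden words.
    cooccurrence = [
        [0, 5, 4, 0, 0, 0],
        [5, 0, 6, 1, 0, 0],
        [4, 6, 0, 0, 0, 0],
        [0, 1, 0, 0, 5, 6],
        [0, 0, 0, 5, 0, 4],
        [0, 0, 0, 6, 4, 0],
    ]
    print(spectral_bipartition(cooccurrence))
```

In this sketch a keyword would be flagged as sudden when its score exceeds a chosen threshold, and the flagged words would then be grouped by their co-occurrence so that each group reads as one candidate topic. Which group receives label 0 or 1 is arbitrary; only the grouping is meaningful.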

Key words: microblog, sudden topic detection, word frequency analysis, regression model, spectral clustering, big data

CLC number: