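Abstract: Short texts on microblog social networks are massive in scale, spread rapidly, take diverse modalities, and are of low quality, so traditional topic detection and tracking techniques face high complexity, sparse features, and noise interference when processing them. To address this, an emerging topic detection method based on regression prediction and spectral clustering is proposed. Targeting the trend of keyword frequency changes, the method uses a regression model to quantify how strongly each microblog keyword is bursting, and accurately extracts the set of emerging words from the perspective of word frequency trend analysis. An emerging word clustering method based on spectral clustering is then designed to improve the accuracy of the clustering results. Experimental results on a large-scale microblog dataset show that, compared with the baseline method, the proposed method achieves considerably higher accuracy, recall, and F value, and has good application prospects in microblog information analysis.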
Key words:
microblog,
sudden topic detection,
word frequency analysis,
regression model,
spectral clustering,
big data
Abstract: Short texts on microblog social networks are characterized by massive data volume, fast propagation, diverse modalities, and low quality. As a result, existing traditional topic detection and tracking methods face challenges such as high complexity, sparse features, and noise interference when processing such data. This paper presents a microblog emerging topic detection method based on regression models and spectral clustering. The method applies regression models to keyword frequency trends to quantify how strongly each microblog keyword is bursting, and extracts the set of emerging words by analyzing these frequency trends. An emerging word clustering method based on spectral clustering is designed to improve clustering accuracy. Experimental results on a microblog dataset show that, compared with the baseline method, the proposed method achieves better accuracy and a higher recall rate and F value.
Key words:
microblog,
sudden topic detection,
word frequency analysis,
regression model,
spectral clustering,
big data
CLC number:
彭敏,张泰玮,黄佳佳,朱佳晖,黄济民. 基于回归模型与谱聚类的微博突发话题检测方法[J]. 计算机工程.
PENG Min, ZHANG Taiwei, HUANG Jiajia, ZHU Jiahui, HUANG Jimin. Microblog Sudden Topic Detection Method Based on Regression Models and Spectral Clustering[J]. Computer Engineering.
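
The two stages summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the regression model, the burst measure, or the affinity definition, so everything here is an assumption made for illustration. The hypothetical `burst_score` fits a straight line to a keyword's past per-window counts and measures how far the newest count exceeds the forecast. The hypothetical `spectral_bipartition` splits words into two groups by the sign of the second eigenvector of the normalized affinity matrix, approximated by power iteration, where the affinity is assumed to be something like word co-occurrence.

```python
import math


def burst_score(history, current):
    """Score how far `current` exceeds a linear-trend forecast from `history`.

    Assumption: a simple least-squares line stands in for the regression model.
    """
    n = len(history)
    if n < 3:
        raise ValueError("need at least three historical windows")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual standard error of the fit (n - 2 degrees of freedom).
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in enumerate(history))
    sigma = math.sqrt(sse / (n - 2))
    predicted = intercept + slope * n  # forecast for the next window
    # The +1.0 floor keeps the score finite when the history fits perfectly.
    return (current - predicted) / (sigma + 1.0)


def spectral_bipartition(weights, iterations=200):
    """Split nodes into two groups from a symmetric, non-negative affinity matrix.

    Assumption: two clusters only; the split follows the sign of the second
    eigenvector of D^-1/2 W D^-1/2, found by deflated power iteration.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    if n < 2 or min(degree) <= 0:
        raise ValueError("every node needs a positive total affinity")
    root = [math.sqrt(d) for d in degree]
    scale = math.sqrt(sum(degree))
    top = [r / scale for r in root]  # known leading eigenvector (eigenvalue 1)

    def step(v):
        # Multiply by (M + I); the shift makes all eigenvalues non-negative.
        out = [
            v[i] + sum(weights[i][j] * v[j] / (root[i] * root[j]) for j in range(n))
            for i in range(n)
        ]
        # Remove the leading eigenvector so the iteration finds the second one.
        proj = sum(o * t for o, t in zip(out, top))
        out = [o - proj * t for o, t in zip(out, top)]
        norm = math.sqrt(sum(o * o for o in out))
        return [o / norm for o in out]

    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        v = step(v)
    return [0 if x < 0 else 1 for x in v]


# Illustrative data: two tightly linked word groups joined by one weak link.
affinity = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels = spectral_bipartition(affinity)
score = burst_score([10, 12, 11, 13, 12], 60)
```

Which group receives label 0 or 1 is arbitrary, because an eigenvector's sign is not determined. A fuller treatment would use several eigenvectors followed by k-means to obtain more than two topics.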