Computer Engineering

• Advanced Computing and Data Processing •

Semantic-based Concept Drift Detection Algorithm for Text Data Stream

CHU Guang, HU Xuegang, ZHANG Yuhong

  1. (School of Computer and Information, Hefei University of Technology, Hefei 230009, China)
  • Received: 2017-01-04 Online: 2018-02-15 Published: 2018-02-15
  • About the authors: CHU Guang (born 1990), male, master's student, main research interest: data mining; HU Xuegang, professor and doctoral supervisor; ZHANG Yuhong, associate professor, Ph.D.
  • Funding:
    National Key Research and Development Program of China (2016YFC0801406); National Natural Science Foundation of China (61503112, 61673152).

Abstract: In text data streams, frequent concept drift leaves too little useful information between drifts, which lowers the accuracy of both drift detection and stream classification. To address this problem, this paper introduces the Latent Dirichlet Allocation (LDA) model, takes into account the semantic information implicit in the text data stream, and proposes a new concept drift detection algorithm. The algorithm computes the semantic similarity between adjacent data chunks in both the word feature space and the topic feature space, where topic similarity is evaluated from the topic-word probability distributions. A concept drift is reported when the similarity is low in both feature spaces. Experimental results show that, compared with the DDM, CDRDT, DWCDS, HDDM-W-Test, and REDLLA algorithms, the proposed algorithm improves concept drift detection performance on text data streams and, in particular, significantly reduces the number of missed drifts when concepts drift frequently.
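
The decision rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the similarity measures or thresholds, so this sketch assumes cosine similarity over term frequencies for the word space, one minus the Jensen-Shannon divergence with best-match averaging for the topic space, and thresholds of 0.5. All function and parameter names are hypothetical, and the topic-word distributions are taken as given input rather than fitted here.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_space_similarity(chunk_a, chunk_b):
    """Word-space similarity of two chunks of tokenized documents.

    Assumption: term-frequency vectors compared by cosine similarity.
    """
    tf_a = Counter(w for doc in chunk_a for w in doc)
    tf_b = Counter(w for doc in chunk_b for w in doc)
    return cosine_similarity(tf_a, tf_b)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, range [0, 1]) of two word distributions."""
    div = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        m = 0.5 * (pw + qw)
        if pw > 0.0:
            div += 0.5 * pw * math.log2(pw / m)
        if qw > 0.0:
            div += 0.5 * qw * math.log2(qw / m)
    return div


def topic_space_similarity(topics_a, topics_b):
    """Topic-space similarity from topic-word probability distributions.

    Assumption: each topic in chunk A is matched to its most similar
    topic in chunk B, and the best-match scores are averaged.
    """
    if not topics_a or not topics_b:
        return 0.0
    best = [max(1.0 - js_divergence(ta, tb) for tb in topics_b) for ta in topics_a]
    return sum(best) / len(best)


def drift_detected(chunk_a, chunk_b, topics_a, topics_b,
                   word_threshold=0.5, topic_threshold=0.5):
    """Report drift only when BOTH feature spaces show low similarity.

    The threshold values are illustrative assumptions.
    """
    return (word_space_similarity(chunk_a, chunk_b) < word_threshold
            and topic_space_similarity(topics_a, topics_b) < topic_threshold)


if __name__ == "__main__":
    sports = [["goal", "match", "team"], ["team", "score"]]
    finance = [["stock", "market"], ["market", "price"]]
    sports_topics = [{"goal": 0.5, "team": 0.5}]
    finance_topics = [{"stock": 0.5, "market": 0.5}]
    print(drift_detected(sports, finance, sports_topics, finance_topics))
    print(drift_detected(sports, sports, sports_topics, sports_topics))
```

In this sketch each topic is a dict mapping words to probabilities, as could be extracted from an LDA model fitted on each chunk. Requiring low similarity in both spaces means a change in vocabulary alone, with topics unchanged, is not reported as drift.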

Keywords: concept drift, semantics, drift detection, Latent Dirichlet Allocation (LDA) model, text data stream classification
