Computer Engineering ›› 2023, Vol. 49 ›› Issue (10): 145-153. doi: 10.19678/j.issn.1000-3428.0065972

• Artificial Intelligence and Pattern Recognition •

基于情节记忆的高效短文本流聚类算法

刘子健1, 王勇2, 刘媛妮3, 周由胜1,3   

  1. 重庆邮电大学 计算机科学与技术学院, 重庆 400065
    2. 大唐微电子技术有限公司, 北京 100094
    3. 重庆邮电大学 网络空间安全与信息法学院, 重庆 400065
  • 收稿日期:2022-10-11 出版日期:2023-10-15 发布日期:2023-01-12
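  • About the authors:

    Zijian LIU (born 1999), male, master's student; main research interest: data mining

    Yong WANG, senior engineer, M.S.

    Yuanni LIU, associate professor, Ph.D.

    Yousheng ZHOU, professor, Ph.D.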

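  • Supported by:
    National Natural Science Foundation of China (62272076); General Program of the Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX0343); General Program of the Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX1021); Science and Technology Research Program of Chongqing Municipal Education Commission (KJZD-K20200602)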

Efficient Clustering Algorithm of Short Text Streams Based on Episodic Memory

Zijian LIU1, Yong WANG2, Yuanni LIU3, Yousheng ZHOU1,3   

  1. College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
    2. Datang Microelectronics Technology Co., Ltd., Beijing 100094, China
    3. College of Cyberspace Security and Information Law, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2022-10-11 Online: 2023-01-12 Published: 2023-10-15

摘要:

现有基于相似度的短文本流聚类算法多数需要手动设置相似度阈值,且难以处理文本稀疏性问题。针对短文本流的特点和传统流聚类算法的局限性,提出基于情节记忆的短文本流聚类算法。将情节记忆思想融入流聚类算法,通过稀疏经验重放增强聚类的特征表示,并使用反向索引提高聚类效率。在线阶段通过新的相似度计算公式以及动态计算相似度阈值,将当前文本分配到现有集群或新集群,并且不断更新聚类特征。离线阶段通过聚类增强、语义再分配以及删除过时聚类,提高整体算法性能。基于公开和合成数据集的实验结果表明,相较于基准流聚类算法,所提算法在各项评价指标上均取得了较好的实验结果,并且对于文本数量较大的数据集,运行时间能减少1~3个数量级。

关键词: 文本流聚类, 短文本流, 情节记忆, 主题演化, 文本特征

Abstract:

Most existing similarity-based clustering algorithms for short text streams require a manually set similarity threshold and handle text sparsity poorly. To address the characteristics of short text streams and the limitations of traditional stream clustering algorithms, a short text stream clustering algorithm based on episodic memory is proposed. The idea of episodic memory is integrated into stream clustering: sparse experience replay enhances the feature representation of clusters, and an inverted index improves clustering efficiency. In the online phase, each incoming text is assigned to an existing cluster or to a new cluster using a new similarity formula and a dynamically computed similarity threshold, and the cluster features are updated continuously. In the offline phase, overall performance is improved through cluster enhancement, semantic reassignment, and deletion of outdated clusters. Experiments on public and synthetic datasets show that the proposed algorithm achieves better results than the baseline stream clustering algorithms on all evaluation metrics, and that on datasets with a large number of texts the running time is reduced by 1 to 3 orders of magnitude.
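
The online phase described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the similarity formula or the threshold rule, so the token-overlap similarity, the "mean minus one standard deviation" threshold, the fallback value 0.5, and the class name `StreamClusterer` are all assumptions chosen for illustration. The sketch only shows how an inverted index can limit comparisons to clusters that share a term with the incoming text, and how a threshold can be derived from past assignments instead of being set by hand.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


class StreamClusterer:
    """Toy online clusterer for short texts (hypothetical, for illustration)."""

    def __init__(self, fallback_threshold=0.5):
        self.clusters = {}              # cluster id -> Counter of term frequencies
        self.index = defaultdict(set)   # inverted index: term -> ids of clusters containing it
        self.history = []               # similarities of past accepted assignments
        self.fallback_threshold = fallback_threshold  # assumed value, used until history exists

    @staticmethod
    def similarity(terms, cluster_terms):
        # Assumed measure: fraction of the text's tokens already seen in the cluster.
        return sum(1 for t in terms if t in cluster_terms) / len(terms)

    def threshold(self):
        # Assumed rule: mean minus one standard deviation of past similarities.
        if len(self.history) < 2:
            return self.fallback_threshold
        return max(0.0, mean(self.history) - pstdev(self.history))

    def add(self, text):
        terms = text.lower().split()
        if not terms:
            raise ValueError("empty text")

        # Use the inverted index to collect only clusters sharing at least one term.
        candidates = set()
        for t in terms:
            candidates |= self.index.get(t, set())

        # Pick the most similar candidate cluster, if any.
        best_id, best_sim = None, 0.0
        for cid in sorted(candidates):
            sim = self.similarity(terms, self.clusters[cid])
            if sim > best_sim:
                best_id, best_sim = cid, sim

        if best_id is not None and best_sim >= self.threshold():
            self.history.append(best_sim)       # accepted: feeds the dynamic threshold
        else:
            best_id = len(self.clusters)        # otherwise open a new cluster
            self.clusters[best_id] = Counter()

        # Update the cluster features and the inverted index.
        self.clusters[best_id].update(terms)
        for t in terms:
            self.index[t].add(best_id)
        return best_id
```

Calling `add` on each text in arrival order returns the id of the cluster it was assigned to. The offline steps named in the abstract (cluster enhancement, semantic reassignment, deletion of outdated clusters) are not covered by this sketch.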

Key words: text stream clustering, short text stream, episodic memory, topic evolution, text feature