Efficient Clustering Algorithm of Short Text Streams Based on Episodic Memory

Computer Engineering, 2023, Vol. 49, Issue (10): 145-153. doi: 10.19678/j.issn.1000-3428.0065972

Zijian LIU¹, Yong WANG², Yuanni LIU³, Yousheng ZHOU¹,³

1. College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
2. Datang Microelectronics Technology Co., Ltd., Beijing 100094, China
3. College of Cyberspace Security and Information Law, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

Received: 2022-10-11 Publication date: 2023-10-15 Release date: 2023-01-12
About the authors:
Zijian LIU (born 1999), male, master's student; main research interest: data mining
Yong WANG, senior engineer, M.S.
Yuanni LIU, associate professor, Ph.D.
Yousheng ZHOU, professor, Ph.D.
Funding:
National Natural Science Foundation of China (62272076); General Program of the Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX0343); General Program of the Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX1021); Science and Technology Research Program of Chongqing Municipal Education Commission (KJZD-K20200602)

Abstract

Most existing similarity-based short text stream clustering algorithms require a manually set similarity threshold and struggle with text sparsity. To address the characteristics of short text streams and the limitations of traditional stream clustering algorithms, a short text stream clustering algorithm based on episodic memory is proposed. The idea of episodic memory is integrated into the stream clustering algorithm: sparse experience replay strengthens the feature representation of the clusters, and a reverse index improves clustering efficiency. In the online stage, a new similarity formula and a dynamically computed similarity threshold are used to assign the current text to an existing cluster or to a new cluster, and the cluster features are updated continuously. In the offline stage, overall performance is improved through cluster enhancement, semantic reassignment, and deletion of outdated clusters. Experiments on public and synthetic data sets show that the proposed algorithm achieves better results on the evaluation metrics than the benchmark stream clustering algorithms, and that for data sets with a large number of texts the running time can be reduced by 1 to 3 orders of magnitude.

Key words: text stream clustering, short text stream, episodic memory, topic evolution, text feature
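The online stage described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the similarity formula or the threshold rule, so both are stand-ins chosen for illustration: similarity is the share of a text's words already present in a cluster, and the threshold is the running mean of the best scores seen so far. The class `OnlineClusterer` and all of its attribute names are hypothetical. Episodic memory, experience replay, and the offline stage are not modelled here.

```python
from collections import Counter, defaultdict
from statistics import mean


class OnlineClusterer:
    """Toy online stage: reverse-index lookup plus a data-driven threshold."""

    def __init__(self):
        self.features = {}               # forward index: cluster id -> word counts
        self.reverse = defaultdict(set)  # reverse index: word -> cluster ids
        self.history = []                # best similarity seen for each past text

    def _similarity(self, words, cid):
        # Stand-in similarity (an assumption, not the paper's formula):
        # fraction of the text's words that already occur in the cluster.
        feats = self.features[cid]
        return sum(1 for w in words if w in feats) / len(words)

    def assign(self, text):
        words = set(text.lower().split())
        if not words:
            raise ValueError("text has no words")
        # The reverse index restricts scoring to clusters sharing a word.
        candidates = set()
        for w in words:
            if w in self.reverse:
                candidates |= self.reverse[w]
        scores = {cid: self._similarity(words, cid) for cid in candidates}
        best = max(scores, key=scores.get) if scores else None
        best_score = scores[best] if best is not None else 0.0
        # Stand-in dynamic threshold (an assumption): mean of past best scores.
        threshold = mean(self.history) if self.history else 0.0
        if best is not None and best_score > threshold:
            cid = best
        else:
            cid = len(self.features)     # open a new cluster
            self.features[cid] = Counter()
        # Update the cluster features and both indexes.
        self.features[cid].update(words)
        for w in words:
            self.reverse[w].add(cid)
        self.history.append(best_score)
        return cid


stream = [
    "apple iphone launch",
    "apple iphone launch event",
    "football match score",
    "football score update",
    "apple pie recipe",
]
clusterer = OnlineClusterer()
labels = [clusterer.assign(t) for t in stream]
print(labels)  # [0, 0, 1, 1, 2]
```

In this toy stream the last text shares the word "apple" with cluster 0, but its score of 1/3 falls below the running threshold of about 0.354, so it opens a new cluster without any manually set cutoff.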

Zijian LIU, Yong WANG, Yuanni LIU, Yousheng ZHOU. Efficient Clustering Algorithm of Short Text Streams Based on Episodic Memory[J]. Computer Engineering, 2023, 49(10): 145-153.

http://www.ecice06.com/CN/Y2023/V49/I10/145

Figures/Tables: 10

Fig.1 Structure of the episodic memory module

Fig.2 Cluster ID-feature forward/reverse index

Fig.3 Overall procedure of the algorithm

Fig.4 Influence of the memory size on the normalized mutual information index

Fig.5 Influence of the replay interval on the normalized mutual information index

Fig.6 Influence of the number of replay texts on the normalized mutual information index

Fig.7 Influence of the number of replay texts on the running time of the algorithm

