An Incremental Text Clustering Algorithm Based on Cluster Congruence

doi:10.3969/j.issn.1000-3428.2014.06.042

Computer Engineering (计算机工程)

TAO Shu-yi¹, WANG Ming-wen¹, WAN Jian-yi¹, LUO Yuan-sheng², ZUO Jia-li³

(1. School of Computer Information Engineering, Jiangxi Normal University, Nanchang 330022, China; 2. Network Information Management Center, Jiangxi University of Finance and Economics, Nanchang 330013, China; 3. School of Elementary Education, Jiangxi Normal University, Nanchang 330027, China)

Received: 2013-02-28 Online: 2014-06-15 Published: 2014-06-13
About the authors: TAO Shu-yi (b. 1988), female, master's student, research interests: information retrieval and data mining; WANG Ming-wen, professor and doctoral supervisor; WAN Jian-yi, professor; LUO Yuan-sheng, lecturer, master's degree; ZUO Jia-li, lecturer, Ph.D.
Funding: National Natural Science Foundation of China (61272212).

Abstract

Abstract: Traditional text clustering methods are suited only to static samples and have high time complexity. To address these problems, this paper proposes an Incremental Text Clustering Algorithm Based on Congruence (ITCAC) between text and cluster. The algorithm uses a text representation model based on the semantic similarity of terms, exploits the semantic information between terms, and clusters incrementally by computing the congruence between each new document and the existing clusters. After a portion of the documents has been processed incrementally, the algorithm reassigns the documents that are most likely to have been misclassified, which further improves clustering performance. When the data keep growing or being updated, the algorithm avoids a large amount of repeated computation. Experimental results on the 20 Newsgroups dataset show that, compared with the k-means algorithm and the SHC algorithm, the new algorithm reduces clustering time and improves clustering performance.

Key words: text clustering, incremental clustering, semantic similarity, cluster congruence, text redistribution
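The two-phase flow described in the abstract (assign each incoming document to the most congruent existing cluster, then revisit documents that may have been misplaced) can be sketched as follows. This is a minimal illustration and not the authors' ITCAC implementation: the abstract does not give the congruence formula, so cosine similarity against a cluster's summed term vector is used as a stand-in, raw term counts replace the paper's semantic-similarity text model, and the class name `IncrementalClusterer`, the `threshold` value, and the reassignment rule are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class IncrementalClusterer:
    """Hypothetical sketch: cosine similarity stands in for congruence."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed cut-off for opening a new cluster
        self.sums = []    # one summed term vector per cluster
        self.docs = {}    # doc id -> term-count vector
        self.labels = {}  # doc id -> cluster index

    def _attach(self, doc_id, k):
        self.sums[k].update(self.docs[doc_id])
        self.labels[doc_id] = k

    def _detach(self, doc_id):
        k = self.labels.pop(doc_id)
        self.sums[k].subtract(self.docs[doc_id])

    def add(self, doc_id, text):
        """Assign a new document without re-clustering the old ones."""
        vec = Counter(text.lower().split())
        self.docs[doc_id] = vec
        scores = [cosine(vec, s) for s in self.sums]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < self.threshold:
            self.sums.append(Counter())  # no cluster fits: open a new one
            best = len(self.sums) - 1
        self._attach(doc_id, best)
        return best

    def reassign(self):
        """Move documents that now fit another cluster better.

        Each document is scored against every cluster with its own
        contribution removed; it moves only if another cluster scores
        strictly higher and clears the threshold. Returns the move count.
        """
        moved = 0
        for doc_id in list(self.labels):
            old = self.labels[doc_id]
            self._detach(doc_id)
            scores = [cosine(self.docs[doc_id], s) for s in self.sums]
            best = max(range(len(scores)), key=scores.__getitem__)
            better = scores[best] > scores[old] and scores[best] >= self.threshold
            new = best if better else old
            self._attach(doc_id, new)
            moved += new != old
        return moved
```

Calling `add` once per arriving document and `reassign` after each batch mirrors the order of operations in the abstract. Because each new document is compared only with per-cluster summaries, earlier documents are not re-clustered on arrival, which is the source of the time saving the abstract reports.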

CLC Number: TP18

TAO Shu-yi, WANG Ming-wen, WAN Jian-yi, LUO Yuan-sheng, ZUO Jia-li. An Incremental Text Clustering Algorithm Based on Cluster Congruence[J]. Computer Engineering.

https://www.ecice06.com/CN/Y2014/V40/I6/195

Editor: JIN Hu-kao

