XML Documents Clustering Based on Improved k-medoids Algorithm

doi:10.3969/j.issn.1000-3428.2015.09.010

Computer Engineering

FENG Shaorong, PAN Weiwei, LIN Ziyu

(School of Information Science and Engineering, Xiamen University, Xiamen 361005, China)

Received: 2014-09-01 Online: 2015-09-15 Published: 2015-09-15
About the authors: FENG Shaorong (born 1964), male, associate professor, Ph.D., main research interests: machine learning and data mining; PAN Weiwei, master's student; LIN Ziyu, lecturer, Ph.D.
Funding:
National Natural Science Foundation of China (61303004); Major Program of the National Social Science Foundation of China (13&ZD148); Natural Science Foundation of Fujian Province (2013J05099).

Abstract

Abstract: Because of its extensibility, semi-structured nature, and self-describing format, eXtensible Markup Language (XML) has become the standard data format for data representation and exchange. An efficient, fast XML document clustering mechanism can substantially shorten information retrieval time, improve the efficiency of data queries, and uncover potentially valuable information. To this end, an improved k-medoids algorithm is proposed for clustering XML documents, building on a study of the structure of XML documents and the similarity between them. Fuzzy clustering is used to determine the number of clusters, and the global search capability of a Genetic Algorithm (GA) is used to find the best cluster centers (medoids), thereby improving clustering quality on large-scale XML document collections. Experimental results show that, compared with clustering based on the traditional k-medoids algorithm, the improved method achieves higher clustering accuracy and better convergence.

Key words: XML documents clustering, Genetic Algorithm (GA), fuzzy clustering, k-medoids clustering, clustering number, clustering center
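
The abstract names a genetic algorithm as the means of searching for good medoids. The following is a minimal sketch of that general idea, not the authors' implementation: the chromosome encoding, the crossover and mutation operators, the parameter values, and the function names (`total_cost`, `assign`, `ga_k_medoids`) are all assumptions made for illustration. It works on a precomputed distance matrix; in the paper's setting the distances would come from an XML structural similarity measure, whereas the toy data here are points on a line. The number of clusters `k` is passed in directly, so the fuzzy-clustering step that the paper uses to choose it is not covered.

```python
import random

def total_cost(dist, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def assign(dist, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(medoids, key=lambda m: dist[i][m]) for i in range(len(dist))]

def ga_k_medoids(dist, k, pop_size=20, generations=50, mutation_rate=0.2, seed=0):
    """Search for k medoids with a simple genetic algorithm (illustrative only)."""
    rng = random.Random(seed)
    n = len(dist)
    # Each chromosome is a sorted tuple of k distinct point indices.
    population = [tuple(sorted(rng.sample(range(n), k))) for _ in range(pop_size)]
    for _ in range(generations):
        # Elitist selection: keep the cheaper half of the population.
        population.sort(key=lambda c: total_cost(dist, c))
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Crossover: draw k medoids from the union of both parents.
            pool = sorted(set(a) | set(b))
            child = set(rng.sample(pool, k))
            # Mutation: replace one medoid with a random non-medoid point.
            if rng.random() < mutation_rate:
                child.remove(rng.choice(sorted(child)))
                candidates = [i for i in range(n) if i not in child]
                child.add(rng.choice(candidates))
            children.append(tuple(sorted(child)))
        population = survivors + children
    return min(population, key=lambda c: total_cost(dist, c))

# Toy usage: two well-separated groups of points on a line.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
medoids = ga_k_medoids(dist, k=2)
labels = assign(dist, medoids)
```

Because the best chromosomes always survive into the next generation, the cost of the best solution never increases from one generation to the next. Plain k-medoids refines a single starting solution by local swaps and can stall in a local optimum, which is the weakness a population-based search is meant to reduce.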

CLC number:

TP311

FENG Shaorong, PAN Weiwei, LIN Ziyu. XML Documents Clustering Based on Improved k-medoids Algorithm[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2015.09.010.

http://www.ecice06.com/CN/Y2015/V41/I9/56

Editor: JIN Hukao

