Sequence Similarity Measurement Method Based on Symbol Entropy

doi:10.3969/j.issn.1000-3428.2016.05.034

Computer Engineering (计算机工程)

ZHANG Hao, CHEN Lifei, GUO Gongde

(Fujian Province Key Laboratory of Network Security and Password Technology, School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007, China)

Received: 2015-03-09 Online: 2016-05-15 Published: 2016-05-13

About the authors: ZHANG Hao (born 1987), male, master's student, main research interest: data mining; CHEN Lifei, associate professor, Ph.D.; GUO Gongde, professor, Ph.D.

Funding: General Program of the National Natural Science Foundation of China, "Research on Event Sequence Mining Methods for Software Behavior Identification" (61175123); Innovation Team Fund of Fujian Normal University (IRTL1207).

Abstract

Existing sequence similarity measures consider only the local similarity of subsequences and ignore the global structure of the sequences to which those subsequences belong. To address this, a sequence similarity measure based on the entropy of individual symbols is proposed, in which the entropy of a symbol is computed from the positions and the number of its occurrences within the same sequence. The measure is validated through agglomerative hierarchical clustering. Experimental results on symbol sequence datasets from several domains show that, compared with existing methods based on the local similarity of subsequences, the proposed measure effectively improves clustering quality.

Key words: symbol sequence, similarity, entropy, hierarchical clustering, sequence clustering
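The abstract states only that a symbol's entropy is derived from the positions and the number of its occurrences within one sequence; the exact formula is not given on this page. The sketch below is therefore an illustration under assumed definitions, not the authors' method. It assumes that the 1-based positions of a symbol are normalised into weights and that the Shannon entropy of those weights serves as the symbol's entropy. The cosine comparison of the resulting per-symbol profiles, and the function names `symbol_entropy_profile` and `entropy_similarity`, are likewise hypothetical choices.

```python
import math
from collections import defaultdict


def symbol_entropy_profile(seq):
    """Map each symbol in seq to an entropy value based on its positions.

    ASSUMPTION (not from the paper): the 1-based positions p_1..p_k of a
    symbol are normalised to weights p_i / sum(p), and the Shannon entropy
    of those weights is returned. A symbol that occurs once gets entropy 0.
    """
    positions = defaultdict(list)
    for index, symbol in enumerate(seq, start=1):
        positions[symbol].append(index)

    profile = {}
    for symbol, pos in positions.items():
        total = sum(pos)
        profile[symbol] = -sum((p / total) * math.log(p / total) for p in pos)
    return profile


def entropy_similarity(seq_a, seq_b):
    """Cosine similarity of two entropy profiles (a hypothetical choice).

    Symbols missing from one sequence contribute an entropy of 0 there.
    Returns 0.0 when either profile is all zeros.
    """
    profile_a = symbol_entropy_profile(seq_a)
    profile_b = symbol_entropy_profile(seq_b)
    symbols = set(profile_a) | set(profile_b)
    dot = sum(profile_a.get(s, 0.0) * profile_b.get(s, 0.0) for s in symbols)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because each value depends on every occurrence of a symbol across the whole sequence, the profile reflects global structure rather than local subsequence matches. A distance such as `1 - entropy_similarity(a, b)` could then be fed to any agglomerative hierarchical clustering routine. One limitation of this assumed definition is that sequences in which every symbol occurs only once have an all-zero profile and are scored 0.0.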

Chinese Library Classification (CLC) number: TP18

ZHANG Hao, CHEN Lifei, GUO Gongde. Sequence Similarity Measurement Method Based on Symbol Entropy[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2016.05.034.

http://www.ecice06.com/CN/Y2016/V42/I5/201


