
Computer Engineering (计算机工程)

• Artificial Intelligence and Recognition Technology •

Sequence Similarity Measurement Method Based on Symbol Entropy

ZHANG Hao, CHEN Lifei, GUO Gongde

  1. (Fujian Province Key Laboratory of Network Security and Password Technology, School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007, China)
  • Received: 2015-03-09 Published: 2016-05-15 Released online: 2016-05-13
  • About the authors: ZHANG Hao (born 1987), male, master's student, main research interest: data mining; CHEN Lifei, associate professor, Ph.D.; GUO Gongde, professor, Ph.D.
  • Supported by:
    General Program of the National Natural Science Foundation of China, "Research on Event Sequence Mining Methods for Software Behavior Identification" (61175123); Innovation Team Fund of Fujian Normal University (IRTL1207).

Abstract: Existing sequence similarity measures consider only local similarity when comparing subsequences and ignore the global structural information of the sequences to which those subsequences belong. To address this, a sequence similarity measurement method based on the entropy of individual symbols is proposed. The entropy of a symbol is computed from the positions and the number of occurrences of that symbol within a sequence. The method is validated through agglomerative hierarchical clustering. Experimental results on symbol sequence datasets from several domains show that, compared with existing methods based on the local similarity of subsequences, the proposed measure effectively improves the quality of the clustering results.

Key words: symbol sequence, similarity, entropy, hierarchical clustering, sequence clustering
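The abstract does not state the entropy formula, so the following Python sketch is only an illustration of the general idea, not the authors' definition. It assumes a hypothetical symbol entropy in which the occurrence positions of a symbol are normalized into weights and fed into the Shannon entropy formula, so that both the positions and the count of a symbol affect the result. The function names (`symbol_entropy`, `entropy_profile`, `distance`, `agglomerative`), the Euclidean distance between entropy profiles, and the average-linkage merge rule are likewise assumptions made for this example.

```python
import math
from itertools import combinations


def symbol_entropy(seq, symbol):
    """Hypothetical position-based entropy of one symbol in one sequence.

    Assumption: each 1-based occurrence position p is turned into a weight
    p / sum(positions), and the Shannon entropy of those weights is returned.
    """
    positions = [i + 1 for i, s in enumerate(seq) if s == symbol]
    if not positions:
        return 0.0  # symbol absent: no positional information
    total = sum(positions)
    return abs(-sum((p / total) * math.log(p / total) for p in positions))


def entropy_profile(seq, alphabet):
    """One entropy value per symbol of the alphabet, in alphabet order."""
    return [symbol_entropy(seq, a) for a in alphabet]


def distance(seq_a, seq_b, alphabet):
    """Assumed dissimilarity: Euclidean distance between entropy profiles."""
    return math.dist(entropy_profile(seq_a, alphabet),
                     entropy_profile(seq_b, alphabet))


def agglomerative(seqs, alphabet, k):
    """Average-linkage agglomerative clustering down to k clusters.

    Returns a list of clusters, each a list of indices into seqs.
    """
    clusters = [[i] for i in range(len(seqs))]

    def linkage(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(distance(seqs[i], seqs[j], alphabet) for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        # Find the closest pair of clusters (a < b) and merge b into a.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    seqs = ["aabb", "aabb", "bbaa", "bbaa"]
    print(agglomerative(seqs, "ab", 2))
```

Because each sequence is summarized by per-symbol entropies over the whole sequence, the comparison in this sketch depends on global structure rather than on matching local subsequences, which is the contrast the abstract draws. A real evaluation would use the paper's own entropy definition and datasets.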

Chinese Library Classification (CLC) number: