
Computer Engineering (计算机工程) ›› 2022, Vol. 48 ›› Issue (10): 88-94. doi: 10.19678/j.issn.1000-3428.0062832

• Artificial Intelligence and Pattern Recognition •

Time-Series Semantic Mining Algorithm Based on Sub-Series Similarity

LU Yi1, WANG Peng2, WANG Wei2

  1. School of Software, Fudan University, Shanghai 201203, China;
  2. School of Computer Science, Fudan University, Shanghai 201203, China
  • Received: 2021-09-28 Revised: 2021-11-26 Published: 2022-10-09
  • About the authors: LU Yi (born 1997), female, master's student, whose main research interests are time-series data analysis and mining; WANG Peng and WANG Wei, professors and doctoral supervisors.
  • Funding:
    National Key Research and Development Program of China (2020YFB1710001).

Abstract: A time series is a sequence of values obtained by measuring an object or system continuously at equal intervals. Mining the latent semantic information in a time series is essential for discovering the regularities of a system's operation or identifying sudden anomalies. However, most current time-series semantic mining algorithms impose constraints on the characteristics of the data, which makes it difficult for them to handle massive volumes of time-series data with widely varying characteristics. To address this problem, a time-series semantic mining algorithm based on sub-series similarity is proposed. By computing the similarity of sub-series, the algorithm partitions the time series into a sequence of segments, applies two-level clustering, and identifies the underlying physical states in the time series. It also introduces a probability-based iterative mode that dynamically adjusts the probability of each sub-series being selected as a reference sub-series according to the candidate segmentation, ensuring that the reference sub-series cover all physical states. Experimental results show that the recognition accuracy of the algorithm exceeds 90% on each of five real datasets, including PAMAP and Barbet. Compared with the FLUSS, pHMM, and AutoPlait algorithms, the proposed algorithm achieves higher recognition accuracy and running efficiency, as well as greater generality.

Key words: time-series, semantic mining, similarity measurement, clustering, k Nearest Neighbor (kNN)
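
The idea of segmenting a time series by sub-series similarity can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not state which similarity measure is used, so z-normalized Euclidean distance is assumed here, the reference sub-series are picked by hand rather than by the probability-based iteration, and the two-level clustering step is omitted. All function names (`znorm`, `distance`, `label_windows`, `segment_boundaries`) are hypothetical.

```python
import math


def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std < 1e-12:
        return [0.0] * n
    return [(x - mean) / std for x in window]


def distance(a, b):
    """Euclidean distance between two equal-length windows after z-normalization
    (an assumed similarity measure, not taken from the paper)."""
    za, zb = znorm(a), znorm(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(za, zb)))


def label_windows(series, window_len, references):
    """Label each non-overlapping window with the index of its nearest reference."""
    labels = []
    for start in range(0, len(series) - window_len + 1, window_len):
        window = series[start:start + window_len]
        dists = [distance(window, ref) for ref in references]
        labels.append(dists.index(min(dists)))
    return labels


def segment_boundaries(labels, window_len):
    """Sample positions at which the nearest-reference label changes."""
    return [i * window_len for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


# Toy series with two "physical states": a slow sine followed by a fast sine.
state_a = [math.sin(2 * math.pi * t / 16) for t in range(64)]
state_b = [math.sin(2 * math.pi * t / 4) for t in range(64)]
series = state_a + state_b

# Hand-picked reference sub-series, one per state (an assumption of this sketch).
references = [series[0:16], series[64:80]]

labels = label_windows(series, 16, references)
boundaries = segment_boundaries(labels, 16)
print(labels)      # one label per 16-sample window
print(boundaries)  # positions where the state changes
```

In this toy setting the label sequence switches once, at the point where the slow sine gives way to the fast one. According to the abstract, the proposed algorithm instead chooses its reference sub-series through a probability-based iteration, so that every physical state ends up represented.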

CLC number: