
Computer Engineering, 2023, Vol. 49, Issue (7): 135-142. doi: 10.19678/j.issn.1000-3428.0065390

• Artificial Intelligence and Pattern Recognition •

Feature Selection of High-Dimensional Time-Series Data Based on Neighborhood Mutual Information

Xuan YANG, Jianmin MA*, Manjun ZHAO

  1. School of Science, Chang'an University, Xi'an 710064, China
  • Received: 2022-07-28 Published: 2023-07-15 Online: 2022-10-12
  • Contact: Jianmin MA
  • About the authors:

    Xuan YANG (born 1997), female, master's student; main research interest: statistical methods for data analysis

    Manjun ZHAO, master's student

  • Funding:
    National Natural Science Foundation of China (61772019)

Abstract:

As a data preprocessing method, feature selection aims to eliminate redundant and irrelevant attributes and retain those that contribute significantly to performance, thereby improving model accuracy and reducing computational complexity. Traditional feature selection methods are mostly based on cross-sectional data, and the high-dimensional time-series data that abound in practice have received little attention. Existing feature selection algorithms do not consider the influence of interdependence between attributes, which degrades classification performance. Therefore, this study proposes a feature selection method for high-dimensional time-series data based on neighborhood mutual information. A time-series information system is constructed and a time-series neighborhood relation is proposed, under which information measures such as time-series neighborhood entropy, time-series neighborhood joint entropy, and time-series neighborhood mutual information are introduced. The nearest and farthest neighbor feature selection algorithm (Algorithm 1) is extended to high-dimensional time-series data: attribute importance is defined to identify the features with better classification performance, and a cumulative importance contribution rate is introduced to control the number of selected features. The nearest and farthest neighbor mutual information feature selection algorithm (Algorithm 2) is then designed. It first obtains an initial feature set with strong classification ability according to a threshold, then defines attribute redundancy by time-series neighborhood mutual information and removes the attributes in the initial feature set that have the lowest importance and the largest dependence, yielding the final feature subset. Experimental results on UCR datasets show that, compared with the original data and the proposed Algorithm 1, the proposed Algorithm 2 achieves average improvements of 13.69% and 6.70% in the optimal value range and classification accuracy, respectively, indicating that the method is effective and advantageous for feature selection on high-dimensional time-series data.

Key words: high-dimensional time-series data, rough set, neighborhood relationship, neighborhood mutual information, nearest and farthest neighbor, feature selection
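
The two selection stages outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the data layout `X[sample, attribute, time]`, the Euclidean distance between series with a radius `delta`, the commonly used neighborhood mutual information form NMI(R; S) = -(1/n) Σ log(|δ_R(x_i)|·|δ_S(x_i)| / (n·|δ_{R∪S}(x_i)|)) with the joint neighborhood taken as the intersection of the single-attribute neighborhoods, and the hypothetical redundancy cutoff `max_nmi`. The attribute importance scores are taken as given input, because the abstract does not state how the nearest and farthest neighbor importance is computed.

```python
import numpy as np


def attribute_neighborhoods(X, delta):
    """Boolean neighborhood relation for every attribute.

    X has the assumed shape (n_samples, n_attributes, n_timesteps).
    Two samples are neighbors under an attribute when the Euclidean
    distance between their series is at most delta (an assumption).
    """
    relations = []
    for a in range(X.shape[1]):
        series = X[:, a, :]
        diff = series[:, None, :] - series[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=-1))
        relations.append(dist <= delta)
    return relations


def neighborhood_mutual_information(nb_r, nb_s):
    """NMI between two attributes given their neighborhood relations.

    The joint neighborhood is the intersection of the two relations.
    Every sample is its own neighbor, so no size is ever zero.
    """
    n = nb_r.shape[0]
    size_r = nb_r.sum(axis=1)
    size_s = nb_s.sum(axis=1)
    size_rs = (nb_r & nb_s).sum(axis=1)
    return float(-np.mean(np.log(size_r * size_s / (n * size_rs))))


def select_by_cumulative_contribution(importance, threshold):
    """Indices of the most important attributes, in descending order,
    forming the shortest prefix whose share of the total importance
    reaches the threshold (for example 0.85)."""
    importance = np.asarray(importance, dtype=float)
    order = np.argsort(-importance, kind="stable")
    cumulative = np.cumsum(importance[order]) / importance.sum()
    k = min(int(np.searchsorted(cumulative, threshold)) + 1, len(order))
    return order[:k].tolist()


def drop_redundant(selected, neighborhoods, max_nmi):
    """Greedy redundancy filter (hypothetical rule).

    `selected` must be ordered by descending importance. An attribute
    is dropped when its NMI with any already kept, more important
    attribute exceeds max_nmi.
    """
    kept = []
    for a in selected:
        if all(
            neighborhood_mutual_information(neighborhoods[a], neighborhoods[b]) <= max_nmi
            for b in kept
        ):
            kept.append(a)
    return kept
```

A typical call sequence would be `attribute_neighborhoods(X, delta)`, then `select_by_cumulative_contribution(importance, threshold)` to obtain the initial feature set, then `drop_redundant(initial, neighborhoods, max_nmi)` for the final subset. The greedy filter is one simple way to remove attributes that are both less important and highly dependent on others; the paper's own redundancy criterion may differ.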