Computer Engineering ›› 2022, Vol. 48 ›› Issue (2): 113-124. doi: 10.19678/j.issn.1000-3428.0060594

• Artificial Intelligence and Pattern Recognition •

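Fisher Score Fast Multi-Label Feature Selection Algorithm Based on Text Classification

WANG Zhengkai1, SHEN Dongsheng2, WANG Chenxi2

  1. Fujian Key Laboratory of Granular Computing and Application, Zhangzhou, Fujian 363000, China;
    2. College of Computer, Minnan Normal University, Zhangzhou, Fujian 363000, China
  • Received: 2021-01-14 Revised: 2021-02-23 Published: 2021-02-26
  • About the authors: WANG Zhengkai (born 1995), male, master's student; main research interests: multi-label learning and machine learning. SHEN Dongsheng and WANG Chenxi, associate professors, master's degrees.
  • Funding:
    Natural Science Foundation of Fujian Province (2020J01811).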

Abstract: Fisher Score (FS) is a fast and efficient indicator of a feature's classification ability. However, the traditional FS cannot be applied directly to multi-label learning, nor can it effectively handle the deviation between the computed class center and the actual class center caused by extreme sample values. This paper proposes an FS-based multi-label feature selection algorithm that combines centroid shift with the association of multi-label sets. For each label, the algorithm finds the extreme point of each class of samples and uses the distance from that point to the class center, multiplied by a radius coefficient, to screen the samples, which yields a more densely distributed sample set on which the FS of each feature is computed. The algorithm then traverses every label in the label sets of all samples, adaptively assigning label weights to samples that carry more labels, and obtains the average FS of each feature. Features are ranked by this score, and the target subset is filtered out to complete feature selection. Parameter analysis and a comparison on 5 performance indicators are carried out on 8 public multi-label text datasets. The results show that the proposed algorithm is effective and robust to a certain degree, and that it outperforms multi-label feature selection algorithms such as MLNB, MLRF, PMU, and MLACO on most indicators.

Key words: multi-label classification, feature selection, Fisher Score (FS) index, distance measure, inter-class divergence

CLC number:
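
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, because the abstract does not give the exact formulas. The sketch assumes the classic Fisher Score (weighted between-class scatter divided by weighted within-class scatter, per feature), Euclidean distance to the class mean for the screening step, and a sample weight equal to the number of labels the sample carries. The function names, the `radius` default of 0.8, and the weighting rule are all hypothetical choices made for illustration.

```python
import numpy as np

EPS = 1e-12  # avoids division by zero for constant features


def trim_mask(Xc, radius):
    """Assumed screening rule: keep samples whose distance to the class mean
    is at most radius * (distance of the farthest sample)."""
    center = Xc.mean(axis=0)
    dist = np.linalg.norm(Xc - center, axis=1)
    mask = dist <= radius * dist.max()
    if not mask.any():  # degenerate case: never drop a whole class
        mask[:] = True
    return mask


def weighted_fisher(X, y, w):
    """Classic Fisher Score per feature, with sample weights w."""
    mu = np.average(X, axis=0, weights=w)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for cls in np.unique(y):
        m = y == cls
        mu_c = np.average(X[m], axis=0, weights=w[m])
        var_c = np.average((X[m] - mu_c) ** 2, axis=0, weights=w[m])
        between += w[m].sum() * (mu_c - mu) ** 2
        within += w[m].sum() * var_c
    return between / (within + EPS)


def multilabel_fisher_scores(X, Y, radius=0.8):
    """Average the per-label Fisher Scores over all labels in Y (n x q, 0/1)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    n, d = X.shape
    # Hypothetical weighting: samples with more labels count more.
    w = np.maximum(Y.sum(axis=1), 1).astype(float)
    scores = np.zeros(d)
    for j in range(Y.shape[1]):
        y = Y[:, j]
        keep = np.zeros(n, dtype=bool)
        for cls in np.unique(y):
            idx = np.flatnonzero(y == cls)
            keep[idx[trim_mask(X[idx], radius)]] = True
        scores += weighted_fisher(X[keep], y[keep], w[keep])
    return scores / Y.shape[1]


def select_features(X, Y, k, radius=0.8):
    """Return the indices of the k highest-scoring features."""
    scores = multilabel_fisher_scores(X, Y, radius)
    return np.argsort(scores)[::-1][:k]
```

Calling `select_features(X, Y, k)` with a feature matrix `X` (samples by features) and a binary label matrix `Y` (samples by labels) returns the indices of the `k` top-ranked features. This is a filter method: the ranking is computed once from the data and does not depend on any downstream classifier.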