Computer Engineering (计算机工程)

Semi-Supervised Feature Selection Based on Related Family

  • Published: 2026-04-14

Abstract: Semi-supervised feature selection is an effective tool in machine learning for processing large-scale partially labeled data. However, most existing feature selection algorithms suffer from insufficient computational efficiency, limited scalability, and inadequate accuracy. Related family is an efficient feature selection framework based on granular computing; although it performs well on large-scale data, it cannot handle partially labeled data. To address this, this paper proposes a semi-supervised feature selection algorithm based on related family (SRF). First, a redundancy-free granulation method, termed consistent granulation, and an importance degree matrix are introduced to construct a novel related family. On this basis, a semi-supervised feature evaluation method is designed that reduces the complexity of feature evaluation from quadratic to linear, effectively overcoming the bottlenecks in computational efficiency and scale. Second, to further improve classification performance, three strategies are adopted: 1) strengthening the data representation capability of information granules; 2) evaluating feature importance by jointly considering the consistency and the quality of information granules; and 3) predicting pseudo-labels from the selected high-quality feature subset to reduce noise interference. Experimental results on 12 public datasets show that, compared with four representative algorithms (SemiFREE, Semi2MNR, LMSFS, and GMSFS), SRF improves classification accuracy by 0.88%, 2.34%, 2.81%, and 2.58%, respectively, and improves computational efficiency by factors of 36.70, 841.56, 6.52, and 17.04, respectively. These results verify the effectiveness and efficiency of the proposed method in handling large-scale partially labeled data.
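
The abstract does not define consistent granulation or the importance degree matrix, so the following Python sketch is only a generic illustration of the ideas it names and should not be read as the SRF algorithm. It assumes discrete feature values and marks unlabeled samples with `None`; the helper names `granulate`, `consistency_score`, and `pseudo_label` are hypothetical. It shows how one feature can be scored in a single pass over the samples, and how labels might be propagated inside granules whose known labels agree.

```python
from collections import defaultdict


def granulate(values):
    """Group sample indices by feature value in one pass (linear in n)."""
    granules = defaultdict(list)
    for i, v in enumerate(values):
        granules[v].append(i)
    return list(granules.values())


def consistency_score(values, labels):
    """Fraction of labeled samples lying in granules whose known labels agree.

    Assumption for illustration only: `labels` uses None for unlabeled samples.
    """
    n_labeled = sum(1 for y in labels if y is not None)
    if n_labeled == 0:
        return 0.0
    covered = 0
    for granule in granulate(values):
        known = [labels[i] for i in granule if labels[i] is not None]
        if known and len(set(known)) == 1:
            covered += len(known)
    return covered / n_labeled


def pseudo_label(values, labels):
    """Copy the label of each label-consistent granule to its unlabeled members."""
    out = list(labels)
    for granule in granulate(values):
        known = {labels[i] for i in granule if labels[i] is not None}
        if len(known) == 1:
            y = next(iter(known))
            for i in granule:
                if out[i] is None:
                    out[i] = y
    return out


# Toy data (hypothetical): six samples, two discrete features, two missing labels.
feature_a = [0, 0, 1, 1, 2, 2]
feature_b = [0, 1, 0, 1, 0, 1]
labels = ["x", None, "y", "y", "x", None]

print(consistency_score(feature_a, labels))
print(consistency_score(feature_b, labels))
print(pseudo_label(feature_a, labels))
```

In this toy example `feature_a` separates the known labels cleanly while `feature_b` mixes them, so a selector built on such a score would prefer `feature_a` and then use it to fill in the missing labels.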