Semi-Supervised Feature Selection Based on Related Family

doi:10.19678/j.issn.1000-3428.0260007

Abstract

Semi-supervised feature selection is an effective tool in machine learning for processing large-scale partially labeled data. However, most existing feature selection algorithms suffer from insufficient computational efficiency, limited scalability, and inadequate accuracy. Related family is an efficient feature selection framework based on granular computing; it performs well in large-scale data scenarios but cannot handle partially labeled data. To address this, this paper proposes a semi-supervised feature selection algorithm based on related family (SRF). First, a redundancy-free granulation method, termed consistent granulation, and an importance degree matrix are introduced to construct a novel related family. On this basis, a semi-supervised feature evaluation method is designed that reduces the complexity of feature evaluation from quadratic to linear, overcoming bottlenecks in computational efficiency and scale. Second, to further improve classification performance, three strategies are adopted: 1) strengthening the data representation capability of information granules; 2) evaluating feature importance by jointly considering the consistency and the quality of information granules; and 3) predicting pseudo-labels from the selected high-quality feature subset to reduce noise interference. Experimental results on 12 public datasets show that, compared with four representative algorithms (SemiFREE, Semi2MNR, LMSFS, and GMSFS), SRF improves classification accuracy by 0.88%, 2.34%, 2.81%, and 2.58%, respectively, and improves computational efficiency by factors of 36.70, 841.56, 6.52, and 17.04, respectively. These results verify the effectiveness and efficiency of the proposed method in handling large-scale partially labeled data.
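The third strategy, predicting pseudo-labels from a selected feature subset, can be illustrated with a minimal sketch. This is not the SRF algorithm: the abstract does not specify how pseudo-labels are predicted, so the 1-nearest-neighbor rule, the function name `pseudo_label`, and the toy data are all assumptions chosen only to show why restricting the prediction to informative features can reduce noise.

```python
from math import dist


def pseudo_label(labeled, labels, unlabeled, feature_subset):
    """Give each unlabeled row the label of its nearest labeled row.

    Distances use only the columns listed in feature_subset.
    The 1-NN rule is an illustrative assumption, not the paper's method.
    """
    def project(row):
        return [row[j] for j in feature_subset]

    projected = [project(row) for row in labeled]
    result = []
    for row in unlabeled:
        p = project(row)
        nearest = min(range(len(projected)), key=lambda i: dist(p, projected[i]))
        result.append(labels[nearest])
    return result


# Hypothetical toy data: column 0 is informative, column 1 is noise.
labeled = [[0.0, 9.0], [1.0, 0.0]]
labels = ["a", "b"]
unlabeled = [[0.1, 0.5], [0.9, 8.0]]

all_features = pseudo_label(labeled, labels, unlabeled, [0, 1])
selected_only = pseudo_label(labeled, labels, unlabeled, [0])
print(all_features)   # noisy column dominates the distance
print(selected_only)  # labels follow the informative column
```

In this toy case the noisy column flips both pseudo-labels when all features are used, while the prediction restricted to the informative column follows the nearest labeled value.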

Kangyi Zheng, Ji Zhang, Bingyu Lin, Tian Yang, Ningyi Liu. Semi-Supervised Feature Selection Based on Related Family[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260007.

