Two-stage Text Feature Selection Algorithm Based on Differential Evolution

Computer Engineering, 2019, Vol. 45, Issue (2): 303-309, 314. doi: 10.19678/j.issn.1000-3428.0049701

XIAO Xiaoli^a,b, WU Yao^a,b, ZHOU Xiling^a,b, LIAO Zhuofan^a,b

a. College of Computer and Communication Engineering; b. Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, Changsha University of Science and Technology, Changsha 410114, China

Received: 2017-12-14 Online: 2019-02-15 Published: 2019-02-15
About the authors: XIAO Xiaoli (b. 1965), female, professor; main research interests are data mining, network security, mobile communication, and databases. WU Yao, master's student. ZHOU Xiling, master's degree. LIAO Zhuofan, Ph.D.
Funding:
National Natural Science Foundation of China (61402056).

Abstract:

To reduce the dimensionality of the text feature space and improve the efficiency of data mining, a two-stage text feature selection algorithm is proposed. In the first stage, the variance and mean-median methods are combined to construct a highly relevant feature subset, which performs an initial dimension reduction and serves as the initial feature population of the differential evolution (DE) algorithm. In the second stage, the fitness function is designed from the cumulative term frequency and the document frequency of the feature words, and multiple feature difference vectors together with the locally optimal feature are introduced into the mutation operation. This increases the perturbation of the feature subset and accelerates the convergence of the DE algorithm toward the optimal feature subset. Experiments on the WebKB and Reuters-21578 datasets show that the algorithm outperforms TDM5, MADAC, and other algorithms in accuracy, recall, and F1 value, reducing the dimensionality of the text feature space and improving text clustering.

Key words: hybrid feature selection, dimension reduction, Differential Evolution (DE) algorithm, variance, mean median, text clustering
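
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, since the abstract gives no formulas. The following are assumptions made only for illustration: the mean-median score is taken as the absolute difference between a term's mean and median, the two filters are combined by the union of their top-k terms, the fitness is a placeholder built from cumulative term frequency times document frequency, the mutation uses the standard two-difference-vector form with the current population best standing in for the locally optimal feature, and continuous vectors are binarized at 0.5. All function names and parameters (`stage1_filter`, `fitness`, `mutate`, `two_stage_select`, `k`, `F`, `CR`) are hypothetical.

```python
import numpy as np


def stage1_filter(X, k):
    """Stage 1 (assumed form): keep the union of the top-k terms by variance
    and the top-k terms by mean-median score. X is a documents x terms matrix."""
    variance = X.var(axis=0)
    mean_median = np.abs(X.mean(axis=0) - np.median(X, axis=0))
    top_var = np.argsort(variance)[::-1][:k]
    top_mm = np.argsort(mean_median)[::-1][:k]
    return np.union1d(top_var, top_mm)  # sorted, unique column indices


def fitness(mask, X):
    """Placeholder fitness (assumption): average of cumulative term frequency
    times document frequency over the selected terms."""
    selected = np.asarray(mask, dtype=bool)
    if not selected.any():
        return 0.0
    tf = X.sum(axis=0)        # cumulative term frequency per term
    df = (X > 0).sum(axis=0)  # number of documents containing the term
    return float((tf[selected] * df[selected]).mean())


def mutate(pop, i, best, F, rng):
    """Mutation with two difference vectors plus a pull toward the best
    individual (a standard DE strategy, used here as a stand-in)."""
    others = [j for j in range(len(pop)) if j != i]
    r1, r2, r3, r4 = rng.choice(others, size=4, replace=False)
    x = pop[i]
    return (x + F * (best - x)
            + F * (pop[r1] - pop[r2])
            + F * (pop[r3] - pop[r4]))


def two_stage_select(X, k=50, pop_size=20, generations=30, F=0.5, CR=0.9, seed=0):
    """Run the stage-1 filter, then a simple DE search over the kept terms.
    Requires pop_size >= 5 so that four distinct partners can be drawn."""
    rng = np.random.default_rng(seed)
    kept = stage1_filter(X, k)
    Xk = X[:, kept]
    d = Xk.shape[1]
    pop = rng.random((pop_size, d))  # continuous genes, binarized at 0.5
    scores = np.array([fitness(p > 0.5, Xk) for p in pop])
    for _ in range(generations):
        best = pop[scores.argmax()].copy()
        for i in range(pop_size):
            v = mutate(pop, i, best, F, rng)
            cross = rng.random(d) < CR
            cross[rng.integers(d)] = True  # take at least one gene from v
            trial = np.clip(np.where(cross, v, pop[i]), 0.0, 1.0)
            s = fitness(trial > 0.5, Xk)
            if s >= scores[i]:  # greedy one-to-one selection
                pop[i], scores[i] = trial, s
    return kept[pop[scores.argmax()] > 0.5]  # indices into the original terms
```

Calling `two_stage_select(X)` on a documents x terms count matrix returns the column indices of the selected terms. The placeholder fitness favors very small subsets, so a real experiment would need the fitness function and mutation operator defined in the paper itself.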

CLC number: TP391

XIAO Xiaoli, WU Yao, ZHOU Xiling, LIAO Zhuofan. Two-stage Text Feature Selection Algorithm Based on Differential Evolution[J]. Computer Engineering, 2019, 45(2): 303-309, 314.

https://www.ecice06.com/CN/Y2019/V45/I2/303
