Computer Engineering ›› 2019, Vol. 45 ›› Issue (2): 303-309, 314. doi: 10.19678/j.issn.1000-3428.0049701

• Development Research and Engineering Application •

Two-stage Text Feature Selection Algorithm Based on Differential Evolution

XIAO Xiaoli(a,b), WU Yao(a,b), ZHOU Xiling(a,b), LIAO Zhuofan(a,b)

  1. a. College of Computer and Communication Engineering; b. Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, Changsha University of Science and Technology, Changsha 410114, China
  • Received: 2017-12-14 Online: 2019-02-15 Published: 2019-02-15
  • About the authors: XIAO Xiaoli (born 1965), female, professor, whose main research interests are data mining, network security, mobile communication, and databases; WU Yao, master's student; ZHOU Xiling, master's degree; LIAO Zhuofan, Ph.D.
  • Funding:

    National Natural Science Foundation of China (61402056).

Abstract:

To reduce the dimensionality of the text feature space and improve the efficiency of data mining, a two-stage text feature selection algorithm is proposed. In the first stage, the variance and mean-median methods are combined to build a highly relevant feature subset, which performs an initial dimension reduction and serves as the initial feature population of the differential evolution algorithm. In the second stage, the differential evolution algorithm is improved: a fitness function is designed from the cumulative term frequency and the document frequency of the feature words, and multiple feature difference vectors together with the locally optimal features are introduced into the mutation operation. This increases the perturbation of the feature subsets and accelerates the convergence of the differential evolution algorithm, yielding the optimal feature subset. Experiments on the WebKB and Reuters-21578 datasets show that the algorithm outperforms algorithms such as TDM5 and MADAC in accuracy, recall, and F1 value, reducing the dimensionality of the text feature space and improving text clustering performance.

Key words: hybrid feature selection, dimension reduction, Differential Evolution (DE) algorithm, variance, mean median, text clustering
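
The two stages described in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption rather than the authors' method: the mean-median score is taken as |mean - median| of each term column (a common definition), the two rankings are combined by union, and the mutation uses the standard DE/best/2 scheme as a stand-in for "multiple difference vectors plus the locally optimal features". The function names `stage_one_filter` and `mutate_best_2` and all parameters are hypothetical.

```python
import random
from statistics import mean, median, pvariance


def stage_one_filter(doc_term, keep):
    """Return sorted term indices kept by the variance or mean-median score.

    doc_term: list of documents, each a list of term weights (one per term).
    keep: number of top-scoring terms taken from each criterion.
    Assumption: the two rankings are combined by union.
    """
    columns = list(zip(*doc_term))  # one tuple of weights per term
    var_score = [pvariance(col) for col in columns]
    # Assumed mean-median score: absolute gap between mean and median.
    mm_score = [abs(mean(col) - median(col)) for col in columns]

    def top(scores):
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        return set(order[:keep])

    return sorted(top(var_score) | top(mm_score))


def mutate_best_2(population, best, target_index, scale, rng):
    """Standard DE/best/2 mutation: best + F*(a - b) + F*(c - d).

    population: list of real-valued vectors (needs at least 5 individuals).
    best: the best vector found so far, used as the base vector.
    target_index: index of the individual being mutated (excluded from sampling).
    scale: the DE scale factor F.
    rng: a random.Random instance, passed in for reproducibility.
    """
    others = [i for i in range(len(population)) if i != target_index]
    a, b, c, d = (population[i] for i in rng.sample(others, 4))
    return [
        best[k] + scale * (a[k] - b[k]) + scale * (c[k] - d[k])
        for k in range(len(best))
    ]


if __name__ == "__main__":
    # Toy document-term matrix: 4 documents, 4 terms (made-up weights).
    doc_term = [
        [0, 1, 5, 2],
        [0, 1, 0, 2],
        [0, 1, 0, 2],
        [9, 1, 1, 2],
    ]
    print(stage_one_filter(doc_term, keep=2))  # constant terms 1 and 3 are dropped

    rng = random.Random(0)
    population = [[rng.random() for _ in range(3)] for _ in range(6)]
    print(mutate_best_2(population, population[0], 1, 0.5, rng))
```

In this sketch the first stage only removes terms whose weights barely vary across documents. The fitness function built from cumulative term frequency and document frequency is left out because the abstract does not define it.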

CLC number: