Computer Engineering, 2018, Vol. 44, Issue (11): 1-6. doi: 10.19678/j.issn.1000-3428.0048265

• Advanced Computing and Data Processing •

Hybrid Feature Selection Method for Tumor Gene Based on Spark

WANG Lili1,2, DENG Li1,2, YU Yue1,2, FEI Minrui1,2

  1. School of Mechatronics Engineering and Automation, Shanghai University, Shanghai 200072, China
  2. Shanghai Key Laboratory of Power Station Automation Technology, Shanghai 200072, China
  • Received: 2017-08-07 Online: 2018-11-15 Published: 2018-11-15
  • About the authors: WANG Lili (born 1994), female, master's student, research interests: machine learning and distributed computing; DENG Li, associate professor; YU Yue, master's student; FEI Minrui, professor.
  • Funding:

    Key Project of the Science and Technology Commission of Shanghai Municipality (14DZ1206302)

Abstract: The development of microarray technology has caused tumor gene data to grow rapidly. To handle this data and perform feature selection on it, a hybrid feature selection method built on the Spark distributed computing framework is proposed, combining integrated (ensemble) feature selection with hybrid feature selection. The F-score feature selection method is first used to remove irrelevant features as a preliminary selection step. The optimal feature subset is then obtained by combining three methods: F-score, multi-class support vector machine recursive feature elimination, and random-forest-based feature selection. A support vector machine is used to classify the resulting feature subset. Experimental results show that the method reaches relatively high classification accuracy while selecting relatively few genes.

Key words: tumor gene data, Spark distributed computing framework, hybrid feature selection, integrated feature selection, classification

CLC number:
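The abstract names F-score filtering as the preliminary step that discards irrelevant genes before the three-method combination. The sketch below illustrates that one step only, and it rests on assumptions not stated on this page: it uses a common multi-class generalization of the F-score (sum of squared distances between each class mean and the overall mean, divided by the sum of within-class sample variances), it runs on a single machine in plain Python rather than on Spark, and the names `f_scores`, `select_top_k`, and the toy matrix are hypothetical. It is not the authors' implementation.

```python
from statistics import mean, variance


def f_scores(X, y):
    """Return one F-score per feature (column of X) for class labels y.

    Assumed definition: between-class scatter of the class means divided by
    the summed within-class sample variances. Each class needs >= 2 samples.
    """
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall = mean(column)
        between = 0.0
        within = 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            between += (mean(values) - overall) ** 2
            within += variance(values)
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class variance: perfectly separating or constant feature.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def select_top_k(X, y, k):
    """Return the sorted column indices of the k highest-scoring features."""
    scores = f_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy expression matrix: 4 samples (rows) x 3 genes (columns), 2 classes.
X = [
    [1.0, 5.0, 0.3],
    [1.2, 4.0, 0.9],
    [3.0, 5.5, 0.1],
    [3.2, 4.5, 0.8],
]
y = [0, 0, 1, 1]

scores = f_scores(X, y)
selected = select_top_k(X, y, 2)
```

In this toy data the first gene separates the two classes far more cleanly than the other two, so it receives the largest score. Because each gene is scored independently of the others, the per-feature loop is the kind of work that could be spread across workers in a distributed framework, though how the paper partitions it on Spark is not described here.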