Computer Engineering ›› 2019, Vol. 45 ›› Issue (8): 66-74. doi: 10.19678/j.issn.1000-3428.0052286

• Advanced Computing and Data Processing •

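A Fuzzy Clustering Algorithm for SNP Selection

ZHANG Bo1, ZHOU Conghua1, ZHANG Fuquan2, ZHANG Ting3, JIANG Yueming4

  1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China;
  2. Wuxi Mental Health Center, Wuxi, Jiangsu 214151, China;
  3. Wuxi Maternity and Child Health Care Hospital, Wuxi, Jiangsu 214002, China;
  4. Wuxi No.5 People's Hospital, Wuxi, Jiangsu 214073, China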
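  • Received: 2018-08-03 Revised: 2018-10-08 Online: 2019-08-15 Published: 2019-08-08
  • About the authors: ZHANG Bo (born 1993), male, master's student, main research interest: data mining; ZHOU Conghua, professor, Ph.D.; ZHANG Fuquan, postdoctoral researcher; ZHANG Ting and JIANG Yueming, Ph.D.
  • Supported by:
    Social Development Project of the Jiangsu Provincial Key Research and Development Program (BE2016630, BE2017628); Scientific Research Project of the Wuxi Municipal Health and Family Planning Commission (Z201603).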

Abstract: When selecting Single Nucleotide Polymorphisms (SNPs) from high-dimensional genetic data with few samples, the selected SNP subset should represent the information in all SNPs as fully as possible while reducing the data dimension. To this end, an improved method named GN-FCM is proposed on the basis of the Fuzzy C-Means (FCM) algorithm. An SNP weight factor is introduced to quantify differences in the importance of SNP sites, and a neighborhood regularization term for key SNPs is added to the loss function of fuzzy clustering in order to mine the correlation between highly important SNPs and the other SNPs in the same neighborhood. Experimental results show that GN-FCM converges well. Compared with the DW-FCM algorithm, the SNP subsets it constructs improve accuracy under Support Vector Machine (SVM), Decision Tree (DT) and Naïve Bayes (NB) classification by 5.73%, 3.40% and 3.79% respectively, and improve the F1 value by 4.01%, 3.20% and 2.22% respectively.

Key words: Single Nucleotide Polymorphism (SNP) selection, fuzzy clustering, feature selection, Support Vector Machine (SVM), Decision Tree (DT), Naïve Bayes (NB) classification
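
For readers unfamiliar with the baseline, the following is a minimal sketch of standard fuzzy C-means, the algorithm that GN-FCM builds on. It is not the authors' GN-FCM: the abstract does not give the exact form of the SNP weight factor or of the neighborhood regularization term, so the optional `weights` argument here is only an assumed, generic per-point weighting, and the regularization term is omitted entirely. The function name `fuzzy_c_means` and all parameter names are hypothetical. Each row of `X` is taken to be one SNP described by its genotype values across samples, which is also an assumption.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, weights=None, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy C-means with optional per-point weights (illustrative only).

    X       : array of shape (n_points, n_dims); one row per item to cluster.
    c       : number of clusters.
    m       : fuzzifier, must be > 1.
    weights : optional per-point weights (an assumed stand-in, not the
              paper's SNP weight factor).
    Returns (centers, U), where U[i, k] is the membership of point i in cluster k.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)

    # Random initial memberships; each row is normalized to sum to 1.
    rng = np.random.default_rng(seed)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # Center update: mean of the points weighted by w_i * u_ik^m.
        Um = (U ** m) * w[:, None]
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]

        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero

        # Membership update: u_ik proportional to d_ik^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)

        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

# Toy usage: two well-separated groups of points.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centers_demo, U_demo = fuzzy_c_means(X_demo, c=2)
hard_labels = U_demo.argmax(axis=1)  # most likely cluster for each point
```

One plausible way to turn such a clustering into a feature subset is to keep, for each cluster, the item with the highest membership (`U.argmax(axis=0)`); whether GN-FCM selects representatives this way is not stated in the abstract.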
