Computer Engineering ›› 2021, Vol. 47 ›› Issue (12): 316-320. doi: 10.19678/j.issn.1000-3428.0059822

• Development Research and Engineering Application •

Survival Prediction Analysis of Cancer Patients Oriented to Unbalanced Data

MIAO Lizhi1,2, BAI Ruisimeng3, LIU Chengliang1, ZHAI Yuehao3

  1. College of Geographical and Biological Information, Nanjing University of Posts and Telecommunications, Nanjing 210023, China;
  2. Smart Health Big Data Analysis and Location Services Engineering Laboratory of Jiangsu Province, Nanjing University of Posts and Telecommunications, Nanjing 210023, China;
  3. College of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • Received: 2020-10-23 Revised: 2021-01-14 Published: 2021-01-22
  • About the authors: MIAO Lizhi (born 1981), male, associate professor, Ph.D.; main research interest: spatiotemporal big data analysis and mining. BAI Ruisimeng, LIU Chengliang, and ZHAI Yuehao are master's students.
  • Funding:
    Jiangsu Province "Double Innovation Doctor" Program (CZ032SC20025).

Abstract: Survival analysis of cancer patients often suffers from unbalanced data sets and noisy samples. To address this problem, this paper proposes RENN-SMOTE-SVM, a survival prediction algorithm for cancer patients built on the RENN and SMOTE algorithms. Based on the nearest neighbor rule, RENN reduces the number of noisy samples in the majority class, and SMOTE linearly interpolates between minority class samples to increase their number, yielding a balanced data set. The algorithm is evaluated by predicting patient survival on an unbalanced breast cancer data set from the American Cancer Database. Experimental results show that RENN-SMOTE-SVM achieves better classification and prediction performance than five commonly used algorithms, including SVM and Tomeklinks-SVM, with an accuracy of 0.883, an F1-score of 0.904, and a G-means value of 0.779.

Key words: disease prediction, machine learning, data analysis, unbalanced data, SMOTE algorithm
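
The two resampling steps named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`k_nearest`, `renn_majority`, `smote`), the choice of k = 3, the majority-vote removal rule, and the toy 2-D data are all assumptions made for illustration, and the final SVM training step is omitted.

```python
import math
import random
from collections import Counter


def k_nearest(point, data, k, skip_index=None):
    """Indices of the k points in `data` closest to `point` (Euclidean)."""
    order = sorted(
        (i for i in range(len(data)) if i != skip_index),
        key=lambda i: math.dist(point, data[i]),
    )
    return order[:k]


def renn_majority(X, y, majority, k=3, max_rounds=100):
    """Repeated edited nearest neighbours, applied to the majority class only.

    A majority sample is dropped when most of its k neighbours carry a
    different label; passes repeat until a pass removes nothing.
    Assumes binary labels and an odd k, so votes cannot tie.
    """
    X, y = list(X), list(y)
    for _ in range(max_rounds):
        keep = []
        for i, (xi, yi) in enumerate(zip(X, y)):
            if yi != majority:
                keep.append(i)  # minority samples are never edited out
                continue
            votes = Counter(y[j] for j in k_nearest(xi, X, k, skip_index=i))
            if votes.most_common(1)[0][0] == yi:
                keep.append(i)
        if len(keep) == len(X):
            break
        X, y = [X[i] for i in keep], [y[i] for i in keep]
    return X, y


def smote(X, y, minority, n_new, k=3, seed=0):
    """Append n_new synthetic minority samples built by linear interpolation.

    Assumes at least two minority samples are available.
    """
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority_pts))
        base = minority_pts[i]
        j = rng.choice(k_nearest(base, minority_pts, k, skip_index=i))
        neighbour = minority_pts[j]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return list(X) + synthetic, list(y) + [minority] * n_new


# Toy data (invented): label 0 is the majority class, label 1 the minority.
# The point (5.1, 5.0) is a majority sample sitting inside the minority cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (0.8, 0.2),
     (5.1, 5.0), (5, 5), (5, 6), (6, 5), (6, 6)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

X_clean, y_clean = renn_majority(X, y, majority=0)
counts = Counter(y_clean)
X_bal, y_bal = smote(X_clean, y_clean, minority=1, n_new=counts[0] - counts[1])
print(Counter(y_bal))
```

On this toy set the editing pass removes only the majority point embedded in the minority cluster, and interpolation then adds minority samples until the two classes are the same size. A classifier such as an SVM would then be trained on `X_bal`, `y_bal`.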
