Computer Engineering ›› 2024, Vol. 50 ›› Issue (1): 296-305. doi: 10.19678/j.issn.1000-3428.0066550

• Development Research and Engineering Application •

Research on Breast Cancer Survival Prediction Based on Multi-Modal Learning

Guangshuo CAO*, Ruizhang HUANG, Yanping CHEN, Yongbin QIN

  1. State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, Guizhou, China
  • Received: 2022-12-19 Online: 2024-01-15 Published: 2024-01-11
  • Contact: Guangshuo CAO
  • Supported by:
    National Natural Science Foundation of China (62066007)

Abstract:

Breast cancer is one of the most common cancers, and predicting 5-year survival from patient genomics data is a common task in breast cancer research. Genomics data from breast cancer patients are noisy and heterogeneous, contain long sequences, and suffer from an imbalance between positive and negative samples. To address these problems, MLBSP, a multi-modal learning model for 5-year survival prediction in breast cancer prognosis, is proposed. A single-modal module extracts effective information from four data modalities: gene expression, cumulative gene mutation counts, single nucleotide variations, and copy number variations. To reduce the impact of single-modality heterogeneity on the global features, a multi-modal module built on depthwise separable convolution and multi-head self-attention then fuses these features and captures the global information in each patient's multi-modal genomic data. Focal Loss is used to handle the imbalance between positive and negative samples and to guide the 5-year survival prediction. Experimental results show that MLBSP reaches an Area Under the Curve (AUC) of 91.18%, 71.49%, and 77.37% on the real-world breast cancer datasets BRCA Cell, METABRIC, and PanCancer Atlas, respectively. Compared with mainstream cancer survival prediction methods such as XGBoost and random forest, these are average improvements of 17.69%, 6.51%, and 10.24% on the respective datasets. In addition, pathway analysis identified biomarkers such as SLC8A3 and TP53, which further supports the novelty and effectiveness of the multi-modal approach.

Key words: breast cancer, genomics, deep learning, depthwise separable convolution, multi-head self-attention, multi-modal learning
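
The abstract names Focal Loss as the remedy for the imbalance between positive and negative samples but does not give its form or hyperparameters. As a minimal sketch, assuming the commonly used binary formulation FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), the loss could look as follows. The function names and the default values gamma=2.0 and alpha=0.25 are illustrative assumptions, not settings reported for MLBSP, and which survival outcome is treated as the positive class is not stated in the abstract.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for a single sample (illustrative sketch).

    p     -- predicted probability of the positive class, in [0, 1]
    y     -- ground-truth label, 0 or 1
    gamma -- focusing parameter (assumed default; gamma=0 disables focusing)
    alpha -- weight of the positive class (assumed default)
    """
    # p_t is the probability the model assigns to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0) on extreme predictions.
    p_t = min(max(p_t, eps), 1.0)
    # (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(probs, labels, **kwargs):
    """Average focal loss over a batch of predictions."""
    losses = [binary_focal_loss(p, y, **kwargs) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # A confidently correct prediction contributes far less than a wrong one.
    print(binary_focal_loss(0.9, 1))
    print(binary_focal_loss(0.1, 1))
```

With gamma=0 and alpha=0.5 the expression reduces to half the ordinary cross-entropy, which makes the role of the focusing term easy to see: raising gamma down-weights easy samples so that hard, often minority-class, samples dominate the gradient.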