
Computer Engineering ›› 2024, Vol. 50 ›› Issue (7): 360-371. doi: 10.19678/j.issn.1000-3428.0068187

• Development Research and Engineering Application •

Self-Supervised Learning Based on Graph Structural Clustering for Disease Diagnosis Method

Zhengkang ZHANG1, Dan YANG1,*, Tiezheng NIE2, Yue KOU2

  1. School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan 114051, Liaoning, China
  2. School of Computer Science and Engineering, Northeastern University, Shenyang 110169, Liaoning, China
  • Received: 2023-08-04 Published: 2024-07-15 Released: 2023-11-14
  • Contact: Dan YANG
  • Supported by:
    National Natural Science Foundation of China (62072084); National Natural Science Foundation of China (62072086); Scientific Research Project of the Education Department of Liaoning Province (LJKMZ20220646)

Abstract:

Graph self-supervised learning has recently been applied to disease diagnosis to alleviate the scarcity of labeled medical data and the cost of manual annotation. However, the performance of existing graph self-supervised learning methods relies heavily on high-quality positive and negative samples, which limits the flexibility and generalizability of disease diagnosis. Moreover, patients' multimodal data are not fully exploited when medical heterogeneous attributed graphs are constructed, which degrades diagnostic performance. This study therefore proposes SC4DD, a self-supervised learning framework for disease diagnosis based on structural clustering of a medical heterogeneous attributed graph. The framework builds the medical heterogeneous attributed graph from patients' structured data and unstructured clinical text summaries, and generates pseudo-labels for nodes by running a structural clustering algorithm on the graph. Because different meta-paths differ in their importance for learning patient embeddings, and different modalities of medical data differ in their influence on the diagnosis, a heterogeneous Graph Neural Network (GNN) with an attention mechanism is used as the encoder, and the pseudo-labels serve as self-supervised signals that help the encoder learn the attention coefficients and the patient embeddings. Experimental results on the MIMIC-Ⅲ dataset show that SC4DD outperforms the baseline methods and effectively improves disease diagnosis performance. In particular, compared with HeCo, the best-performing baseline, SC4DD improves Macro-F1 by 1.46%, 0.97%, and 0.94% and Micro-F1 by 0.91%, 0.84%, and 0.52% with 2%, 3%, and 4% of nodes labeled, respectively.

Key words: disease diagnosis, Electronic Medical Records (EMR), graph self-supervised learning, Graph Neural Network (GNN), medical heterogeneous attributed graph
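
The abstract reports both Macro-F1 and Micro-F1, which can diverge when diagnosis classes are imbalanced. The following is a minimal sketch of how the two averages differ, assuming a single-label multi-class setting; the function names `f1` and `macro_micro_f1` and the toy labels are illustrative and are not taken from the paper or its evaluation code.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted wrongly
            fn[t] += 1      # class t was missed
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so every sample weighs the same.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


if __name__ == "__main__":
    # Toy, imbalanced example (hypothetical labels): class 1 is rare.
    y_true = [0, 0, 0, 0, 1, 1]
    y_pred = [0, 0, 0, 0, 0, 1]
    macro, micro = macro_micro_f1(y_true, y_pred)
    print(f"Macro-F1 = {macro:.4f}, Micro-F1 = {micro:.4f}")
```

Because Macro-F1 weights every class equally, an error on a rare class lowers it more than it lowers Micro-F1, which is one reason to report both scores.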