
Computer Engineering (计算机工程)


Self-Supervised Learning Based on Graph Structural Clustering for Disease Diagnosis

  • Published: 2023-11-14

Abstract: In recent years, graph self-supervised learning has been applied to disease diagnosis tasks to alleviate the scarcity of labeled medical data and the burden of manual annotation. However, the performance of existing graph self-supervised learning methods relies heavily on high-quality positive and negative samples, which limits the flexibility and generalizability of disease diagnosis. Moreover, patients' multimodal data are not fully exploited when constructing medical attributed heterogeneous graphs, which degrades diagnostic performance. We therefore propose SC4DD (self-supervised learning based on structural clustering of medical attributed heterogeneous graph for disease diagnosis). The framework builds a medical attributed heterogeneous graph from patients' structured data and unstructured clinical text summaries, and generates pseudo-labels for nodes by running a structural clustering algorithm on the graph. Because different meta-paths contribute unequally to learning patient representations, and different modalities of medical data influence diagnosis results to different degrees, a heterogeneous graph neural network with an attention mechanism is used as the encoder. The pseudo-labels serve as self-supervised signals that help the encoder learn the attention coefficients and the patient representations. Experimental results on the MIMIC-III dataset show that SC4DD outperforms the other baseline methods and effectively improves disease diagnosis performance. In particular, compared with the best-performing baseline, HeCo, SC4DD improves Macro-F1 by 1.46%, 0.97%, and 0.94%, and Micro-F1 by 0.91%, 0.84%, and 0.52%, respectively, across the different percentages of labeled nodes.
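The pseudo-label step described in the abstract can be illustrated with a small sketch. The abstract does not say which structural clustering algorithm SC4DD uses, so the code below is an assumption: it borrows the SCAN-style structural similarity (shared closed neighbourhood, normalised by neighbourhood sizes) and, as a further simplification, labels the connected components that remain after dropping low-similarity edges. The function names, the threshold `eps`, and the toy homogeneous graph are hypothetical and stand in for the medical attributed heterogeneous graph.

```python
from math import sqrt


def structural_similarity(adj, u, v):
    """SCAN-style similarity between two nodes.

    Uses closed neighbourhoods (each node counts as its own neighbour).
    """
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / sqrt(len(nu) * len(nv))


def pseudo_labels(adj, eps=0.7):
    """Assign a cluster id (pseudo-label) to every node.

    Assumption, not the paper's algorithm: traverse only edges whose
    structural similarity is at least eps and label each component.
    """
    labels = {}
    next_id = 0
    for start in sorted(adj):
        if start in labels:
            continue
        labels[start] = next_id
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in labels and structural_similarity(adj, u, v) >= eps:
                    labels[v] = next_id
                    stack.append(v)
        next_id += 1
    return labels


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {n: set() for n in range(6)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

labels = pseudo_labels(adj)
print(labels)
```

On this toy graph the bridge edge has similarity 0.5 and is cut, so the two triangles receive different pseudo-labels. In a framework like the one described, such labels would presumably act as classification targets for the attention-based encoder in place of manually annotated diagnoses; that training step is not shown here.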