
Computer Engineering ›› 2025, Vol. 51 ›› Issue (12): 304-310. doi: 10.19678/j.issn.1000-3428.0069544

• Graphics and Image Processing •


Assisted Diagnosis of Cervical Intraepithelial Neoplasia Based on Multimodal Fusion

FENG Saisai1, GE Dongfeng2,*, LI Tao2, LIU Yijing2, JI Zhihang1, WANG Lin1, ZHANG Mingchuan1

  1. School of Information Engineering, Henan University of Science and Technology, Luoyang 471000, Henan, China
    2. The First Affiliated Hospital of Henan University of Science and Technology, Luoyang 471000, Henan, China
  • Received: 2024-03-12 Revised: 2024-07-18 Online: 2025-12-15 Published: 2024-10-10
  • Contact: GE Dongfeng
  • Supported by: National Natural Science Foundation of China (62102134); Henan Province Science and Technology Research Project (232102210048, 232102211008, 232102210028)


Abstract:

In recent years, deep learning has made significant progress in the field of medical image processing. However, most existing methods rely on image classification alone to assist physicians in pathological diagnosis, which can lack interpretability. To address this problem, this study proposes a multimodal feature-fusion classification model for Cervical Intraepithelial Neoplasia (CIN), in which a patient's pathology image serves as the image-modality data and the corresponding pathology report description serves as the text-modality data. When handling information from different modalities, most current multimodal models simply concatenate the features of each modality, which often ignores the connections between modalities. To promote interaction between cross-modal information, this study adopts an improved non-local attention mechanism that enhances the model's ability to understand the data comprehensively. In addition, because in practice not every image has a corresponding pathology report, a text-discarding encoding strategy is used during training so that the trained model can still achieve high classification accuracy when text information is missing. Experimental results show that the model performs well in CIN classification, improving classification accuracy by 7 to 9 percentage points over traditional unimodal methods such as ResNet, ConvNeXt, and MobileViT.
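To make the abstract's two mechanisms concrete, below is a minimal PyTorch sketch, not the authors' implementation: a non-local-style cross-modal attention block in which image features attend to report features, plus a text-"dropout" module that randomly masks the report during training. All class names, feature dimensions, and the dropout probability are illustrative assumptions; the exact improved non-local formulation used in the paper is not specified in the abstract.

import torch
import torch.nn as nn

class CrossModalNonlocalAttention(nn.Module):
    # Non-local (scaled dot-product) attention in which flattened image
    # features act as queries and encoded report tokens act as keys/values,
    # so each spatial position can pull in related text information.
    def __init__(self, img_dim=512, txt_dim=256, embed_dim=256):
        super().__init__()
        self.query = nn.Linear(img_dim, embed_dim)
        self.key = nn.Linear(txt_dim, embed_dim)
        self.value = nn.Linear(txt_dim, img_dim)
        self.scale = embed_dim ** -0.5

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, img_dim); txt_feats: (B, N_txt, txt_dim)
        q = self.query(img_feats)
        k = self.key(txt_feats)
        v = self.value(txt_feats)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return img_feats + attn @ v  # residual add, as in the non-local block

class TextDropout(nn.Module):
    # During training, randomly replace a sample's text features with a
    # learned "missing report" embedding so the classifier still works when
    # no pathology report accompanies the image at inference time.
    def __init__(self, txt_dim=256, p_drop=0.3):
        super().__init__()
        self.p_drop = p_drop
        self.missing = nn.Parameter(torch.zeros(1, 1, txt_dim))

    def forward(self, txt_feats):
        if not self.training:
            return txt_feats
        batch = txt_feats.size(0)
        drop = torch.rand(batch, 1, 1, device=txt_feats.device) < self.p_drop
        return torch.where(drop, self.missing.to(txt_feats.dtype), txt_feats)

# Toy usage: 49 image patches, 32 report tokens
fusion = CrossModalNonlocalAttention()
text_drop = TextDropout()
img = torch.randn(4, 49, 512)
txt = text_drop(torch.randn(4, 32, 256))
fused = fusion(img, txt)  # (4, 49, 512), ready for a classification head

Masking the entire report for a sample, rather than individual tokens, mirrors the scenario the abstract targets, where an image arrives with no report at all; the learned placeholder gives the fusion block a consistent input in that case.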

Key words: multimodal fusion, attention mechanism, pathology-assisted diagnosis, deep learning, Cervical Intraepithelial Neoplasia (CIN)