
Computer Engineering ›› 2023, Vol. 49 ›› Issue (12): 88-95. doi: 10.19678/j.issn.1000-3428.0066938

• Artificial Intelligence and Pattern Recognition •

Chinese Cross-modal Entity Alignment Method Based on Multi-modal Knowledge Graph

Huan WANG1, Lijuan SONG1,2,*, Fang DU1,2

  1. School of Information Engineering, Ningxia University, Yinchuan 750021, China
    2. Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-founded by the Ningxia Hui Autonomous Region and the Ministry of Education, Yinchuan 750021, China
  • Received: 2023-02-15  Online: 2023-12-15  Published: 2023-05-25
  • Corresponding author: Lijuan SONG
  • About the authors:

    Huan WANG (born 1999), female, master's degree candidate; her research interests include knowledge graphs and intelligent big data computing

    Fang DU, professor, Ph.D.

  • Funding:
    National Natural Science Foundation of China (62062058); Key Research and Development Program of Ningxia Hui Autonomous Region (2021BEE03013)

Chinese Cross-modal Entity Alignment Method Based on Multi-modal Knowledge Graph

Huan WANG1, Lijuan SONG1,2,*, Fang DU1,2   

  1. School of Information Engineering, Ningxia University, Yinchuan 750021, China
    2. Collaborative Innovation Center for Ningxia Big Data and Artificial Intelligence Co-founded by the Ningxia Hui Autonomous Region and the Ministry of Education, Yinchuan 750021, China
  • Received:2023-02-15 Online:2023-12-15 Published:2023-05-25
  • Contact: Lijuan SONG

Abstract:

The emergence of interactive tasks across multi-modal data places high demands on the comprehensive use of knowledge from different modalities, which has given rise to multi-modal knowledge graphs. When constructing a multi-modal knowledge graph, it is particularly important to determine whether an image entity and a text entity refer to the same object, which requires aligning Chinese cross-modal entities. To address this problem, a Chinese cross-modal entity alignment method based on a multi-modal knowledge graph is proposed. Image information is introduced into the entity alignment task, and a single- and dual-stream interactive pre-trained language model (CCMEA) is designed for domain-specific, fine-grained images and Chinese text. Based on a self-supervised learning approach, visual and textual features are extracted with a visual encoder and a text encoder, fine-grained modeling is performed with a cross-encoder, and a contrastive learning method is finally used to compute the matching degree between image and text entities. Experimental results show that on the MUGE and Flickr30k-CN datasets, the Mean Recall (MR) of the CCMEA model improves by 3.20 and 11.96 percentage points, respectively, over the WukongViT-B baseline model, and MR reaches 94.3% on the self-built TEXTILE dataset. These results demonstrate that the proposed method can align Chinese cross-modal entities effectively, with high accuracy and practicality.

Key words: multi-modal, knowledge graph, entity alignment, self-supervision, textile industry

Abstract:

Interactive tasks involving multi-modal data place high demands on the comprehensive utilization of knowledge from different modalities, leading to the emergence of multi-modal knowledge graphs. When constructing these graphs, accurately determining whether image and text entities refer to the same object is particularly important, which requires entity alignment of Chinese cross-modal entities. To address this problem, a Chinese cross-modal entity alignment method based on a multi-modal knowledge graph is proposed. Image information is introduced into the entity alignment task, and a single- and dual-stream interactive pre-trained language model, namely CCMEA, is designed for domain-specific, fine-grained images and Chinese text. Using a self-supervised learning method, visual and textual features are extracted by a visual encoder and a text encoder, fine-grained modeling is performed by a cross-encoder, and a contrastive learning method is finally employed to compute the matching degree between image and text entities. The experimental results show that the Mean Recall (MR) of the CCMEA model improves by 3.20 and 11.96 percentage points compared with that of the WukongViT-B baseline model on the MUGE and Flickr30k-CN datasets, respectively. Furthermore, the model achieves an MR of 94.3% on the self-built TEXTILE dataset. These results demonstrate that the proposed method can effectively align Chinese cross-modal entities with high accuracy and practicality.
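As a rough illustration of the pipeline described in the abstract, the following is a minimal, self-contained PyTorch sketch of dual-stream contrastive image-text matching: a visual encoder and a text encoder produce normalized embeddings, and a symmetric contrastive (InfoNCE-style) loss pulls matched image-text pairs together. The tiny encoders, embedding size, and temperature are illustrative stand-ins, not the CCMEA architecture, and the cross-encoder re-ranking stage is omitted.

    # Minimal dual-stream contrastive image-text matching (illustrative only).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyImageEncoder(nn.Module):
        """Stand-in for a pretrained vision backbone (e.g. a ViT)."""
        def __init__(self, embed_dim=256):
            super().__init__()
            self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, embed_dim))

        def forward(self, images):  # images: (B, 3, 64, 64)
            return F.normalize(self.net(images), dim=-1)

    class TinyTextEncoder(nn.Module):
        """Stand-in for a pretrained Chinese text encoder (e.g. a BERT-style model)."""
        def __init__(self, vocab_size=21128, embed_dim=256):
            super().__init__()
            self.embed = nn.EmbeddingBag(vocab_size, embed_dim)

        def forward(self, token_ids):  # token_ids: (B, L)
            return F.normalize(self.embed(token_ids), dim=-1)

    def contrastive_matching_loss(img_emb, txt_emb, temperature=0.07):
        """Symmetric InfoNCE loss: matched image-text pairs lie on the diagonal."""
        logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix
        targets = torch.arange(img_emb.size(0))
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    if __name__ == "__main__":
        images = torch.randn(8, 3, 64, 64)
        tokens = torch.randint(0, 21128, (8, 32))
        loss = contrastive_matching_loss(TinyImageEncoder()(images), TinyTextEncoder()(tokens))
        print(float(loss))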

Key words: multi-modal, knowledge graph, entity alignment, self-supervision, textile industry
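
For reference, the Mean Recall (MR) figures quoted in the abstract are, in Chinese image-text retrieval benchmarks such as MUGE and Flickr30k-CN, commonly computed as the average of Recall@1, Recall@5, and Recall@10 over the ranked candidate list. The short NumPy sketch below assumes that convention; the paper's exact aggregation (e.g. over retrieval directions) may differ.

    # Mean Recall (MR) as the average of Recall@1/5/10 (assumed convention).
    import numpy as np

    def recall_at_k(similarity, k):
        """similarity[i, j]: score of query i against candidate j; ground truth is j == i."""
        topk = np.argsort(-similarity, axis=1)[:, :k]  # indices of the k best candidates
        hits = (topk == np.arange(similarity.shape[0])[:, None]).any(axis=1)
        return hits.mean()

    def mean_recall(similarity, ks=(1, 5, 10)):
        return float(np.mean([recall_at_k(similarity, k) for k in ks]))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        sim = rng.normal(size=(100, 100)) + 3.0 * np.eye(100)  # bias toward correct pairs
        print(mean_recall(sim))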