
Computer Engineering (计算机工程) ›› 2024, Vol. 50 ›› Issue (4): 160-167. doi: 10.19678/j.issn.1000-3428.0067700

• Artificial Intelligence and Pattern Recognition •

Multimodal Relation Extraction Based on Bidirectional Attention Mechanism

WU Haipeng1,2, QIAN Yurong1,2,3, LENG Hongyong2,3

  1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, Xinjiang, China;
  2. Key Laboratory of Signal Detection and Processing of Xinjiang Uygur Autonomous Region, Urumqi 830046, Xinjiang, China;
  3. College of Software, Xinjiang University, Urumqi 830046, Xinjiang, China
  • Received: 2023-05-24 Revised: 2023-07-19 Published: 2023-08-14
  • Corresponding author: WU Haipeng, E-mail: 1254812107@qq.com
  • Funding: National Natural Science Foundation of China (61966035, 62266043); Major Special Project of the State Administration of Science, Technology and Industry for National Defense (95-Y50G37-9001-22/23).

Abstract: Conventional relation extraction methods identify the relationships between entity pairs from plain text, whereas multimodal relation extraction methods use information from multiple modalities to assist the task. Existing multimodal relation extraction models are easily disturbed by redundant information when processing image data. To address this issue, this study proposes a multimodal relation extraction model based on a bidirectional attention mechanism. First, Bidirectional Encoder Representations from Transformers (BERT) and a scene graph generation model are used to extract textual and visual semantic features, respectively. Next, a bidirectional attention mechanism aligns the two modalities in both directions, from image to text and from text to image, enabling bidirectional information exchange between them. This alignment assigns lower weights to redundant information in the image, which weakens its interference with the semantic representation of the text and mitigates its negative effect on the relation extraction result. Finally, the aligned textual and visual feature representations are concatenated to form a fused text-image feature, and a multilayer perceptron (MLP) computes probability scores over all relation classes and outputs the predicted relation. Experimental results on the Multimodal dataset for Neural Relation Extraction (MNRE) show that the model achieves a precision, recall, and F1 score of 65.53%, 69.21%, and 67.32%, respectively, all clearly higher than those of the baseline models, indicating that the model extracts relations effectively.

Key words: relation extraction, social network, redundant information, multimodal data, bidirectional attention mechanism
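
The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Plain Python lists stand in for the BERT token features and the scene-graph object features, and the sketch assumes scaled dot-product attention with a residual connection, mean pooling, and a single linear layer in place of the MLP. All function names (`attend`, `fuse_and_classify`) and toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys):
    """Scaled dot-product attention; the keys double as values.

    Returns one aligned vector per query (query plus attended context,
    an assumed residual connection) and the attention weights.
    """
    scale = math.sqrt(len(keys[0]))
    aligned, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        ctx = [sum(w_j * k[d] for w_j, k in zip(w, keys)) for d in range(len(q))]
        aligned.append([a + b for a, b in zip(q, ctx)])
        weights.append(w)
    return aligned, weights


def mean_pool(vectors):
    """Average a list of vectors into a single vector."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def fuse_and_classify(text_feats, image_feats, w_out, b_out):
    # Text -> image: each token weighs the visual objects by relevance.
    text_aligned, t2i = attend(text_feats, image_feats)
    # Image -> text: each visual object weighs the tokens by relevance.
    image_aligned, i2t = attend(image_feats, text_feats)
    # Concatenate the pooled, aligned representations of both modalities.
    fused = mean_pool(text_aligned) + mean_pool(image_aligned)
    # Assumption: one linear layer stands in for the MLP classifier.
    logits = [dot(row, fused) + b for row, b in zip(w_out, b_out)]
    return softmax(logits), t2i, i2t


# Toy inputs, made up for illustration: 2 token vectors, 3 visual objects.
text_feats = [[1.0, 0.0], [0.8, 0.2]]
image_feats = [[4.0, 0.0], [0.0, 4.0], [3.0, 1.0]]
# Hypothetical classifier weights for 3 relation classes (fused dim = 4).
w_out = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.2, -0.1, 0.4], [-0.3, 0.1, 0.2, 0.1]]
b_out = [0.0, 0.0, 0.0]

probs, t2i, i2t = fuse_and_classify(text_feats, image_feats, w_out, b_out)
print("relation probabilities:", probs)
print("weights of token 0 over visual objects:", t2i[0])
```

In this toy input the second visual object is orthogonal to the first token, so it receives the smallest text-to-image weight. This mirrors, in a very simplified way, how attention can down-weight image content that is irrelevant to the text.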

CLC number: