
Computer Engineering ›› 2026, Vol. 52 ›› Issue (3): 234-242. doi: 10.19678/j.issn.1000-3428.0069955

• Multimodal Information Fusion •

Multimodal Intent Recognition Based on Attention Modality Fusion

SU Jianhua, CHI Yunxian, XU Yunfeng, GAO Kai*

  1. School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050091, Hebei, China
  • Received: 2024-06-03 Revised: 2024-07-27 Online: 2026-03-15 Published: 2024-11-14
  • Contact: GAO Kai

  • About the authors:

    SU Jianhua (CCF student member) is a master's student; his research focuses on multimodal intent recognition.

    CHI Yunxian is a lecturer.

    XU Yunfeng is an associate professor.

    GAO Kai (corresponding author) is a professor and holds a Ph.D.

  • Funding:
    Natural Science Foundation of Hebei Province (F2022208006); Science Research Project of the Education Department of Hebei Province (QN2024196)

Abstract:

Intent recognition is an important task in natural language understanding. Previous research on intent recognition has primarily focused on single-modal intent recognition for specific tasks. However, in real-world scenarios, human intentions are complex and must be inferred by integrating information such as language, tone, expressions, and actions. Therefore, a novel attention-based multimodal fusion method is proposed to address intent recognition in real-world multimodal scenarios. To capture and integrate the long-range dependencies between different modalities, adaptively adjust the importance of each modality's information, and provide richer representations, a separate self-attention mechanism is applied to the features of each modality. By adding explicit modality identifiers to the features of each modality, the model can distinguish and effectively fuse information from different modalities, thereby enhancing its overall understanding and decision-making capabilities. Given the importance of textual information in cross-modal interactions, a multimodal fusion method based on a cross-attention mechanism is employed, with text as the primary modality and the other modalities assisting and guiding the interactions. This approach facilitates interactions among the textual, visual, and auditory modalities. Finally, experiments were conducted on the MIntRec and MIntRec2.0 benchmark datasets for multimodal intent recognition. The results show that the model outperforms existing multimodal learning methods in terms of accuracy, precision, recall, and F1 score, with an improvement of 0.1 to 0.5 percentage points over the current best baseline model.
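The pipeline the abstract describes (per-modality self-attention, explicit modality identifiers added to the features, then text-guided cross-attention fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the random features standing in for text/video/audio encodings, and the additive fusion of the cross-attention outputs are all illustrative assumptions; learnable projections and multi-head structure are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16
# toy per-modality feature sequences (stand-ins for encoder outputs):
# 6 text tokens, 4 video frames, 5 audio segments
text = rng.standard_normal((6, d))
video = rng.standard_normal((4, d))
audio = rng.standard_normal((5, d))

# explicit modality identifiers, added to every feature of that modality
# (random here; learnable embeddings in a real model)
mod_id = {m: rng.standard_normal((1, d)) for m in ("text", "video", "audio")}
text = text + mod_id["text"]
video = video + mod_id["video"]
audio = audio + mod_id["audio"]

# stage 1: a separate self-attention pass per modality
text_s = attention(text, text, text)
video_s = attention(video, video, video)
audio_s = attention(audio, audio, audio)

# stage 2: text-guided cross-attention — text features act as queries
# over the visual and auditory keys/values, and the results are fused
fused = text_s + attention(text_s, video_s, video_s) \
               + attention(text_s, audio_s, audio_s)
print(fused.shape)  # (6, 16): one fused vector per text token
```

Because the text features serve as the queries in both cross-attention calls, the fused sequence keeps the text length, which matches the abstract's framing of text as the primary modality that the other modalities assist and guide.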

Key words: intent recognition, multimodal fusion, cross-attention mechanism, self-attention mechanism, text-guided interaction
