[1] BALTRUSAITIS T, AHUJA C, MORENCY L P. Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(2): 423-443. doi: 10.1109/TPAMI.2018.2798607
[2] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the Advances in Neural Information Processing Systems. Cambridge, USA: MIT Press, 2017: 30-38.
[3] LIU J W, LIU J W, LUO X L. Research progress of attention mechanism in deep learning. Chinese Journal of Engineering, 2021, 43(11): 1499-1511.
[4] HELMI SETYAWAN M Y, AWANGGA R M, EFENDI S R. Comparison of multinomial naive Bayes algorithm and logistic regression for intent classification in chatbot[C]//Proceedings of the International Conference on Applied Engineering. Washington D. C., USA: IEEE Press, 2018: 1-5.
[5] LIU J, LI Y L, LIN M. Summary of intention recognition methods in man-machine dialogue system. Computer Engineering and Applications, 2019, 55(12): 1-7.
[6]
[7] WANG J X, WEI K, RADFAR M, et al. Encoding syntactic knowledge in transformer encoder for intent detection and slot filling. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(16): 13943-13951.
[8] LIU X K, LI J Q, MU J J, et al. Effective open intent classification with K-center contrastive learning and adjustable decision boundary. Proceedings of the AAAI Conference on Artificial Intelligence, 2023, 37(11): 13291-13299.
[9] CASANUEVA I, TEMČINAS T, GERZ D, et al. Efficient intent detection with dual sentence encoders[C]//Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI. Stroudsburg, USA: ACL Press, 2020: 38-45.
[10] HUANG Y, DU C, XUE Z, et al. What makes multi-modal learning better than single (provably)[C]//Proceedings of the Advances in Neural Information Processing Systems. Cambridge, USA: MIT Press, 2021, 34: 10944-10956.
[11] CHENG D L, ZHANG D W, CHEN Y Q. A summary of multimodal emotion recognition. Journal of Southwest Minzu University (Natural Science Edition), 2022, 48(4): 440-447.
[12] HASAN M K, LEE S W, RAHMAN W, et al. Humor knowledge enriched transformer for understanding multimodal humor. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(14): 12972-12980.
[13] ZHANG H L, XU H, WANG X, et al. MIntRec: a new dataset for multimodal intent recognition[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York, USA: ACM Press, 2022: 1688-1697.
[14] ZHAN L M, LIANG H, LIU B, et al. Out-of-scope intent detection with self-supervision and discriminative training[EB/OL]. [2024-04-30]. https://arxiv.org/pdf/2106.08616.
[15] ZHOU Y H, LIU P J, QIU X P. KNN-contrastive learning for out-of-domain intent classification[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, USA: ACL Press, 2022: 5129-5141.
[16] GANDHI A, ADHVARYU K, PORIA S, et al. Multimodal sentiment analysis: a systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Information Fusion, 2023, 91: 424-444. doi: 10.1016/j.inffus.2022.09.025
[17] HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York, USA: ACM Press, 2020: 1122-1131.
[18]
[19] LIU Z, SHEN Y, LAKSHMINARASIMHAN V, et al. Efficient low-rank multimodal fusion with modality-specific factors[EB/OL]. [2024-04-30]. https://arxiv.org/pdf/1806.00064.
[20] TSAI Y H, BAI S J, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, USA: ACL Press, 2019: 6558.
[21] RAHMAN W, HASAN M K, LEE S W, et al. Integrating multimodal information in large pretrained transformers[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, USA: ACL Press, 2020: 2359.
[22] HAN W, CHEN H, PORIA S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis[EB/OL]. [2024-04-30]. https://arxiv.org/pdf/2109.00412.
[23] YU W M, XU H, YUAN Z Q, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(12): 10790-10797.
[24] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. [2024-04-30]. https://arxiv.org/pdf/1810.04805.
[25] LIU Z, LIN Y T, CAO Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2021: 10012-10022.
[26] CHEN S Y, WANG C Y, CHEN Z Y, et al. WavLM: large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1505-1518. doi: 10.1109/JSTSP.2022.3188113
[27] ZHANG H L, WANG X, XU H, et al. MIntRec2.0: a large-scale benchmark dataset for multimodal intent recognition and out-of-scope detection in conversations[EB/OL]. [2024-04-30]. https://arxiv.org/pdf/2403.10943.
[28] PORIA S, HAZARIKA D, MAJUMDER N, et al. MELD: a multimodal multi-party dataset for emotion recognition in conversations[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, USA: ACL Press, 2019: 527-536.