
计算机工程



Multimodal Intent Recognition Based on Attention Modality Fusion

  • Published: 2024-11-14

Abstract: Intent recognition is an important task in natural language understanding. Previous research has focused mainly on unimodal intent recognition for specific tasks. In real-world scenarios, however, human intentions are complex and must be inferred by integrating information such as language, tone of voice, facial expressions, and actions. An attention-based multimodal fusion method is therefore proposed for intent recognition in real-world multimodal scenarios. To capture and fuse long-range dependencies across modalities, adaptively adjust the importance of each modality's information, and obtain richer representations, a self-attention mechanism is applied separately to the features of each modality. An explicit modality identifier added to each modality's features lets the model distinguish and effectively fuse information from different modalities, improving overall understanding and decision making. Given the importance of textual information in cross-modal interaction, multimodal fusion built around a cross-attention mechanism is employed, with text as the dominant modality and the other modalities assisting and guiding the interaction; this design promotes interaction between the textual, visual, and auditory modalities. Finally, the method is evaluated on the MIntRec and MIntRec2.0 benchmark datasets for multimodal intent recognition. The results show that the model outperforms existing multimodal learning methods in accuracy, precision, recall, and F1 score, improving on the strongest current baseline by 0.1% to 0.5%.
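
The per-modality encoding described in the abstract, self-attention applied to each modality's features plus an explicit modality identifier, can be pictured with the minimal PyTorch sketch below. The class name IntraModalEncoder, the 768-dimensional features, and the use of a learned modality-type embedding with nn.TransformerEncoderLayer are illustrative assumptions for the sketch, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class IntraModalEncoder(nn.Module):
    """Self-attention over one modality's feature sequence, with an explicit
    modality identifier added as a learned embedding (similar to segment/type
    embeddings), so downstream fusion can tell the modalities apart."""

    def __init__(self, dim: int, num_modalities: int, modality_id: int, num_heads: int = 8):
        super().__init__()
        self.modality_id = modality_id
        self.modality_embed = nn.Embedding(num_modalities, dim)  # one identifier vector per modality
        self.self_attn = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) features of a single modality
        ids = torch.full(x.shape[:2], self.modality_id, dtype=torch.long, device=x.device)
        x = x + self.modality_embed(ids)   # tag every time step with its modality identity
        return self.self_attn(x)           # self-attention captures long-range dependencies

# Separate encoders for the text, visual, and acoustic streams (dimensions are illustrative).
text_enc  = IntraModalEncoder(dim=768, num_modalities=3, modality_id=0)
video_enc = IntraModalEncoder(dim=768, num_modalities=3, modality_id=1)
audio_enc = IntraModalEncoder(dim=768, num_modalities=3, modality_id=2)

text  = text_enc(torch.randn(4, 30, 768))    # e.g. token-level text features
video = video_enc(torch.randn(4, 50, 768))   # projected frame-level visual features
audio = audio_enc(torch.randn(4, 400, 768))  # projected acoustic features
```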
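
The text-dominant fusion step, in which text queries attend to the visual and acoustic streams through cross-attention, might look roughly as follows. The module name, the sequential text-video then text-audio fusion order, the mean pooling, and the linear classifier head are assumptions made for this sketch rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class TextGuidedCrossAttention(nn.Module):
    """Cross-attention in which text features act as queries and another
    modality supplies keys/values, so the text stream is enriched by
    visual or acoustic evidence while remaining the dominant signal."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # text:  (batch, L_text, dim)  queries (text-dominant)
        # other: (batch, L_other, dim) keys and values (video or audio)
        attended, _ = self.cross_attn(query=text, key=other, value=other)
        return self.norm(text + attended)  # residual keeps the original text signal

# Dummy encoded features standing in for the outputs of the per-modality encoders above.
text  = torch.randn(4, 30, 768)
video = torch.randn(4, 50, 768)
audio = torch.randn(4, 400, 768)

fuse_with_video = TextGuidedCrossAttention()
fuse_with_audio = TextGuidedCrossAttention()
classifier = nn.Linear(768, 20)            # illustrative number of intent classes

fused  = fuse_with_audio(fuse_with_video(text, video), audio)  # text attends to video, then to audio
logits = classifier(fused.mean(dim=1))     # pool over the sequence and predict the intent
```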