[1] He K M, Zhang X Y, Ren S Q, et al. Deep residual
learning for image recognition [C]//Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition. Las Vegas, NV, USA: IEEE Computer
Society, 2016: 770-778.
[2] Hershey S, Chaudhuri S, Ellis D P W, et al. CNN
architectures for large-scale audio classification [C]//2017
IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). New Orleans, LA, USA:
IEEE, 2017: 131-135.
[3] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of
deep bidirectional transformers for language
understanding [C]//Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers).
Minneapolis, MN, USA: Association for Computational
Linguistics, 2019: 4171-4186.
[4] Li L H, Yatskar M, Yin D, et al. VisualBERT: A simple and
performant baseline for vision and language:
arXiv:1908.03557 [OL]. 2019 [2025-06-10].
https://arxiv.org/abs/1908.03557.
[5] Liu Z, Shen Y, Lakshminarasimhan V B, et al. Efficient
low-rank multimodal fusion with modality-specific factors:
arXiv:1806.00064 [OL]. 2018 [2025-06-10].
https://arxiv.org/abs/1806.00064.
[6] Zadeh A, Chen M, Poria S, et al. Tensor fusion network
for multimodal sentiment analysis: arXiv:1707.07250
[OL]. 2017 [2025-06-10].
https://arxiv.org/abs/1707.07250.
[7] Arevalo J, Solorio T, Montes-y-Gómez M, et al. Gated
multimodal units for information fusion: arXiv:1702.01992
[OL]. 2017 [2025-06-10].
https://arxiv.org/abs/1702.01992.
[8] Vaswani A, Shazeer N, Parmar N, et al. Attention is all
you need [C]//Advances in Neural Information Processing
Systems. Long Beach, CA, USA: Curran Associates, Inc.,
2017: 6000-6010.
[9] Wang Y S, Li X, Morency L P. Words can shift:
Dynamically adjusting word representations using
nonverbal behaviors [C]//Proceedings of the AAAI
Conference on Artificial Intelligence. Honolulu, HI, USA:
AAAI Press, 2019: 7216-7223.
[10] Pham H, Tran T, Huynh T, et al. Found in translation:
Learning robust joint representations by cyclic translations
between modalities [C]//Proceedings of the AAAI
Conference on Artificial Intelligence. Honolulu, HI, USA:
AAAI Press, 2019: 6892-6899.
[11] Chen C X, Zhang P Y. Modality-collaborative transformer
with hybrid feature reconstruction for robust emotion
recognition [J]. ACM Transactions on Multimedia
Computing, Communications and Applications, 2024,
20(5): 1-23.
[12] Gu A, Dao T. Mamba: Linear-time sequence modeling
with selective state spaces: arXiv:2312.00752 [OL].
2023 [2025-06-10]. https://arxiv.org/abs/2312.00752.
[13] Tsai Y H H, Bai S, Liang P P, et al. Multimodal
transformer for unaligned multimodal language sequences
[C]//Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics. Florence, Italy:
Association for Computational Linguistics, 2019:
6558-6569.
[14] Hazarika D, Zimmermann R, Poria S. MISA:
Modality-invariant and -specific representations for
multimodal sentiment analysis [C]//Proceedings of the
28th ACM International Conference on Multimedia. New
York, USA: Association for Computing Machinery, 2020:
1122-1131.
[15] Chauhan D S, Goel A, Bhattacharyya P. Context-aware
interactive attention for multi-modal sentiment and
emotion analysis [C]//Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP). Hong Kong,
China: Association for Computational Linguistics, 2019:
5647-5657.
[16] Wang P, Zhou Q, Wu Y, et al. DLF:
Disentangled-language-focused multimodal sentiment
analysis [C]//Proceedings of the AAAI Conference on
Artificial Intelligence. Washington, DC, USA: AAAI
Press, 2025: 21180-21188.
[17] 彭李湘松, 张著洪. 基于三角形特征融合与感知注意力
的方面级情感分析 [J/OL]. 计算机工程, 2025: 1-10
[2025-06-10].
https://doi.org/10.19678/j.issn.1000-3428.0070397.
Peng Lixiangsong, Zhang Zhuhong. Aspect-level
sentiment analysis based on triangular feature fusion and
perceptual attention [J/OL]. Computer Engineering, 2025:
1-10 [2025-06-10].
https://doi.org/10.19678/j.issn.1000-3428.0070397.
[18] Sun Z K, Zhang Y, Wang Z, et al. Learning relationships
between text, audio, and video via deep canonical
correlation for multimodal language analysis
[C]//Proceedings of the AAAI Conference on Artificial
Intelligence. New York, USA: AAAI Press, 2020:
8992-8999.
[19] Rahman W, Hasan M K, Lee S, et al. Integrating
multimodal information in large pretrained transformers
[C]//Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics. Online:
Association for Computational Linguistics, 2020:
2359-2369.
[20] 孙明龙, 欧阳纯萍, 刘永彬, 等. 基于分层融合策略和
上下文信息嵌入的多模态情绪识别 [J]. 北京大学学报
(自然科学版), 2024, 60(3): 393-402.
DOI: 10.13209/j.0479-8023.2024.034.
Sun Minglong, Ouyang Chunping, Liu Yongbin, et al.
Multimodal emotion recognition based on hierarchical
fusion strategy and contextual information embedding [J].
Journal of Peking University (Natural Science Edition),
2024, 60(3): 393-402.
DOI: 10.13209/j.0479-8023.2024.034.
[21] 陈巧红, 孙佳锦, 漏杨波, 等. 基于多任务学习与层叠
Transformer 的多模态情感分析模型 [J]. 浙江大学学报
(工学版), 2023, 57(12): 2421-2429.
Chen Qiaohong, Sun Jiajin, Lou Yangbo, et al.
Multimodal sentiment analysis model based on multitask
learning and stacked Transformer [J]. Journal of Zhejiang
University (Engineering Science Edition), 2023, 57(12):
2421-2429.
[22] Zhang Y, Zhong H, Chen G, et al. Multimodal sentiment
analysis network based on distributional transformation
and gated cross-modal fusion [C]//2024 International
Conference on Networking and Network Applications
(NaNA). New York, USA: IEEE, 2024: 496-503.
[23] Liu S, Luo Z, Fu W. Fcdnet: Fuzzy cognition-based
dynamic fusion network for multimodal sentiment
analysis [J]. IEEE Transactions on Fuzzy Systems, 2024,
33(1): 3-14.
[24] Yang L. A dynamic weighted fusion model for
multimodal sentiment analysis [J]. Signal, Image and Video
Processing, 2025, 19(8): 609.
[25] Wu T, Peng J, Zhang W, et al. Video sentiment analysis
with bimodal information-augmented multi-head attention
[J]. Knowledge-Based Systems, 2022, 235: 107676.
[26] Liu Z, Shen Y, Lakshminarasimhan V B, et al. Efficient
low-rank multimodal fusion with modality-specific factors
[C]//Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1:
Long Papers). Melbourne, Australia: Association for
Computational Linguistics, 2018: 2247-2256.
[27] Huang J, Zhou J, Tang Z, et al. TMBL: Transformer-based
multimodal binding learning model for multimodal
sentiment analysis [J]. Knowledge-Based Systems, 2024,
285: 111346.
[28] Li M, Zhu Z, Li K, et al. Joint training strategy of
unimodal and multimodal for multimodal sentiment
analysis [J]. Image and Vision Computing, 2024, 149:
105172.
[29] Zadeh A, Liang P P, Mazumder N, et al. Memory fusion
network for multi-view sequential learning
[C]//Proceedings of the AAAI Conference on Artificial
Intelligence. New Orleans, LA, USA: AAAI Press, 2018,
32(1).
[30] Wang D, Guo X, Tian Y, et al. TETFN: a text enhanced
transformer fusion network for multimodal sentiment
analysis [J]. Pattern Recognition, 2023, 136: 109259.
[31] Yu W, Xu H, Yuan Z, et al. Learning modality-specific
representations with self-supervised multi-task learning
for multimodal sentiment analysis [C]//Proceedings of the
AAAI Conference on Artificial Intelligence. New York,
USA: AAAI Press, 2021, 35(12): 10790-10797.
[32] Lin R, Hu H. Multimodal contrastive learning via
unimodal coding and cross-modal prediction for
multimodal sentiment analysis [C]//Findings of the
Association for Computational Linguistics: EMNLP 2022.
Online: Association for Computational Linguistics, 2022:
522-523.
[33] Wang P, Zhou Q, Wu Y, et al. DLF:
Disentangled-language-focused multimodal sentiment
analysis [C]//Proceedings of the AAAI Conference on
Artificial Intelligence. Washington, DC, USA: AAAI
Press, 2025: 21180-21188.
[34] Wang L, Peng J, Zheng C, et al. A cross modal
hierarchical fusion multimodal sentiment analysis method
based on multi-task learning [J]. Information Processing &
Management, 2024, 61(3): 103675.
[35] Tsai Y H H, Bai S, Liang P P, et al. Multimodal
transformer for unaligned multimodal language sequences
[C]//Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics. Florence, Italy:
Association for Computational Linguistics, 2019:
6558-6569.