Computer Engineering ›› 2026, Vol. 52 ›› Issue (3): 255-263. doi: 10.19678/j.issn.1000-3428.0070114

• Multimodal Information Fusion •

Research on Cyberbullying Detection Based on Multimodal Spatial Feature Fusion

CHEN Guolian1, FENG Ziyang2, CAO Junkuo1,3,*

  1. Key Laboratory of Data Science and Smart Education, Ministry of Education, Haikou 571158, Hainan, China
    2. School of Information Science and Technology, Hainan Normal University, Haikou 571158, Hainan, China
    3. Information Network and Data Center, Hainan Normal University, Haikou 571158, Hainan, China
  • Received: 2024-07-12 Revised: 2024-08-30 Online: 2026-03-15 Published: 2024-12-03
  • Contact: CAO Junkuo

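  • About the authors:

    CHEN Guolian (陈国莲), female, senior engineer, M.S.; main research interest: statistical machine learning

    FENG Ziyang (冯梓洋), M.S. candidate

    CAO Junkuo (曹均阔, corresponding author), research fellow, Ph.D.

  • Supported by:
    Hainan Provincial Natural Science Foundation (625MS081); Haikou Science and Technology Special Project (2025-008); Education and Teaching Reform Research Project of Higher Education Institutions in Hainan Province (Hnjg2024ZD-19); National Natural Science Foundation of China (61867001); National Natural Science Foundation of China (61363032)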

Abstract:

To spread faster and more widely, cyberbullying content posted on social media platforms often combines multimodal information such as text, voice, and images. Multimodal information expresses the emotions of the publisher more fully, and it also gives researchers multidimensional information sources for automatic cyberbullying detection. Current multimodal cyberbullying detection models focus primarily on complex fusion over large-scale interaction spaces and lack an analysis of the potential commonalities and differences between modalities. As a result, multimodal cyberbullying detection models based on simple feature fusion do not achieve ideal performance, and their training is time-consuming and difficult to converge. To address this issue, this study proposes a multimodal detection model based on spatial features. First, features are extracted from each single modality; then shared and specific feature spaces are constructed, and the features are fused using a hierarchical attention mechanism based on the Hadamard product. Rather than simply weighting features by the output attention scores, the fusion process independently reassigns attention weights, so that the modalities do not interfere with each other and the feature integrity of the shared and specific spaces is preserved. Finally, a two-layer perceptron is used to detect cyberbullying speech. Results show that the model achieves good detection performance and convergence on both the CMCAD and CMU-MOSI datasets.

Key words: cyberbullying detection, multimodal learning, multimodal feature fusion, hierarchical attention mechanism, two-layer perceptron
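
The pipeline outlined in the abstract (shared and specific feature spaces, Hadamard-product attention applied hierarchically, then a two-layer perceptron) can be illustrated with a small sketch. The abstract gives no equations, so everything here is an assumption rather than the authors' formulation: the per-dimension softmax over element-wise products, the query vectors (`q_shared`, `q_specific`, `q_top`), the dimensions, the random untrained weights, and all function names are hypothetical and chosen only for illustration. Unimodal features are assumed to be already extracted and to share one length.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rand_vector(size):
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hadamard_attention(vectors, query):
    """Fuse same-length vectors dimension by dimension.

    Assumed scoring rule: for each dimension d, the weight of vector m is
    softmax over m of (vectors[m][d] * query[d]), i.e. an element-wise product.
    """
    fused, weights = [], []
    for d in range(len(query)):
        w = softmax([v[d] * query[d] for v in vectors])
        weights.append(w)
        fused.append(sum(wi * v[d] for wi, v in zip(w, vectors)))
    return fused, weights

D_IN, D_HID = 4, 3                        # hypothetical feature sizes
MODALITIES = ("text", "audio", "image")

W_shared = rand_matrix(D_HID, D_IN)       # one projection shared by all modalities
W_specific = {m: rand_matrix(D_HID, D_IN) for m in MODALITIES}  # one per modality
q_shared, q_specific, q_top = (rand_vector(D_HID) for _ in range(3))
W1, b1 = rand_matrix(D_HID, D_HID), [0.0] * D_HID   # perceptron layer 1
w2, b2 = rand_vector(D_HID), 0.0                    # perceptron layer 2

def predict(features):
    """Return a bullying probability for one sample (untrained weights)."""
    shared = [matvec(W_shared, features[m]) for m in MODALITIES]
    specific = [matvec(W_specific[m], features[m]) for m in MODALITIES]
    # Level 1: weights are assigned independently inside each space.
    h_shared, _ = hadamard_attention(shared, q_shared)
    h_specific, _ = hadamard_attention(specific, q_specific)
    # Level 2: attention across the two fused space representations.
    h, _ = hadamard_attention([h_shared, h_specific], q_top)
    hidden = [max(0.0, z + b) for z, b in zip(matvec(W1, h), b1)]  # ReLU
    logit = sum(w * v for w, v in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid

features = {
    "text": [0.2, -0.1, 0.4, 0.0],
    "audio": [0.0, 0.3, -0.2, 0.1],
    "image": [0.5, 0.1, 0.0, -0.3],
}
print(predict(features))
```

Computing the attention weights separately for the shared and the specific space is one way to read the abstract's statement that the weights are reassigned independently so the modalities do not interfere with each other; a trained model would learn the projections and queries instead of sampling them.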
